As Erangi said, what we currently have in our search engines is merely syntactic search, where results are found by keyword matching. The concept of the Semantic Web extends this existing syntactic search into semantic search. For that to happen, the entities, or semantic metadata, of the existing web documents must first be identified. Currently there is no effective or optimal method to extract such entities from document repositories. The intention of the XemanticA framework is therefore to implement an efficient, optimal and accurate methodology to extract text entities from documents and store them.
Our XemanticA framework has several modules, namely the Text Convertor, the Specific Extractor, the Generic Extractor, the Comparator and the Semantic Server.
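To give a feel for how these modules hand data to one another, here is a rough sketch of the pipeline as Java interfaces. The interface and method names are my own illustration, not XemanticA's actual classes:

```java
import java.util.Arrays;
import java.util.List;

// A rough sketch of the module boundaries; names are illustrative only.
public class XemanticaPipeline {

    interface TextConvertor {            // document bytes -> plain text
        String toText(byte[] document);
    }

    interface EntityExtractor {          // plain text -> candidate entities
        List<String> extract(String text);
    }

    interface SemanticServer {           // persist entities as RDF
        void store(String docId, List<String> entities);
    }

    public static void main(String[] args) {
        // Stub implementations just to show the data flow end to end.
        TextConvertor convertor = bytes -> new String(bytes);
        EntityExtractor generic = text -> Arrays.asList(text.split("\\s+"));
        SemanticServer server = (id, entities) ->
                System.out.println(id + " -> " + entities);

        String text = convertor.toText("Semantic Web".getBytes());
        server.store("doc-1", generic.extract(text));
        // prints: doc-1 -> [Semantic, Web]
    }
}
```

The Comparator would sit between the two extractors and the server, merging their entity lists.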
Sajith has largely completed the Text Convertor component, which converts a given document into a plain text string. This module can convert doc, xls, ppt, docx, pptx, xlsx and pdf documents into a text stream. The Apache POI library is used to convert all document formats except pdf, which is handled by PDFBox. Stop words and other repeating characters are then removed from the text stream before it is passed on to entity extraction.
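The cleanup step after conversion can be illustrated with a small sketch of stop-word removal. The stop-word list and method names here are my own illustration, not the actual Text Convertor code:

```java
import java.util.Arrays;
import java.util.Set;
import java.util.stream.Collectors;

public class TextCleaner {

    // A tiny illustrative stop-word list; the real component would use a fuller one.
    private static final Set<String> STOP_WORDS =
            Set.of("a", "an", "the", "is", "are", "of", "and", "to", "in");

    // Lower-case the stream, strip non-word characters, drop stop words.
    public static String removeStopWords(String text) {
        return Arrays.stream(text.toLowerCase().split("\\s+"))
                .map(w -> w.replaceAll("[^a-z0-9]", ""))
                .filter(w -> !w.isEmpty() && !STOP_WORDS.contains(w))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(removeStopWords(
                "The Semantic Web is an extension of the current Web."));
        // prints: semantic web extension current web
    }
}
```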
The Specific Extractor is the heart of XemanticA, and it is where we have had to do a good deal of research to select a suitable algorithm for automatic text extraction. Latent Semantic Analysis (LSA) is being used here, and it is still at the research stage. More details about this component will follow in future posts.
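LSA starts from a term-document matrix whose entries count how often each term occurs in each document; the matrix is then reduced with an SVD to expose latent topics. A minimal sketch of that first step, plus the cosine similarity LSA ultimately computes between documents, in plain Java (method names are illustrative, and the SVD step itself is omitted since it needs a linear-algebra library):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

public class TermDocumentMatrix {

    // Build a term-document count matrix: rows = sorted terms, columns = documents.
    public static int[][] build(List<String> documents, List<String> vocabularyOut) {
        SortedSet<String> vocab = new TreeSet<>();
        for (String doc : documents) {
            vocab.addAll(Arrays.asList(doc.toLowerCase().split("\\s+")));
        }
        vocabularyOut.addAll(vocab);
        int[][] matrix = new int[vocab.size()][documents.size()];
        for (int d = 0; d < documents.size(); d++) {
            for (String term : documents.get(d).toLowerCase().split("\\s+")) {
                matrix[vocabularyOut.indexOf(term)][d]++;
            }
        }
        return matrix;
    }

    // Cosine similarity between two document columns of the matrix.
    public static double cosine(int[][] m, int docA, int docB) {
        double dot = 0, na = 0, nb = 0;
        for (int[] row : m) {
            dot += row[docA] * row[docB];
            na += row[docA] * row[docA];
            nb += row[docB] * row[docB];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        List<String> vocab = new ArrayList<>();
        int[][] m = build(Arrays.asList(
                "semantic web search", "semantic web entities", "keyword search"), vocab);
        System.out.printf("%.2f%n", cosine(m, 0, 1));
        // prints: 0.67  (documents 0 and 1 share two of their three terms)
    }
}
```

In full LSA the raw counts would be weighted (e.g. tf-idf) and the matrix decomposed, so that documents sharing no exact terms can still come out similar.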
I started work on the Generic Extractor, where a given text document should be checked against DBpedia, the online RDF entity data set, to identify the document's entities. Although we thought it would be possible to check any kind of document against DBpedia, it was only when I began the actual implementation that I realized the DBpedia data set can only be matched against Wikipedia articles, not against an arbitrary externally supplied article. We therefore turned to an existing framework that exposes the DBpedia data set through a Web Service API. The basic functionality of the Generic Extractor is now also more or less complete.
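In practice the Web Service approach boils down to sending the document text to an annotation endpoint and reading back entity URIs. Here is a hedged sketch of building such a request; the endpoint URL and the `confidence` parameter are assumptions modeled on DBpedia Spotlight-style services, not necessarily the exact API we used:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class AnnotationRequest {

    // Build the query URL for a DBpedia-based annotation service.
    // Endpoint and parameter names are illustrative (modeled on services
    // like DBpedia Spotlight's /rest/annotate), not a confirmed API.
    public static String buildUrl(String endpoint, String text, double confidence)
            throws UnsupportedEncodingException {
        return endpoint
                + "?text=" + URLEncoder.encode(text, "UTF-8")
                + "&confidence=" + confidence;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(buildUrl("http://spotlight.example.org/rest/annotate",
                "Colombo is the capital of Sri Lanka", 0.5));
        // prints: http://spotlight.example.org/rest/annotate?text=Colombo+is+the+capital+of+Sri+Lanka&confidence=0.5
    }
}
```

The response would then be parsed for the DBpedia resource URIs, which become the document's generic entities.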
Now we have to focus on the Semantic Server, which should be capable of storing the RDF data entities in the Sesame server.
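Whatever API ends up in front of Sesame, the extracted entities will be stored as subject-predicate-object triples. A minimal sketch of serializing an entity to the N-Triples format, which Sesame can load directly (the URIs here are made up for illustration and are not XemanticA's actual vocabulary):

```java
public class NTriplesWriter {

    // Escape the characters N-Triples requires escaping inside literals.
    private static String escapeLiteral(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // One triple per line: <subject> <predicate> "object" .
    public static String triple(String subjectUri, String predicateUri, String literal) {
        return "<" + subjectUri + "> <" + predicateUri + "> \""
                + escapeLiteral(literal) + "\" .";
    }

    public static void main(String[] args) {
        // Illustrative URIs only.
        System.out.println(triple(
                "http://example.org/doc/42",
                "http://example.org/vocab#hasEntity",
                "Sri Lanka"));
        // prints: <http://example.org/doc/42> <http://example.org/vocab#hasEntity> "Sri Lanka" .
    }
}
```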
