As Erangi said, what we currently have in our search engines is merely syntactic search, where results are found by keyword matching. The concept of the Semantic Web extends this existing syntactic search into semantic search. For that to happen, the entities, or semantic metadata, of the existing web documents must first be identified. Currently there is no effective or optimal method to extract such entities from document repositories. The intention of the XemanticA framework is therefore to implement an efficient, optimal and accurate methodology to extract text entities from documents and store them.
Our XemanticA framework has several modules, namely the Text Convertor, the Specific Extractor, the Generic Extractor, the Comparator and the Semantic Server.
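To give a feel for how these modules hand data to one another, here is a rough sketch of the pipeline as Java interfaces. The interface and method names are my own illustration, not XemanticA's actual classes:

```java
import java.util.Arrays;
import java.util.List;

// A rough sketch of the module boundaries; names are illustrative only.
public class XemanticaPipeline {

    interface TextConvertor {            // document bytes -> plain text
        String toText(byte[] document);
    }

    interface EntityExtractor {          // plain text -> candidate entities
        List<String> extract(String text);
    }

    interface SemanticServer {           // persist entities as RDF
        void store(String docId, List<String> entities);
    }

    public static void main(String[] args) {
        // Stub implementations just to show the data flow end to end.
        TextConvertor convertor = bytes -> new String(bytes);
        EntityExtractor generic = text -> Arrays.asList(text.split("\\s+"));
        SemanticServer server = (id, entities) ->
                System.out.println(id + " -> " + entities);

        String text = convertor.toText("Semantic Web".getBytes());
        server.store("doc-1", generic.extract(text));
        // prints: doc-1 -> [Semantic, Web]
    }
}
```

The Comparator would sit between the two extractors and the server, merging their entity lists.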
Sajith has largely completed the Text Convertor component, which converts a given document into a plain text string. This module can convert doc, xls, ppt, docx, pptx, xlsx and pdf documents into a text stream. The Apache POI library is used to convert all document formats except pdf, which is handled by PDFBox. Stop words and other repeating characters are then removed from the text stream before it is passed on to entity extraction.
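The cleanup step after conversion can be illustrated with a small sketch of stop-word removal. The stop-word list and method names here are my own illustration, not the actual Text Convertor code:

```java
import java.util.Arrays;
import java.util.Set;
import java.util.stream.Collectors;

public class TextCleaner {

    // A tiny illustrative stop-word list; the real component would use a fuller one.
    private static final Set<String> STOP_WORDS =
            Set.of("a", "an", "the", "is", "are", "of", "and", "to", "in");

    // Lower-case the stream, strip non-word characters, drop stop words.
    public static String removeStopWords(String text) {
        return Arrays.stream(text.toLowerCase().split("\\s+"))
                .map(w -> w.replaceAll("[^a-z0-9]", ""))
                .filter(w -> !w.isEmpty() && !STOP_WORDS.contains(w))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(removeStopWords(
                "The Semantic Web is an extension of the current Web."));
        // prints: semantic web extension current web
    }
}
```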
The Specific Extractor is the heart of XemanticA, and it is where we have had to do a good deal of research to select a suitable algorithm for automatic text extraction. Latent Semantic Analysis (LSA) is being used here, and it is still at the research stage. More details about this component will follow in future posts.
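LSA starts from a term-document matrix whose entries count how often each term occurs in each document; the matrix is then reduced with an SVD to expose latent topics. A minimal sketch of that first step, plus the cosine similarity LSA ultimately computes between documents, in plain Java (method names are illustrative, and the SVD step itself is omitted since it needs a linear-algebra library):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

public class TermDocumentMatrix {

    // Build a term-document count matrix: rows = sorted terms, columns = documents.
    public static int[][] build(List<String> documents, List<String> vocabularyOut) {
        SortedSet<String> vocab = new TreeSet<>();
        for (String doc : documents) {
            vocab.addAll(Arrays.asList(doc.toLowerCase().split("\\s+")));
        }
        vocabularyOut.addAll(vocab);
        int[][] matrix = new int[vocab.size()][documents.size()];
        for (int d = 0; d < documents.size(); d++) {
            for (String term : documents.get(d).toLowerCase().split("\\s+")) {
                matrix[vocabularyOut.indexOf(term)][d]++;
            }
        }
        return matrix;
    }

    // Cosine similarity between two document columns of the matrix.
    public static double cosine(int[][] m, int docA, int docB) {
        double dot = 0, na = 0, nb = 0;
        for (int[] row : m) {
            dot += row[docA] * row[docB];
            na += row[docA] * row[docA];
            nb += row[docB] * row[docB];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        List<String> vocab = new ArrayList<>();
        int[][] m = build(Arrays.asList(
                "semantic web search", "semantic web entities", "keyword search"), vocab);
        System.out.printf("%.2f%n", cosine(m, 0, 1));
        // prints: 0.67  (documents 0 and 1 share two of their three terms)
    }
}
```

In full LSA the raw counts would be weighted (e.g. tf-idf) and the matrix decomposed, so that documents sharing no exact terms can still come out similar.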
I started work on the Generic Extractor, where a given text document should be checked against DBpedia, the online RDF entity data set, to identify the document's entities. Although we thought it would be possible to check any kind of document against DBpedia, it was only when I began the actual implementation that I realized the DBpedia data set can only be matched against Wikipedia articles, not against an arbitrary externally supplied article. We therefore turned to an existing framework that exposes the DBpedia data set through a Web Service API. The basic functionality of the Generic Extractor is now also more or less complete.
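In practice the Web Service approach boils down to sending the document text to an annotation endpoint and reading back entity URIs. Here is a hedged sketch of building such a request; the endpoint URL and the `confidence` parameter are assumptions modeled on DBpedia Spotlight-style services, not necessarily the exact API we used:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class AnnotationRequest {

    // Build the query URL for a DBpedia-based annotation service.
    // Endpoint and parameter names are illustrative (modeled on services
    // like DBpedia Spotlight's /rest/annotate), not a confirmed API.
    public static String buildUrl(String endpoint, String text, double confidence)
            throws UnsupportedEncodingException {
        return endpoint
                + "?text=" + URLEncoder.encode(text, "UTF-8")
                + "&confidence=" + confidence;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(buildUrl("http://spotlight.example.org/rest/annotate",
                "Colombo is the capital of Sri Lanka", 0.5));
        // prints: http://spotlight.example.org/rest/annotate?text=Colombo+is+the+capital+of+Sri+Lanka&confidence=0.5
    }
}
```

The response would then be parsed for the DBpedia resource URIs, which become the document's generic entities.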
Now we have to focus on the Semantic Server, which should be capable of storing the RDF data entities in the Sesame server.
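Whatever API ends up in front of Sesame, the extracted entities will be stored as subject-predicate-object triples. A minimal sketch of serializing an entity to the N-Triples format, which Sesame can load directly (the URIs here are made up for illustration and are not XemanticA's actual vocabulary):

```java
public class NTriplesWriter {

    // Escape the characters N-Triples requires escaping inside literals.
    private static String escapeLiteral(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // One triple per line: <subject> <predicate> "object" .
    public static String triple(String subjectUri, String predicateUri, String literal) {
        return "<" + subjectUri + "> <" + predicateUri + "> \""
                + escapeLiteral(literal) + "\" .";
    }

    public static void main(String[] args) {
        // Illustrative URIs only.
        System.out.println(triple(
                "http://example.org/doc/42",
                "http://example.org/vocab#hasEntity",
                "Sri Lanka"));
        // prints: <http://example.org/doc/42> <http://example.org/vocab#hasEntity> "Sri Lanka" .
    }
}
```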
