Knowledge-Driven Information Extraction from the Web

This project focuses on extracting structured consumer electronics information from Web documents. The vision is to classify a Web page as containing information about a consumer electronics device, identify all of the classes of information on that page, disambiguate them and match them with the most suitable concepts in a predefined, accepted ontology, identify the properties of those concepts within the document, and extract property values using attribute-value analysis. The extracted information will be time- and location-sensitive and will serve as the knowledge layer that supports information aggregation and reasoning in an electronic commerce aggregator Web site, the target business domain of SideBuy. The following objectives are central to our research project:

1)    Developing link prediction and Web page classification technologies that identify the relevance of each Web page's content to a coherent subset of an ontology and measure its suitability for information extraction;

2)    Devising appropriate Web page segmentation techniques that dynamically identify coherent chunks of textual information within the content of the Web page and segment the text into non-overlapping subsets, each corresponding to a concept;

3)    Formulating automatic mapping technology that identifies the most relevant ontological concept for a given chunk of text and draws correspondences between the textual descriptions and the concept properties, so that the values of the concept properties are derived from the text;

4)    Creating a maintenance framework that identifies inconsistencies between the extracted information and the ontology structure, detects conflicts between newly extracted information and information already stored in the knowledge base, and, where possible, proposes resolutions based on context-dependent strategies.
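
To make objectives 3 and 4 more concrete, the following is a minimal Python sketch of how a text chunk might be mapped to an ontology concept, how property values might be pulled out of it, and how a conflict with stored information might be detected. It is an illustration under stated assumptions, not the project's method: the two-concept ontology, the keyword sets, the regular expressions, and the function names (best_concept, extract_properties, find_conflicts) are all hypothetical, and simple keyword overlap and pattern matching stand in for the disambiguation and attribute-value analysis described above.

```python
import re

# Hypothetical toy ontology: concept -> {property name: pattern capturing its value}.
# A real ontology would be far richer; these entries are illustrative only.
ONTOLOGY = {
    "Television": {
        "screen_size_in": re.compile(r"(\d+(?:\.\d+)?)\s*inch", re.I),
        "resolution": re.compile(r"\b(4k|1080p|720p)\b", re.I),
    },
    "Camera": {
        "megapixels": re.compile(r"(\d+(?:\.\d+)?)\s*MP\b", re.I),
        "optical_zoom": re.compile(r"(\d+(?:\.\d+)?)x\s+optical zoom", re.I),
    },
}

# Hypothetical keyword sets used to score how relevant each concept is to a chunk.
KEYWORDS = {
    "Television": {"tv", "television", "led", "hdmi", "screen"},
    "Camera": {"camera", "lens", "zoom", "megapixel", "shutter"},
}


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def best_concept(chunk):
    """Return the concept with the largest keyword overlap, or None if no keyword matches."""
    tokens = tokenize(chunk)
    scores = {concept: len(tokens & kws) for concept, kws in KEYWORDS.items()}
    concept = max(scores, key=scores.get)
    return concept if scores[concept] > 0 else None


def extract_properties(chunk, concept):
    """Extract the first matching value for each property of the given concept."""
    values = {}
    for prop, pattern in ONTOLOGY[concept].items():
        match = pattern.search(chunk)
        if match:
            values[prop] = match.group(1)
    return values


def find_conflicts(stored, new):
    """Return properties whose newly extracted value differs from the stored one."""
    return {
        prop: (stored[prop], new[prop])
        for prop in new
        if prop in stored and stored[prop] != new[prop]
    }


chunk = "Samsung 55 inch LED TV with 1080p resolution and three HDMI ports"
concept = best_concept(chunk)
extracted = extract_properties(chunk, concept)
conflicts = find_conflicts({"screen_size_in": "50", "resolution": "1080p"}, extracted)
print(concept, extracted, conflicts)
```

In this sketch a conflict is only reported, not resolved; a context-dependent strategy (for example, preferring the more recent or more local source) would be applied at that point.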


Natural Sciences and Engineering Research Council of Canada