Using Structured Knowledge Sources for Domain-Specific Ontology Extraction
Original Publication Date: 2009-Oct-14
Included in the Prior Art Database: 2009-Oct-14
Disclosed is a method that extracts ontologies for a domain using a domain-specific corpus of unstructured text documents together with structured knowledge sources that are emerging on the Internet. Examples of structured knowledge sources include electronic encyclopedias (such as Wikipedia, Freebase, DBpedia), electronic dictionaries (such as WordNet, Wiktionary), and online glossaries (such as Webopedia). The extracted ontology consists of a set of domain-specific terms, each annotated with its type.
The invention extracts a domain-specific ontology from a set of structured knowledge sources and an unstructured text corpus. The method performs ontology extraction in the following steps:
1. Extract domain-specific terms from the corpus, using any standard natural language processing technique.
2. Identify the domain-specific terms that have unambiguous matches in the structured knowledge sources. Refer to these terms as "seeds".
3. Group the domain-specific seed terms found in (2) by their types.
4. Find the types that occur most frequently in the corpus.
5. Use these types to disambiguate all terms in the corpus that have multiple types in the structured knowledge sources.
The figure above describes the entire ontology extraction process.
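The five steps listed above can be illustrated with a minimal Python sketch. It is not the disclosed implementation. It assumes a structured knowledge source can be approximated by an in-memory mapping from a term to its set of candidate types, it uses plain word counting as a stand-in for the "standard natural language processing technique" of step 1, and it handles only single-word terms. The names `KNOWLEDGE_SOURCE`, `DOCUMENTS`, `extract_terms`, `extract_ontology`, and the `top_k` parameter are hypothetical, as is the sample data.

```python
import re
from collections import Counter, defaultdict

# Hypothetical stand-in for a structured knowledge source:
# each term maps to the set of types it may have.
KNOWLEDGE_SOURCE = {
    "java": {"programming_language", "island"},
    "python": {"programming_language", "snake"},
    "haskell": {"programming_language"},
    "compiler": {"software"},
    "linker": {"software"},
}

# Hypothetical domain-specific corpus.
DOCUMENTS = [
    "Java and Python programs are translated by a compiler",
    "Haskell needs a compiler and a linker",
    "Python and Haskell are often compared",
]


def extract_terms(documents):
    """Step 1 (placeholder): count lowercase word tokens in the corpus."""
    counts = Counter()
    for doc in documents:
        counts.update(re.findall(r"[a-z]+", doc.lower()))
    return counts


def extract_ontology(documents, knowledge_source, top_k=2):
    """Return a mapping from term to type following steps 1 to 5."""
    term_counts = extract_terms(documents)

    # Keep only corpus terms that the knowledge source knows about.
    candidates = {
        term: knowledge_source[term]
        for term in term_counts
        if term in knowledge_source
    }

    # Step 2: seeds are terms with exactly one type (unambiguous match).
    seeds = {
        term: next(iter(types))
        for term, types in candidates.items()
        if len(types) == 1
    }

    # Step 3: group the seed terms by type.
    seeds_by_type = defaultdict(list)
    for term, type_name in seeds.items():
        seeds_by_type[type_name].append(term)

    # Step 4: rank types by how often their seed terms occur in the corpus.
    type_frequency = Counter({
        type_name: sum(term_counts[term] for term in terms)
        for type_name, terms in seeds_by_type.items()
    })
    dominant_types = [name for name, _ in type_frequency.most_common(top_k)]

    # Step 5: give each ambiguous term the highest-ranked dominant type
    # among its candidates; terms with no dominant type stay unresolved.
    ontology = dict(seeds)
    for term, types in candidates.items():
        if len(types) > 1:
            for type_name in dominant_types:
                if type_name in types:
                    ontology[term] = type_name
                    break
    return ontology
```

Calling `extract_ontology(DOCUMENTS, KNOWLEDGE_SOURCE)` on this sample data resolves "java" and "python" to `programming_language`, because the unambiguous seeds make that type one of the two most frequent in the corpus, while the competing `island` and `snake` types have no seed support. How many top-ranked types to keep (`top_k` here) is a design choice that the text does not specify.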
The approach described is closely related to, but different from, traditional Named Entity Recognition (NER). The main difference is that NER is typically restricted to a fixed set of high-level types, whereas the described process is unrestricted in its type coverage, as it relies on external online structured knowledge sources. Also, domain-specific terms in a corpus are typically found using term frequency information, whereas the described process uses type information as an additional dimension to filter domain-specific te...