Using Structured Knowledge Sources for Domain-Specific Ontology Extraction

IP.com Disclosure Number: IPCOM000188568D
Original Publication Date: 2009-Oct-14
Included in the Prior Art Database: 2009-Oct-14
Document File: 1 page(s) / 99K

Publishing Venue

IBM

Abstract

Disclosed is a method that extracts ontologies for a domain using a domain-specific corpus of unstructured text documents and structured knowledge sources that are emerging on the Internet. Examples of structured knowledge sources include electronic encyclopedias (such as Wikipedia, Freebase, DBpedia), electronic dictionaries (such as Wordnet, Wiktionary), and online glossaries (such as Webopedia). The extracted ontology consists of a set of domain-specific terms annotated with their respective types.

At least one non-text object (such as an image or picture) has been suppressed.

The invention uses a set of structured knowledge sources and an unstructured text corpus to extract a domain-specific ontology. The method uses the following steps to perform ontology extraction:
1. Extract domain-specific terms from the corpus, using any standard natural language processing technique.
2. Identify the domain-specific terms that have unambiguous matches in the structured knowledge sources. Refer to these terms as "seeds".
3. Group the seed terms found in (2) by their types.
4. Find the types that occur most frequently in the corpus.
5. Use these types to disambiguate all terms in the corpus that have multiple types in the structured knowledge sources.
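
The five steps above can be sketched in Python as follows. This is only an illustrative sketch under several assumptions that are not part of the disclosure: the structured knowledge source is modeled as a hypothetical in-memory dictionary (KNOWLEDGE_SOURCE) mapping each term to its candidate types, step 1 is reduced to naive tokenization and counting, the frequency of a type is taken to be the summed corpus frequency of its seed terms, and the names extract_terms, extract_ontology, and top_k are invented for the example.

```python
import re
from collections import Counter

# Hypothetical stand-in for a structured knowledge source (for example an
# encyclopedia or dictionary lookup): term -> set of candidate types.
KNOWLEDGE_SOURCE = {
    "python": {"programming_language", "snake"},
    "java": {"programming_language", "island"},
    "rust": {"programming_language", "corrosion"},
    "haskell": {"programming_language"},
    "compiler": {"software"},
    "cobra": {"snake"},
}


def extract_terms(corpus):
    """Step 1 (placeholder): count tokens that the knowledge source knows."""
    counts = Counter()
    for doc in corpus:
        counts.update(re.findall(r"[a-z]+", doc.lower()))
    return Counter({t: c for t, c in counts.items() if t in KNOWLEDGE_SOURCE})


def extract_ontology(corpus, top_k=2):
    """Return a dict mapping each resolved term to a single type."""
    term_counts = extract_terms(corpus)

    # Step 2: seeds are terms with exactly one type in the knowledge source.
    seeds = {t for t in term_counts if len(KNOWLEDGE_SOURCE[t]) == 1}

    # Steps 3-4: group seeds by type, weight each type by the corpus
    # frequency of its seeds, and keep the top_k most frequent types.
    type_counts = Counter()
    for t in seeds:
        (only_type,) = KNOWLEDGE_SOURCE[t]
        type_counts[only_type] += term_counts[t]
    dominant = [ty for ty, _ in type_counts.most_common(top_k)]

    # Step 5: give each ambiguous term its best-ranked dominant type;
    # terms with no dominant candidate type are left out (assumption).
    ontology = {t: next(iter(KNOWLEDGE_SOURCE[t])) for t in seeds}
    for t in term_counts:
        if t in seeds:
            continue
        for ty in dominant:
            if ty in KNOWLEDGE_SOURCE[t]:
                ontology[t] = ty
                break
    return ontology


if __name__ == "__main__":
    corpus = [
        "The Haskell compiler checks types; Haskell is lazy.",
        "Python and Java are popular, and Rust is gaining ground.",
    ]
    for term, term_type in sorted(extract_ontology(corpus).items()):
        print(term, "->", term_type)
```

In this toy corpus the seeds "haskell" and "compiler" make "programming_language" the most frequent type, so the ambiguous terms "python", "java", and "rust" are resolved to it rather than to "snake", "island", or "corrosion". A real system would replace the dictionary with queries against the actual knowledge sources and the tokenizer with a proper term extractor.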

The figure (not reproduced in this text version) describes the entire ontology extraction process.

The approach described is closely related to, but different from, traditional Named Entity Recognition (NER). The main difference is that NER is typically restricted to a fixed set of high-level types, whereas the described process is unrestricted in its type coverage, as it relies on external online structured knowledge sources. Also, domain-specific terms in a corpus are typically found using term frequency information, whereas the described process uses type information as an additional dimension to filter domain-specific te...