Automated Annotator Creation Based On Comparison of Corpora

IP.com Disclosure Number: IPCOM000236406D
Publication Date: 2014-Apr-24
Document File: 3 page(s) / 65K

Publishing Venue

The IP.com Prior Art Database

Abstract

A method and system is disclosed for automatically creating one or more annotators by contrasting a field-specific corpus of data against another non-field-related corpus.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 51% of the total text.

Disclosed is a method and system for automatically creating one or more annotators by contrasting a field-specific corpus of data against another non-field-related corpus.

In an embodiment of the present invention, the method and system creates one or more annotators by building a field-specific word list (dictionary) from two or more corpora. For example, the method and system may use a control corpus and a target corpus to create a field-specific dictionary. The control corpus and the target corpus can come from unrelated fields, so as to minimize overlap of terminology, or from related fields, in order to further refine the results used to create the field-specific dictionary.

Figure 1 illustrates a flowchart describing the steps for creating the field-specific dictionary in accordance with the method and system disclosed herein.

Figure 1

In accordance with the steps illustrated in the flowchart, the method and system creates the field-specific dictionary by analyzing the text of both the control corpus and the target corpus. The analysis produces results that include, but need not be limited to, a list of the words in the control corpus and in the target corpus, a total count of each word's occurrences in the control corpus and in the target corpus, and an average frequency of each word per document in the control corpus and in the target corpus. The analysis can be performed using a combination of text analysis techniques, which can include one or more annotators created for the purpose of the analysis.
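The analysis step can be sketched in Python as follows. This is a minimal illustration rather than the disclosed implementation: the function name `analyze_corpus`, the lowercase alphabetic tokenizer, and the reading of "average frequency" as mean occurrences per document are all assumptions made for the example.

```python
import re
from collections import Counter


def analyze_corpus(documents):
    """Analyze one corpus, given as a list of document strings.

    Returns (vocabulary, total_counts, avg_per_doc). Hypothetical helper:
    the tokenizer and the averaging rule are assumptions, not taken from
    the disclosure.
    """
    total_counts = Counter()
    for text in documents:
        # Assumed tokenization: lowercase runs of ASCII letters.
        total_counts.update(re.findall(r"[a-z]+", text.lower()))

    n_docs = len(documents)
    # Assumed definition: total occurrences divided by number of documents.
    avg_per_doc = (
        {word: count / n_docs for word, count in total_counts.items()}
        if n_docs
        else {}
    )
    return set(total_counts), total_counts, avg_per_doc
```

The same function would be called once for the control corpus and once for the target corpus, so that the two sets of results can later be compared.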

After the analysis is complete, the method and system compares the control corpus with the target corpus in order to create the field-specific dictionary. In this comparison, words that appear only in the target corpus are treated as field-specific terminology. These words are determined by the set-theoretic difference of the target corpus and the control corpus, that is, the words present in the target corpus but absent from the control corpus. In addition, the method and system considers words common to both the control corpus and the target corpus t...
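The set-theoretic difference can be sketched in Python as follows. The function name `field_specific_terms` and the sample vocabularies are hypothetical, and the sketch covers only the difference step, not the treatment of words common to both corpora.

```python
def field_specific_terms(target_vocab, control_vocab):
    """Return words found in the target vocabulary but not in the control.

    Hypothetical helper: both arguments are iterables of word strings,
    and the result is the set difference target minus control.
    """
    return set(target_vocab) - set(control_vocab)


# Illustrative vocabularies, invented for this example.
target = {"the", "patient", "stent", "angioplasty"}
control = {"the", "patient", "engine", "road"}
print(sorted(field_specific_terms(target, control)))
# prints ['angioplasty', 'stent']
```

Using sets keeps the comparison independent of word order and of how often each word occurs; frequency information would have to be carried separately if it is needed for later refinement.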