Alembic Disclosure Number: IPCOM000008421D
Publication Date: 2002-Jun-13
Document File: 8 page(s) / 55K

Publishing Venue

The Prior Art Database

Related People

MITRE Technology Transfer Office: SUBMITTER


Natural language processing that improves the performance of information systems.

This text was extracted from a Microsoft Word document.
This is the abbreviated version, containing approximately 9% of the total text.

Appeared in: Fifth Conference on Applied Natural Language Processing, 1997, Association for Computational Linguistics, 31 March -- 3 April, Washington D.C., U. S. A.

Mixed-Initiative Development of Language Processing Systems

David Day, John Aberdeen, Lynette Hirschman,

Robyn Kozierok, Patricia Robinson and Marc Vilain

Advanced Information Systems Center

The MITRE Corporation

202 Burlington Road

Bedford, Massachusetts 01730 U.S.A.




Historically, tailoring language-processing systems to specific domains and languages for which they were not originally built has required a great deal of effort. Recent advances in corpus-based manual and automatic training methods have shown promise in reducing the time and cost of this porting process. These developments have focused even greater attention on the bottleneck of acquiring reliable, manually tagged training data. This paper describes a new set of integrated tools, collectively called the Alembic Workbench, that uses a mixed-initiative approach to “bootstrapping” the manual tagging process, with the goal of reducing the overhead associated with corpus development. Initial empirical studies using the Alembic Workbench to annotate named entities demonstrate that these approaches can approximately double the production rate. As an added benefit, the combined efforts of machine and user produce domain-specific annotation rules that can be used to annotate similar texts automatically through the Alembic NLP system. The ultimate goal of this project is to enable end users to generate a practical domain-specific information extraction system within a single session.

1. Introduction

In the absence of complete and deep text understanding, implementing information extraction systems remains a delicate balance between general theories of language processing and domain-specific heuristics. Recent developments in the area of corpus-based language processing systems indicate that the successful application of any system to a new task depends to a very large extent on the careful and frequent evaluation of the evolving system against training and test corpora. This has focused increased attention on the importance of obtaining reliable training corpora. Unfortunately, acquiring such data has usually been a labor-intensive and time-consuming exercise.

The goal of the Alembic Workbench is to dramatically accelerate the process by which language processing systems are tailored to perform new tasks. The philosophy motivating our work is to maximally reuse and re-apply every kernel of knowledge available at each step of the tailoring process. In particular, our approach applies a bootstrapping procedure to the development of the training corpus itself. By re-investing the knowledge available in the earliest training data to pre-tag subsequent untagged data, the Alembic Workbench can transform the process of manual tagging to one dominated by manual...
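
The bootstrapping idea described above can be illustrated with a minimal sketch. This is not the Alembic Workbench's implementation: the function names (`learn_lexicon`, `pre_tag`, `bootstrap`), the `review` callback standing in for the human annotator, and the toy sentences are all assumptions made for illustration, and a naive phrase lexicon is swapped in for the annotation rules the paper refers to. The sketch only shows the loop structure: everything tagged so far is reused to pre-tag the next batch, and the annotator's work is counted as corrections to those proposals.

```python
from collections import Counter, defaultdict

def learn_lexicon(tagged_docs):
    """Map each phrase to the label it most often received in the tagged data."""
    counts = defaultdict(Counter)
    for doc in tagged_docs:
        for phrase, label in doc["entities"]:
            counts[phrase][label] += 1
    return {phrase: c.most_common(1)[0][0] for phrase, c in counts.items()}

def pre_tag(text, lexicon):
    """Propose (phrase, label) pairs for known phrases (naive substring match)."""
    return {(phrase, label) for phrase, label in lexicon.items() if phrase in text}

def bootstrap(batches, review):
    """Tag batches in order, re-learning from all earlier tagged data each time.

    `review(text, proposals)` is a hypothetical stand-in for the human
    annotator; it returns the corrected set of (phrase, label) pairs.
    """
    tagged, stats = [], []
    for batch in batches:
        lexicon = learn_lexicon(tagged)  # reuse knowledge from earlier batches
        for text in batch:
            proposals = pre_tag(text, lexicon)
            final = review(text, proposals)
            corrections = len(proposals ^ final)  # tags added plus tags removed
            stats.append((len(final), corrections))
            tagged.append({"text": text, "entities": sorted(final)})
    return tagged, stats

# Toy data (invented for this example); GOLD simulates a perfect annotator.
GOLD = {
    "Acme opened an office in Boston.": {("Acme", "ORG"), ("Boston", "LOC")},
    "Acme moved from Boston to Denver.": {
        ("Acme", "ORG"), ("Boston", "LOC"), ("Denver", "LOC")},
}
batches = [["Acme opened an office in Boston."],
           ["Acme moved from Boston to Denver."]]
tagged, stats = bootstrap(batches, lambda text, proposals: GOLD[text])
print(stats)  # (entities in document, manual corrections needed) per document
```

In this toy run the first document needs every tag entered by hand, while the second needs only the one tag the lexicon had not yet seen, which is the effect the bootstrapping procedure aims for.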