Publication Date: 2002-Jun-13
The IP.com Prior Art Database
Natural language processing that improves the performance of information systems.
Appeared in: Fifth Conference on Applied Natural Language Processing, 1997, Association for Computational Linguistics, 31 March -- 3 April, Washington D.C., U. S. A.
Mixed-Initiative Development of Language Processing Systems
Advanced Information Systems Center
202 Burlington Road
Historically, tailoring language-processing systems to specific domains and languages for which they were not originally built has required a great deal of effort. Recent advances in corpus-based manual and automatic training methods have shown promise in reducing the time and cost of this porting process. These developments have focused even greater attention on the bottleneck of acquiring reliable, manually tagged training data. This paper describes a new set of integrated tools, collectively called the Alembic Workbench, that uses a mixed-initiative approach to “bootstrapping” the manual tagging process, with the goal of reducing the overhead associated with corpus development. Initial empirical studies using the Alembic Workbench to annotate named entities demonstrate that these approaches can approximately double the production rate. As an added benefit, the combined efforts of machine and user produce domain-specific annotation rules that can be used to annotate similar texts automatically through the Alembic NLP system. The ultimate goal of this project is to enable end users to generate a practical domain-specific information extraction system within a single session.
In the absence of complete and deep text understanding, implementing information extraction systems remains a delicate balance between general theories of language processing and domain-specific heuristics. Recent developments in the area of corpus-based language processing systems indicate that the successful application of any system to a new task depends to a very large extent on the careful and frequent evaluation of the evolving system against training and test corpora. This has focused increased attention on the importance of obtaining reliable training corpora. Unfortunately, acquiring such data has usually been a labor-intensive and time-consuming exercise.
The goal of the Alembic Workbench is to dramatically accelerate the process by which language processing systems are tailored to perform new tasks. The philosophy motivating our work is to maximally reuse and re-apply every kernel of knowledge available at each step of the tailoring process. In particular, our approach applies a bootstrapping procedure to the development of the training corpus itself. By re-investing the knowledge available in the earliest training data to pre-tag subsequent untagged data, the Alembic Workbench can transform the process of manual tagging to one dominated by manual...