Automatic Taxonomy Generator System

IP.com Disclosure Number: IPCOM000021985D
Original Publication Date: 2004-Feb-18
Included in the Prior Art Database: 2004-Feb-18
Document File: 1 page(s) / 32K

Publishing Venue

IBM

Abstract

This article describes the Automatic Taxonomy Generator (ATG), a core component of the Lotus Discovery Server (LDS). ATG generates taxonomies from a given collection of documents. The new contribution described in this article is the implementation of the ATG system, which is based on combining "merge" and "split" moves in a local search algorithm. Although each of these moves has been described independently in the literature, the combined algorithm, augmented by feature selection and post-processing steps, is a novel concept. The core algorithm begins by defining an initial feasible solution, generated using a clustering algorithm, which corresponds to a taxonomy with one layer. This solution is then progressively improved by a sequence of complex local search moves. Each move either merges clusters to form a new intermediate cluster or splits clusters to form new leaf clusters. Different feature selection algorithms are used while performing each merge or split. The search stops when a local maximum of the objective function is reached. Once ATG has generated an initial taxonomy, LDS provides tools to edit and rearrange it.
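
The control flow of such a merge/split local search can be illustrated with a minimal sketch. It is not the ATG implementation: the objective function, the use of one-dimensional numbers in place of documents, and every name below are hypothetical. It also works on a flat partition, whereas the merge described above creates a new intermediate cluster in a hierarchy, and it omits feature selection and post-processing entirely.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterator

Cluster = FrozenSet[float]
Partition = FrozenSet[Cluster]


def merge_moves(p: Partition) -> Iterator[Partition]:
    """Yield every partition reachable by merging two clusters."""
    for a, b in combinations(p, 2):
        yield (p - {a, b}) | {a | b}


def split_moves(p: Partition) -> Iterator[Partition]:
    """Yield every partition reachable by cutting one cluster in two
    at a boundary between its sorted members."""
    for c in p:
        items = sorted(c)
        for i in range(1, len(items)):
            yield (p - {c}) | {frozenset(items[:i]), frozenset(items[i:])}


def toy_objective(p: Partition, penalty: float = 1.0) -> float:
    """Hypothetical objective: tight clusters are rewarded, and each
    extra cluster costs `penalty`. Higher is better."""
    sse = 0.0
    for c in p:
        mean = sum(c) / len(c)
        sse += sum((x - mean) ** 2 for x in c)
    return -sse - penalty * len(p)


def local_search(start: Partition,
                 objective: Callable[[Partition], float]) -> Partition:
    """Greedy hill climbing: apply the best merge or split until no
    move improves the objective, i.e. a local maximum is reached."""
    current, current_score = start, objective(start)
    while True:
        neighbours = [*merge_moves(current), *split_moves(current)]
        if not neighbours:
            return current
        best = max(neighbours, key=objective)
        best_score = objective(best)
        if best_score <= current_score:
            return current
        current, current_score = best, best_score


points = [1.0, 1.1, 1.2, 5.0, 5.1, 5.2]
# Two different starting solutions: every point alone, or all together.
singletons = frozenset(frozenset([x]) for x in points)
one_cluster = frozenset([frozenset(points)])
print(local_search(singletons, toy_objective))
print(local_search(one_cluster, toy_objective))
```

Starting from singletons, the search proceeds only by merges; starting from a single cluster, it proceeds by a split. On this toy data both runs stop at the same two-cluster partition, but in general a local search of this kind guarantees only a local maximum, which is why the quality of the initial solution matters.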

This is the abbreviated version, containing approximately 51% of the total text.

In recent years, significant attention has been paid to the dramatic growth of text documents on the Web. Comparatively less attention, however, has been paid to documents held within the Intranets of large organizations. One important difference between Internet and Intranet documents is that Intranet documents include not only Web pages but also legacy documents such as PDF and PostScript files, Microsoft Word documents, and presentation files. Defining an appropriate organization of these documents is an important problem with considerable implications in several areas, including expertise location and topic detection. In addition to conventional search, such an organization of documents can assist in extracting business and market intelligence.

This article describes the Automatic Taxonomy Generator (ATG), a core component of the Lotus Discovery Server (LDS). ATG generates taxonomies from a given collection of documents. Preprocessing of different document formats is accomplished by other components of LDS, which generate a conventional bag-of-words representation for the documents. In this representation documents are composed of tokens, which can be either words or phrases. ATG consists of two main components: (i) statistical models for the taxonomy and (ii) search algorithms that find the best taxonomy from a set of documents. ATG casts the problem of generating taxonomies as a problem of model selection where the final model selected is the one with the highest posterior probabilit...
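
The bag-of-words representation and the idea of choosing the candidate with the highest posterior probability can be illustrated with a toy sketch. The abbreviated text does not describe ATG's statistical models for taxonomies, so the candidates below are simple hypothetical unigram models, and all names, probabilities, and priors are assumptions made for illustration only.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

# A candidate model: (token probabilities, log prior). Hypothetical structure.
Model = Tuple[Dict[str, float], float]


def bag_of_words(tokens: List[str]) -> Counter:
    """Count tokens; a token may be a single word or a phrase."""
    return Counter(tokens)


def log_posterior(bags: List[Counter], model: Model) -> float:
    """Unnormalised log posterior: log prior plus multinomial
    log-likelihood (the constant multinomial coefficient is dropped).
    Assumes every observed token has a nonzero probability in the model."""
    token_probs, log_prior = model
    log_likelihood = sum(
        count * math.log(token_probs[token])
        for bag in bags
        for token, count in bag.items()
    )
    return log_prior + log_likelihood


def select_model(bags: List[Counter], candidates: Dict[str, Model]) -> str:
    """Return the name of the candidate with the highest log posterior."""
    return max(candidates, key=lambda name: log_posterior(bags, candidates[name]))


docs = [
    ["lotus", "discovery server", "lotus"],
    ["discovery server", "taxonomy"],
]
bags = [bag_of_words(d) for d in docs]

candidates = {
    "A": ({"lotus": 0.5, "discovery server": 0.3, "taxonomy": 0.2}, math.log(0.5)),
    "B": ({"lotus": 0.1, "discovery server": 0.1, "taxonomy": 0.8}, math.log(0.5)),
}
print(select_model(bags, candidates))
```

With equal priors, the comparison reduces to likelihood, so the candidate whose token probabilities better match the observed counts is selected. A search algorithm such as the one ATG uses would generate and score candidates incrementally rather than enumerate a fixed dictionary of them.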