Document Categorization using Lexical Analysis and Fuzzy Sets

IP.com Disclosure Number: IPCOM000108622D
Original Publication Date: 1992-Jun-01
Included in the Prior Art Database: 2005-Mar-22
Document File: 1 page(s) / 47K

Publishing Venue

IBM

Related People

Hawkins, R: AUTHOR [+2]

Abstract

A program is disclosed which extracts categorization information from the text in database records rather than the database fields, and uses fuzzy set theory to represent uncertainty in the categories. Applications include summary reports of document databases using non-database categories, and preparation of database keys for searches based on similar categories.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 71% of the total text.

      Information extraction operates at the lowest, or lexical,
level of text analysis.  The input is unrestricted natural-language
text.  The output is a stream of lexical tokens similar to those
produced by a compiler's lexical analyzer.  The tokens are the
original input words, with meaningless words (those on a
user-supplied list) discarded and the remaining words mechanically
converted to word stems.  A Soundex algorithm is also applied to the
stems to accommodate misspellings and typographical errors.  High
accuracy is not required at this level, since the categorization
stage takes a statistical approach.
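
The lexical stage described above can be sketched in Python as
follows.  This is a minimal illustration, not the disclosed program:
the suffix-stripping stemmer is a crude stand-in (the disclosure does
not say how stems are formed), the Soundex variant shown is the
common American four-character code, and the names tokenize, stem,
soundex and SUFFIXES are all hypothetical.

```python
import re

# Standard American Soundex digit assignments.
_CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        _CODES[ch] = digit


def soundex(word):
    """Return the four-character Soundex code of a word."""
    word = "".join(ch for ch in word.upper() if "A" <= ch <= "Z")
    if not word:
        return ""
    digits = []
    prev = _CODES.get(word[0], "")
    for ch in word[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with equal codes
        code = _CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so repeats across vowels count
    return (word[0] + "".join(digits) + "000")[:4]


# Assumption: a toy stemmer that strips a few common English suffixes.
SUFFIXES = ("ing", "ed", "es", "s")


def stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def tokenize(text, stop_words):
    """Drop stop words, stem what remains, and Soundex-encode each stem."""
    tokens = []
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in stop_words:
            continue
        tokens.append(soundex(stem(word)))
    return tokens


print(tokenize("The printer is printing", {"the", "is"}))
# ['P653', 'P653']
```

In this sketch "printer", "printing" and the misspelling "printr"
all reduce to the same token, which is the kind of tolerance the
Soundex step is meant to provide.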

      The program's knowledge of a vocabulary domain is provided by
two user-prepared lists, collectively referred to as the
"dictionary."  The first list specifies the categories by which the
documents will be classified.  These categories are treated as
classical fuzzy sets.  The second list gives the words whose
appearance in a document implies membership in one or more of the
fuzzy sets.
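
One possible in-memory shape for such a dictionary is sketched below
in Python.  Everything specific here is an assumption made for
illustration: the category names, the words, the membership grades,
and the names CATEGORIES, WORD_MEMBERSHIP and categorize are
hypothetical.  Plain words are used as keys for readability, whereas
the program described above would key on Soundex-encoded stems.  The
grades are combined with the standard fuzzy-set union (maximum),
which is a common choice but is not stated in the disclosure.

```python
# First list: the categories (fuzzy sets) used to classify documents.
CATEGORIES = ("hardware", "software")

# Second list: words whose appearance implies membership in one or
# more categories, each with a hypothetical grade in [0, 1].
WORD_MEMBERSHIP = {
    "disk": {"hardware": 0.9},
    "driver": {"hardware": 0.3, "software": 0.8},
    "compiler": {"software": 1.0},
}


def categorize(tokens):
    """Return a document's membership grade in every category.

    Assumption: grades from different words are merged with the fuzzy
    union (max); tokens absent from the dictionary contribute nothing.
    """
    grades = {category: 0.0 for category in CATEGORIES}
    for token in tokens:
        for category, grade in WORD_MEMBERSHIP.get(token, {}).items():
            grades[category] = max(grades[category], grade)
    return grades


print(categorize(["disk", "driver", "unknown"]))
# {'hardware': 0.9, 'software': 0.8}
```

A document can thus belong to several categories at once, each to a
different degree, rather than being forced into exactly one.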

      A search of the dictionary is performe...