
# Selection of Optimum Indexing Keywords in Natural Text Samples

IP.com Disclosure Number: IPCOM000079492D
Original Publication Date: 1973-Jul-01
Included in the Prior Art Database: 2005-Feb-26
Document File: 3 page(s) / 19K

IBM

## Related People

Dennis, SF: AUTHOR

## Abstract

This automatic indexing method utilizes an "entropy" approach to the selection of indexing keywords.

This text was extracted from a PDF file.
At least one non-text object (such as an image or picture) has been suppressed.
This is the abbreviated version, containing approximately 53% of the total text.




Entropy is a measure of the uniformity with which items having a certain identity are distributed throughout a larger body of items.

In a collection or file of documents, the distinctiveness or selective retrievability of any document depends upon the presence therein of words having low relative entropy, i.e., words which occur in few, if any, other documents. Stated somewhat differently, the keyword that most uniquely identifies the contents of a given document within a file will be the word whose actual entropy value differs most widely from the maximum value which the entropy of that word could have if the total occurrences of the word in the file were distributed as evenly as possible among the file documents.

The general formula for the entropy of a word is:

$$E = \sum_{i=1}^{N} \frac{f(i)}{F} \log \frac{F}{f(i)} \qquad (1)$$

where "i" is the serial number of a document in the file, "f(i)" is the frequency of the word within that document, "F" is the total frequency of the word in the file, and "N" is the number of documents in the file.
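Formula (1) is straightforward to compute from per-document word counts. The following sketch is not part of the disclosure; natural logarithms are assumed, since the base is unspecified:

```python
import math

def word_entropy(freqs):
    """Entropy of a word per formula (1).

    freqs -- the per-document frequencies f(i) of the word;
    F is the word's total frequency in the file. Documents in which
    the word does not occur (f(i) = 0) contribute nothing to the sum.
    """
    F = sum(freqs)
    if F == 0:
        return 0.0
    return sum((f / F) * math.log(F / f) for f in freqs if f > 0)

# A word concentrated in a single document has entropy 0:
#   word_entropy([6, 0, 0, 0]) == 0.0
# A word spread evenly over four documents reaches log 4:
#   word_entropy([2, 2, 2, 2]) == math.log(4)
```

As the two commented examples show, low entropy corresponds to a word concentrated in few documents, which is exactly the property that makes a document selectively retrievable.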

The entropy of any given word chosen for indexing purposes would have its maximum possible value, if the frequency F of that word within the file were to differ by less than N from each of the values Nf(i). For simplicity, however, the maximum entropy, E(m), is herein determined for an assumed state wherein the word occurs with equal frequency in all documents. Under this assumption, the general equation (1) for entropy reduces to:

$$E(m) = \log F \qquad (2)$$
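Taken together, formulas (1) and (2) suggest ranking candidate keywords by the gap E(m) - E, the amount by which a word's actual entropy falls short of its maximum. A hypothetical sketch of such a ranking, with illustrative counts and function names that are not from the disclosure:

```python
import math

def entropy(freqs):
    # Formula (1): E = sum of (f(i)/F) * log(F/f(i)) over the documents.
    F = sum(freqs)
    return sum((f / F) * math.log(F / f) for f in freqs if f > 0) if F else 0.0

def keyword_scores(word_doc_freqs):
    """Score each word by E(m) - E = log F - E (formulas 2 and 1).

    A large gap marks a word whose occurrences are concentrated in few
    documents, and hence a good discriminating keyword for them.
    """
    scores = {}
    for word, freqs in word_doc_freqs.items():
        F = sum(freqs)
        if F > 0:
            scores[word] = math.log(F) - entropy(freqs)
    return scores

# Toy 4-document file (illustrative counts only):
scores = keyword_scores({
    "the":     [10, 12, 9, 11],  # spread evenly: small gap
    "entropy": [20, 0, 0, 0],    # concentrated:  gap = log 20
})
# scores["entropy"] > scores["the"]
```

Under this scoring, the evenly spread word scores low and the concentrated word scores high, matching the selection criterion stated above.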

In selecting an optimum set of words to index a file of documents, it is proposed herein to draw an analogy between this selection process and...