
Automatic Determination of the True Case of a Word

IP.com Disclosure Number: IPCOM000111524D
Original Publication Date: 1994-Feb-01
Included in the Prior Art Database: 2005-Mar-26
Document File: 2 page(s) / 89K

Publishing Venue

IBM

Related People

Brown, PF: AUTHOR [+4]

Abstract

Disclosed is a method for determining the proper capitalization of the letters in a word in the context of surrounding words. If a word appears with all of its letters in lower case, we say that its case profile is l*. If it appears with all of its letters in upper case, then its case profile is u*. If the letters of a word appear in a mixture of cases, then the case profile is a mixture of u's and l's ending with an asterisk. For example, the case profile for McDonald is ulul* and for IBMers, it is uuul*. Sometimes, the meaning of a word depends on its case profile. Thus, in CAT scan, the first word is an abbreviation of Computer Aided Tomography, but in cat gut, the first word simply refers to a cat.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 52% of the total text.


      Disclosed is a method for determining the proper capitalization
of the letters in a word in the context of surrounding words.  If a
word appears with all of its letters in lower case, we say that its
case profile is l*.  If it appears with all of its letters in upper
case, then its case profile is u*.  If the letters of a word appear
in a mixture of cases, then the case profile is a mixture of u's and
l's ending with an asterisk.  For example, the case profile for
McDonald is ulul* and for IBMers it is uuul*.  Sometimes, the
meaning of a word depends on its case profile.  Thus, in CAT scan,
the first word is an abbreviation of Computer Aided Tomography, but
in cat gut, the first word simply refers to a cat.
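
The case-profile notation can be illustrated with a short sketch. It assumes that the asterisk means "the last mark repeats to the end of the word", which is consistent with the examples l*, u*, ulul* and uuul* but is not stated outright in the text. The function name case_profile and the decision to skip non-letter characters are likewise assumptions made for illustration.

```python
def case_profile(word: str) -> str:
    """Return the case profile of `word`, for example 'McDonald' -> 'ulul*'.

    Assumption: the trailing asterisk means the final mark repeats for the
    rest of the word, so a trailing run of identical marks collapses to one.
    """
    # One mark per letter: 'u' for upper case, 'l' for anything else.
    # Non-letters (digits, hyphens, apostrophes) are skipped; the source
    # does not say how they should be treated.
    marks = ["u" if ch.isupper() else "l" for ch in word if ch.isalpha()]
    if not marks:
        return ""  # no letters, so no profile
    # Collapse the trailing run of identical marks into a single mark.
    while len(marks) > 1 and marks[-1] == marks[-2]:
        marks.pop()
    return "".join(marks) + "*"
```

Under these assumptions, case_profile("McDonald") gives ulul*, case_profile("IBMers") gives uuul*, and "CAT" and "cat" give u* and l* respectively.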

      The case profile that a word should have because of its meaning
is called its true case profile.  Well-edited text can still show a
word with a case profile other than its true one, because the
conventions of proper orthography (capitalizing the first word of a
sentence, for example) override it.  Poorly edited text, of the kind
common in electronic mail and in some old databases, can also show a
word with a case profile different from its true case profile.

      Let f be a file of characters making up some (well-edited)
collection of text.  The following algorithm replaces the case
profile of each letter sequence in f with its true case profile.

1.    Determine the inventory of distinct letter sequences in f.

2.    Determine the beginnings of sentences in f.

3.    For each distinct letter sequence, determine the frequency of
    each case profile when that letter sequence is not the first
    letter sequence in a sentence.

4.    For each distinct letter sequence, assign a value called its
    adjusted entropy.  If the number of times that the sequence
    appears in f other than at the beginning of a sentence exceeds
    some count threshold, then the adjusted entropy is the entropy of
    the distribution of case profiles for the sequence in positions
    other than the beginning of a sentence.  Otherwise, the adjusted
    entropy of the letter sequence is infinite.

5.    For each distinct letter sequence with an adjusted entropy less
    than some entropy threshold, assign an assumed case profile equal
    to the most frequent case profile among occurrences of the letter
    sequence in positions other than the beginning of a sentence.

6.    For each letter sequence in f, assign a true case profile
    according to the following rules:

    a.    If the letter sequence has an assumed case profile, then
        assign the assumed case profile as the true case profile for
        the letter sequence.

    b.    Otherwise, if the letter sequence is not at the beginning
 ...
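
Steps 3 through 5 above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: sentence splitting (step 2) is assumed to be done already, letter sequences are assumed to be compared case-insensitively, the entropy is taken in bits, and the names assumed_profiles, min_count and max_entropy, together with their default values, are hypothetical. The rules of step 6 are not sketched because the text is truncated at that point.

```python
import math
from collections import Counter, defaultdict


def case_profile(word):
    # Assumed reading of the notation: one mark per letter, with the
    # trailing run of identical marks collapsed and followed by '*'.
    marks = ["u" if ch.isupper() else "l" for ch in word if ch.isalpha()]
    while len(marks) > 1 and marks[-1] == marks[-2]:
        marks.pop()
    return "".join(marks) + "*" if marks else ""


def assumed_profiles(sentences, min_count=2, max_entropy=0.5):
    """Map each letter sequence to its assumed case profile (steps 3-5).

    `sentences` is a list of sentences, each a list of words with
    punctuation already removed. Both thresholds are illustrative values.
    """
    # Step 3: count case profiles, ignoring the first word of each sentence.
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word in sentence[1:]:
            counts[word.lower()][case_profile(word)] += 1

    assumed = {}
    for letters, profile_counts in counts.items():
        total = sum(profile_counts.values())
        # Step 4: adjusted entropy is the entropy of the profile
        # distribution if the sequence is frequent enough, else infinite.
        if total > min_count:
            entropy = -sum(
                (n / total) * math.log2(n / total)
                for n in profile_counts.values()
            )
        else:
            entropy = math.inf
        # Step 5: a low-entropy sequence gets its most frequent profile.
        if entropy < max_entropy:
            assumed[letters] = profile_counts.most_common(1)[0][0]
    return assumed
```

With these illustrative thresholds, a sequence such as "ibm" that always appears as IBM away from sentence starts would receive the assumed profile u*, while a sequence such as "cat" that appears both as CAT and as cat has a high entropy and would receive no assumed profile.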