Browse Prior Art Database

Searching a Data File for Words in the Presence of Spelling Differences

IP.com Disclosure Number: IPCOM000081507D
Original Publication Date: 1974-Jun-01
Included in the Prior Art Database: 2005-Feb-28
Document File: 2 page(s) / 14K

Publishing Venue

IBM

Related People

Atrubin, AJ: AUTHOR [+3]

Abstract

The growing use of large data-base systems requires the development of efficient searching methods for finding data associated with an entered word, or phrase. Many techniques have been found for performing this function where a file word exactly matches the entered word. In many important cases, however, the entered word will not exactly match any word in the data file, because of input typographical errors, character recognition errors, data-transmission errors, and so forth. In these situations, it would be desirable to find the data-file word which best matches the entered word.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 52% of the total text.

Page 1 of 2

Searching a Data File for Words in the Presence of Spelling Differences

The growing use of large data-base systems requires the development of efficient searching methods for finding data associated with an entered word, or phrase. Many techniques have been found for performing this function where a file word exactly matches the entered word. In many important cases, however, the entered word will not exactly match any word in the data file, because of input typographical errors, character recognition errors, data-transmission errors, and so forth. In these situations, it would be desirable to find the data-file word which best matches the entered word.

Efficient searching for best matches in a file may be accomplished by a different representation of the file words to be matched. In conventional files, words are stored in a memory in the form of bit strings representing individual characters. The word "CENTER", e.g., is a binary EBCDIC code for a "C", followed by an EBCDIC code for an "E", etc. The present technique uses an expanded code which specifies both individual characters and their positions within a word. The table below shows a storage area having a single-bit position for every letter of the alphabet, for every possible position in a word. A 0 in the C1 column, for example, indicates that the initial character of that word is a C, while a 0 in the E2 column shows that E is the second character of that word. The word center, then, has 0's in columns C1, E2, N3, T4, E5 and R6, and 1's in all other columns. A B C D E...A B C D E...M N O P Q...O T V...E N...R S 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 5 5 6 6 VTA CENTER 1 1 0 1 1...1 1 1 1 0...1 0 1 1 1...1 0 1...0 1...0 1 3 DENVER 1 1 1-0 1...1 1 1 1 0...1 0 1 1 1...1 1 0...0 1...0 1 1 EAMONS 1 1 1 1 0...0 1 1 1 1...0 1 1 1 1...0 1 1...1 0...1 0 6 DEPVER * * * * *
*. The word "DENVER" has 0's in columns D1, E2, N3, V4, E5, and R6, and 1's elsewhere. The word "EAMONS" has 0's in columns E1, A2, M3, O4, N5, and S6. Other words are similarly coded.

Now suppose that the trial word "DEPVER" is entered into this file. This word acts to select the table columns D1, E2, P3, V4, E5, and R6; all other columns are disregarded. In...