Dismiss
The Prior Art Database and Publishing service will be updated on Sunday, February 25th, from 1-3pm ET. You may experience brief service interruptions during that time.
Browse Prior Art Database

# Efficient String Searching Algorithm Using N-GRAM Statistics

IP.com Disclosure Number: IPCOM000035944D
Original Publication Date: 1989-Aug-01
Included in the Prior Art Database: 2005-Jan-28
Document File: 6 page(s) / 45K

IBM

## Related People

Kitamura, H: AUTHOR [+2]

## Abstract

This text was extracted from a PDF file.
At least one non-text object (such as an image or picture) has been suppressed.
This is the abbreviated version, containing approximately 30% of the total text.

Page 1 of 6

Efficient String Searching Algorithm Using N-GRAM Statistics

Conventional fast string matching methods, such as the KMP method and the BM method, analyze a pattern first and calculate how many characters should be skipped without inspecting the text. However, they do not consider the characteristics of the texts to be searched. The texts, which may be natural language texts, computer programs, or data, all have different features. Our approach takes account of the characteristics of texts by using N-GRAM statistics. The stochastic model of our approach will minimize the total searching cost.

Configuration: Fig. 1 System Configuration: SEE ORIGINAL FOR DIAGRAM

The N-Gram data is a table of probabilities of sequences of N characters in texts.

Our system has two major components. The pattern analyzer preprocesses the input pattern and calculates the sequence of characters to be compared and how many characters should be skipped if the comparison fails. The text matcher inspects the text according to the output of the pattern analyzer.

In the next section, we describe the mechanism of the text matcher, then we estimate the cost of searching, and, finally, we describe an efficient algorithm for calculating the character-searching sequence that will minimize the total searching cost. Text Matcher

We assume that the text matcher needs the following three tables: Character Order Table which contains the characters in the pattern in the order to be compared. Relative Location Table which contains the relative distance from the previous character in the comparison. Skip Table which contains the number of characters that can be skipped if the comparison fails.

For convenience, we assume that text(i) means the i-th character in the input text, and that pattern(i) means the i-th character in the pattern. The total number of characters in the text is N, and the length of the pattern is n. The variable name text_index is used for an index to the input text, and the variable name pattern_index is used for an index to the pattern. tbl1 means the character order table, tbl2 means the relative location table, and tbl3 means the skip table.

The algorithm for the text matcher is shown below.

text_index=tbl2(1);

first_char=tbl(1); loop1:

pattern_index=1; loop2:

if End_of_Text? then return;

if first_charµ=text(text_index) then do;

1

Page 2 of 6

text_index=text_index+tbl3(1);

goto loop2;

end;

if n=1 then do;

PatternHasMatched!

text_index=text_index+tbl3(2);

goto loop2;

end;

old_text_index=text_index;

pattern_index=2;

text_index=text_index+tbl2(2); loop3:

if tbl1(pattern_index)=text(text_index) then do;

if pattern_index = n then do;

PatternHasMatched!

text_index=old_text_index+tbl3(pattern_);

goto loop1;

end;

text_index=text_index+tbl2(pattern_index);

pattern_index=pattern_index+1;

goto loop3;

end;

text_index=old_text_index+tbl3(pattern_index);

goto loop1; Pattern Analyzer...