Browse Prior Art Database

High-performance string search using differences between frequencies with which characters are generally used.

IP.com Disclosure Number: IPCOM000013060D
Original Publication Date: 2000-Sep-01
Included in the Prior Art Database: 2003-Jun-12
Document File: 4 page(s) / 33K

Publishing Venue

IBM

Abstract

Disclosed is a algorithm for high-performance string search using the differences between the frequencies with which characters are generally used in a language. "String search" means investigating if there is a coincident part with a string, "pattern", in a long string, "text". In addition, it returns the location of the part if there. A general algorithm for string search is as follows:

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 47% of the total text.

Page 1 of 4

  High-performance string search using differences between frequencies with which characters are generally used.

    Disclosed is a algorithm for high-performance string search using the differences between the
frequencies with which characters are generally used in a language.

"String search" means investigating if there is a coincident part with a string, "pattern", in a
long string, "text".

In addition, it returns the location of the part if there.

A general algorithm for string search is as follows:

1. Superpose the pattern on the text, and investigate if they coincide with each other by
comparing each character from left to right.
2. If they do not coincide, repeat the same way after moving the location for superposing the
pattern right by one character.

Now, there are quite a little differences between the frequencies with which characters are
generally used in a language.

For example, as to English, the use frequency of "e" is about 12%, and the one of "j" is about
0.05%.

This inventory is searching a pattern quickly by comparing each characters in order of the use
frequency (from the least to the most), instead of comparing each character from left to right.

In case of searching "eject" in a text:

- The general algorithm compares each character in order: "e"->"j"->"e"->"c"->"t".
- This inventory compares each character in order: "j"->"c"->"t"->"e"->"e" because each use
frequency is "e":12%, "j":0.05%, "c":4% and "t":8%.

The number of comparing can be reduced by using this inventory.

In case that some English news files on a web site including 1,500,000 characters are used as a
text, and 30 English words selected at random are used as a pattern, the numbers of comparing are
reduced by 7.9% at most.

The average of the reductions is 1.6%.

Below are the sample programs:

      General algorithm for string search
/********************************************************

Function name:
naiveMatch

Format:

char *naiveMatch(char *text, int len, char *pattern, int patlen);

Arguments:
text (In)Text
len (In)Text length
pattern (In)Pattern
patlen (In)Pattern length

Return value:

       Pointer if pattern found in text.
0 if pattern not found in text.
*********************************************************/

char *naiveMatch(char *text, int len, char *pattern, int patlen)
{
char c;
char *tp, *pp;
char *textEnd;
char *patEnd;
int count=0; // Number of comparing

if (patlen == 0) return text;
if (len < patlen) return NULL;

textEnd = text + len - patlen + 1;
patEnd = pattern + patlen;

c = *pattern++;

1

Page 2 of 4

while (text != textEnd) {
if (*text++ == c) {
tp = text;
pp = pattern;
do if (pp == patEnd) {
printf("%d\n", count);
return text - 1;

} else {
count++;
}

           while (*tp++ == *pp++);
}

}

printf("%d\n", count);
return NULL;

}

      This inventory
/********************************************************

Function name:
frqMatch

Format:

char *frqMatch(char *text, int len, char *pattern, int patlen);

Arguments:
text (In)Text
len (In)Text length
pattern (In)Pattern
patlen (In)Pattern length

Return valuse:...