Browse Prior Art Database

Method to perform a Fast Search

IP.com Disclosure Number: IPCOM000132343D
Original Publication Date: 2005-Dec-08
Included in the Prior Art Database: 2005-Dec-08
Document File: 2 page(s) / 147K

Publishing Venue

IBM

Abstract

Disclosed is a solution for performing a fast search for a pattern in a datastream. The main idea is to use the positions of the last and before last occurrences of each letter in the pattern to minimize the number of comparisons. With this method, which requires a small amount of code, average processing time for practical applications is significantly reduced.

This text was extracted from a PDF file.
At least one non-text object (such as an image or picture) has been suppressed.
This is the abbreviated version, containing approximately 52% of the total text.

Page 1 of 2

Method to perform a Fast Search

The algorithm takes a datastream D of size n and a pattern P of size m as input and returns the position in D of the first occurrence of P.

1) First time Initialization

This initialization is not necessary every time a pattern is searched for, but once and for all, if we consider that all subsequent data processed uses the same alphabet. This step consists in allocating an array I having at least as many cells as there are letters in the alphabet (e.g. for 8-bit coded characters, the array size is 256) and initializing all values of this array to Ø, which is a special value indicating that the letter does not appear in P.

2) Input pattern preprocessing

This step is performed only once for a given pattern. It consists in determining two items of information about occurrences in P of each letter of the alphabet in which D and P are defined. These are, for each letter: the position of its last occurrence, and the difference between that position and the position of the occurrence before last. They are stored respectively in two arrays named I and O, in which each element corresponds to a specific letter. A special value Ø is defined, and signifies (when in I) the absence of the corresponding letter in P. If a given letter appears only once in P, then the corresponding value stored in O is m.

Once all values in I are initialized to Ø (see step 1), this step can be summarized by the following pseudo-code:

i ← 1
while i ≤ m

O(P[i]) ← I(P[i])

    I(P[i]) ← i
i ← i + 1
i ← 1
while i ≤ m
if O(P[i]) = Ø, then O(P[i]) ← m
else O(P[i]) ← I(P[i])- O(P[i])
i ← i + 1

where P[i] is the ith letter of the pattern P, and I(x ) and O(x ) are the elements in I and O corresponding to the letter x .

Thefirst while loop stores, in I, the positions of the letters' last occurrences in P, and, in O, the positions of their occurrences before last (Ø if the letter appears only once in P). The second loop stores, in O, the differences between these two values (or m if it contained Ø).

The computation time of this step is proportional to the size of the pattern, since the operat...