Browse Prior Art Database

Faster Like Predicate for Database Manager

IP.com Disclosure Number: IPCOM000112363D
Original Publication Date: 1994-May-01
Included in the Prior Art Database: 2005-Mar-27
Document File: 6 page(s) / 154K

Publishing Venue

IBM

Related People

Cross, MR: AUTHOR [+2]

Abstract

Many database operations involve determining if a data string is "like" a pattern. Examples are Create Database and Select statements with WHERE clauses. This determination is made so frequently that speed is vital. In early versions of OS/2* database managers the pattern matching algorithms made all decisions based on prior state, on current pattern character, and on current data character. The algorithm discussed here for use in later versions uses "look ahead" methods to gain execution speed.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 31% of the total text.

Faster Like Predicate for Database Manager

      Many database operations involve determining if a data string
is "like" a pattern.  Examples are Create Database and Select
statements with WHERE clauses.  This determination is made so
frequently that speed is vital.  In early versions of OS/2* database
managers the pattern matching algorithms made all decisions based on
prior state, on current pattern character, and on current data
character.  The algorithm discussed here for use in later versions
uses "look ahead" methods to gain execution speed.

      Given a data string with zero or more characters and a pattern
also with zero or more characters, does the data string match the
pattern?  The pattern can contain two types of wildcard characters.
The underscore character in the pattern matches any single data
string character against which it might be compared.  The percent
character in the pattern matches any zero or more characters (i.e.,
any substring) of the data string.  For example, the pattern "%ilk%"
matches the data strings "Milk", "silks", "ilk", and "milky".  A more
complex pattern could contain multiple imbedded percents.

      For the solution presented here suppose the following are
integers - int match = TRUE; /* FALSE ==> unLIKE */ int j,
size_chunk, not_found, len, leading_percent = FALSE; unsigned char
ch;

      Any solution must correctly handle three implied boundary
conditions.  The first is that a pattern consisting of only percents
characters matches any data string even the empty data string.  The
second is that a pattern consisting of one underscore character will
match any data string consisting of exactly one character.  Third, an
empty pattern matches an empty data string.  IF both pattern and data
string have length zero (that is if(len_d == 0) && (len_p == 0)),
return EQUAL.  Letting integers len_d and len_p denote the lengths of
data and pattern strings, the logic may be represented as follows:

      Let d and p be character pointers to the data string and to the
pattern string.

      IF the last pattern character isn't a percent, there is
substring p' (possibly empty) of the pattern p such that either p is
p' or the first character of p' is the last character of p preceded
by a percent.  If there is no substring d' of the data string d such
that d' exactly matches p', then p is not "LIKE" d. In other words,
if the pattern doesn't end with a percent then the end part
(following the last percent) of the pattern EXACTLY matches the end
part of the data string else the strings are not "LIKE".  Across the
majority of cases where the pattern doesn't end with a percent,
starting the comparison at the right end gives a performance gain by
trying to arrive at an early mismatch.  If the pattern doesn't end in
percent, starting at the ends, compare a data character against the
corresponding pattern character.  If on a comparison, the pattern
character is neither an underscore...