Browse Prior Art Database

Prefilter index for efficient irregular access to irregular data

IP.com Disclosure Number: IPCOM000011783D
Original Publication Date: 2003-Mar-14
Included in the Prior Art Database: 2003-Mar-14
Document File: 4 page(s) / 58K

Publishing Venue

IBM

Abstract

This discloses a variant of 'superimposed coding' for access to irregular data. It is applicable to areas such as XML, JMS messages, and tuple spaces.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 28% of the total text.

Page 1 of 4

Prefilter index for efficient irregular access to irregular data

Background

Most database and messaging systems today are tuned to fairly rigid data with a fairly rigid pattern of access. This is to a large extent embodied in the database schema, with preferred access patterns optimized by use of indexes. Messaging systems often have even more rigid query flexibility.

     While access outside the basic framework is possible, it tends to be quite inefficient, involving a full linear scan of the data, and test of the query against each record.

     There is increasing need for more flexible data and access patterns. This can be seen in XML data, Java* Messaging Service (JMS) messaging, and in tuple spaces. This disclosure suggests a scheme for such flexible data and access. The way the scheme was used may be different for these different applications, but the principle is the same.

     The system described is a variant of "superimposed coding" (eg page.60 of "Information Retrieval" by Frakes and Baeza-Yates describes Gustafson's method) used for less regular data.

Outline design

     The principle is to use a prefilter that can very quickly eliminate many records (messages/tuples) from consideration. The prefilter information may be held with the records, or in a separate index file. The scheme does not eliminate a sequential scan (of the index or of the prefilter), but makes it very much less expensive.

     The prefilter uses a signature of each record, and a signature of the query. There is a potential match when every bit in the query signature is set in the record signature. Signature generation does not involve any pre-knowlege of the field names involved in the records or in the query. The set of names need not be the same for each record, or even the number of name/value pairs.

More detailed design

     We assume each record consists of a set of name/value string pairs. (This easily extends to typed data.) Initially we also assume that the query is a conjunctive equality query (that is, consists only of the AND of terms, each term of the form name='value'), and that a query against data missing in the record is assumed to be a NO match.

     The signature of a record is an N bit array (eg 64 bits). Each name/value pair is mapped using a hashing function (see below) into an N bit array. The hash function is chosen such that only a small number of bits (eg 3) is set on in result. The signature of a record is the OR of the hash of each field.

     The signature of a query is an N bit array. It is constructed from name/value pairs in the same way that the signature is constructed from a record, using the same hashing function. [In fact, a conjunctive equality query can considered equivalent to a record.]

     The signature of the record is said to match the signature of the query if every bit set in the signature of the query is also set in the signature of the record. In this case, the record must be properly matched against the query to guard against false positives. T...