Browse Prior Art Database

Method for efficient matching large number of filemasks against a filesystem

IP.com Disclosure Number: IPCOM000167488D
Original Publication Date: 2008-Feb-15
Included in the Prior Art Database: 2008-Feb-15
Document File: 6 page(s) / 230K

Publishing Venue

IBM

Abstract

The idea is to categorize the file masks (L) into several groups and introduce optiimizations for most of the file masks. Based on that categorization, the first speed can be gained by using hash tables where possible to identify if a file name from F set possibly matches at least one of file masks mentioned in the L set. If all the file masks fit into the optimized groups, then no checking against entire L set occurs and the speed up is enormous, especially if the size of that set exceeds 1000 files.

This text was extracted from a PDF file.
At least one non-text object (such as an image or picture) has been suppressed.
This is the abbreviated version, containing approximately 45% of the total text.

Page 1 of 6

Method for efficient matching large number of filemasks against a filesystem

Introduction:

A good understanding of three concepts from computer science is needed to grasp the idea of described method properly. Namely lazy evaluation, hash table and hash function.

Lazy evaluation is the technique of delaying a computation until such time as the result of the computation is known to be needed.

In computer science, a hash table, or a hash map, is a data structure that associates keys with values. It works by transforming the key using a hash function into a hash, a number that is used to index into an array to locate the desired location (called slot or bucket) where the values should be. In our example hash tables have a static number of slots, 8, which means that the index is from 0 to 7. The most essential advantage of hash table is the fact that the time spent searching for the required data is independent of the number of items stored.

A hash function is a reproducible method of turning some kind of data into a (relatively) small number that may serve as a digital "fingerprint" of the data.

Example of a hash table

Now, let's follow through simple example. Consider expression mingw*. Its prefix (prefix = characters before asterisk) is inserted into Hash table HT.

1

[This page contains 1 picture or other non-text object]

Page 2 of 6

Mapping entry in HT

HT may look like on the picture above.

Typical hash function iterates over letters of string and computes hash value of string based on ascii value of letters and some bitwise operations.

For example:

int hash( const char *str )
{
int hash_value;
for (int i=0;str[i];i++) hash_value += str[i]; return hash_value % HT_SIZE; }

If one wants to check whether string mingw-make.exe matches any expression, the partial hash value (lazy evaluation) of the word is computed and in each step it is checked in HT if there is any entry in that particular row. In each step, a new hash index is calculated based on each character from the file name - m for first step, i for second step, n for third step and so on.

In this concrete example, in step 5 a partial hash value of mingw-make.exe that matches with entry from slot 3 (mingw) is found. Therefore it will be known that word mingw-make.exe matches with mingw*. It works in the same manner for expressions like *something, but then the partial hash values are simply computed from the end.

Description of the resolution:

The method relies on splitting the file masks that files will be matched against into the following 4 categories:

File masks without any special characters - this means that the mask specifies
1.

the full file name to use; for example: java.exe, startup.jar.

File masks in the form of

* - this means that there is only a string of 0 or
2.

2

[This page contains 1 picture or other non-text object]

Page 3 of 6

more non-special ch...