Dismiss
InnovationQ will be updated on Sunday, Oct. 22, from 10am ET - noon. You may experience brief service interruptions during that time.
Browse Prior Art Database

An Algorithm for Detecting High Frequency Strings in Data

IP.com Disclosure Number: IPCOM000013492D
Original Publication Date: 2001-Jun-16
Included in the Prior Art Database: 2003-Jun-18
Document File: 6 page(s) / 44K

Publishing Venue

IBM

Abstract

The Problem solved Data, whether contained in files, processed by programs or being sent between computers, can often take the form of strings of characters. The problem addressed here is that of finding the most frequently occurring strings in data streams of unlimited length. Most search algorithms focus on finding pre-defined strings. They take two arguments, the input data and the search string and they look for occurrences of the search string in the input data. By contrast the problem solved here takes a single input argument; the data to be searched. The objective is to detect the "most important" strings in the data, where the criteria for being important is a combination of the length of the strings found, the number of times they occur and possibly how recently they have been found. It should be stressed that the "most important" strings are not known prior to execution. This type of algorithm has application in data mining, eBusiness market research, request systems with adaptive learning, chain letter detection, certain forms of virus signature detection, automated eBusiness audit techniques and automated acronym searching and updating.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 28% of the total text.

Page 1 of 6

An Algorithm for Detecting High Frequency Strings in Data

The Problem solved

Data, whether contained in files, processed by programs or being
sent between computers, can often take the form of strings of
characters. The problem addressed here is that of finding the
most frequently occurring strings in data streams of unlimited
length.

    Most search algorithms focus on finding pre-defined strings.
They take two arguments, the input data and the search string and
they look for occurrences of the search string in the input data.
By contrast the problem solved here takes a single input
argument; the data to be searched. The objective is to detect the
"most important" strings in the data, where the criteria for
being important is a combination of the length of the strings
found, the number of times they occur and possibly how recently
they have been found. It should be stressed that the "most
important" strings are not known prior to execution.

    This type of algorithm has application in data mining,
eBusiness market research, request systems with adaptive
learning, chain letter detection, certain forms of virus
signature detection, automated eBusiness audit techniques and
automated acronym searching and updating.

The Solution Overview

    The present algorithm is an adaptation of the well known
Limpel-Ziv data compression algorithm [1] and its extension the
LZW algorithm [2].

    The basic idea is to parse the input string in a single
pass, building a table of strings found. The strings are encoded
such that a fixed length encoding represents arbitrary length
strings. Such a technique will work well with modest length input
strings. However once all the table locations have been used, the
algorithm potentially moves into a "saturated phase" where no
new strings can be encoded. While this is may not be a
significant problem for data compression algorithms it is a
fundamental problem for search algorithms and the present
invention overcomes the saturation effect by introducing a table
reuse technique. As a consequence the input can be a data stream
of unlimited length.

Detail of Solution

    Like the prior art, the fixed length encoding is achieved
by considering a string ('wK') as consisting of two parts:

A prefix string 'w' of zero or more characters and
The latest character 'K'.

    Assume we have a fixed length coding for 'w' called c. Then
the coding for 'wK' is constructed by first taking a 'hash' of c
and K , where the resulting hash is an index into a fixed length

1

Page 2 of 6

table - the hash table. (NB the concept of hashing is well know
and provides an efficient software approximation of an
associative memory. Hashing involves the algorithmic combination
of c, K and the state of the hash table, to produce the index
into the hash table.)

    In the prior art, for each new index into the hash table, a
code 'cc' is assigned. This is usually an integer, where new
codes are obtained by simply incrementing a 'code' variable.
These 'cc' values are written into the hash table at the location
corre...