An Algorithm for Detecting High Frequency Strings in Data

IP.com Disclosure Number: IPCOM000013492D
Original Publication Date: 2001-Jun-16
Included in the Prior Art Database: 2003-Jun-18

Publishing Venue

IBM

Abstract

The Problem Solved

Data, whether contained in files, processed by programs, or sent between computers, often takes the form of strings of characters. The problem addressed here is that of finding the most frequently occurring strings in data streams of unlimited length. Most search algorithms focus on finding predefined strings: they take two arguments, the input data and the search string, and look for occurrences of the search string in the input data. By contrast, the problem solved here takes a single input argument: the data to be searched. The objective is to detect the "most important" strings in the data, where the criterion for importance is a combination of the length of the strings found, the number of times they occur, and possibly how recently they were found. It should be stressed that the "most important" strings are not known prior to execution. This type of algorithm has applications in data mining, eBusiness market research, request systems with adaptive learning, chain letter detection, certain forms of virus signature detection, automated eBusiness audit techniques, and automated acronym searching and updating.
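To make the problem statement concrete, the following is a minimal brute-force sketch in Python. It is not the algorithm of this disclosure, which the abstract does not describe. The function name `important_strings`, its parameters (`min_len`, `max_len`, `decay`, `min_count`, `top`), and the scoring formula (length multiplied by a recency-weighted occurrence count) are all assumptions chosen only to illustrate the single-argument interface and the three importance criteria named above.

```python
from collections import defaultdict


def important_strings(stream, min_len=3, max_len=8, decay=1.0,
                      min_count=2, top=5):
    """Rank repeated substrings by length * recency-weighted count.

    Illustrative assumption only: the names, defaults, and scoring
    formula are hypothetical and are not taken from the disclosure.
    `decay` in (0, 1] discounts older occurrences; 1.0 disables recency.
    """
    weights = defaultdict(float)  # substring -> decayed occurrence weight
    counts = defaultdict(int)     # substring -> raw occurrence count
    last_seen = {}                # substring -> position of last occurrence
    window = ""
    pos = -1                      # stays -1 if the stream is empty

    for pos, ch in enumerate(stream):
        # Keep only the most recent max_len characters of the stream.
        window = (window + ch)[-max_len:]
        # Record every candidate substring that ends at this position.
        for n in range(min_len, len(window) + 1):
            s = window[-n:]
            age = pos - last_seen.get(s, pos)
            weights[s] = weights[s] * decay ** age + 1.0
            counts[s] += 1
            last_seen[s] = pos

    # Age every weight to the end of the stream, then score and rank.
    scored = [
        (s, len(s) * w * decay ** (pos - last_seen[s]))
        for s, w in weights.items()
        if counts[s] >= min_count
    ]
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top]


if __name__ == "__main__":
    for text, score in important_strings("abcXabcYabcZ", max_len=3):
        print(text, score)
```

Note that the function takes only the data to be searched; no search string is supplied. This sketch keeps every candidate substring in memory, so it does not by itself meet the requirement of handling streams of unlimited length. A practical version would need some bounded-memory policy, such as evicting low-weight entries.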