Browse Prior Art Database

A Practical Algorithm for Frequent Substring Pattern Mining with Ternary Partitioning

IP.com Disclosure Number: IPCOM000019354D
Original Publication Date: 2003-Sep-12
Included in the Prior Art Database: 2003-Sep-12
Document File: 2 page(s) / 17K

Publishing Venue

IBM

Abstract

Disclosed is a novel algorithm which enumerates all substrings appearing more frequently than some threshold in a given string. In addition, this algorithm generates a data structure that is useful for browsing frequent substrings and their contexts in the original text. This algorithm is based on divide-and-conquer approach, that is, it decomposes the mining task into a set of smaller tasks by the ternary partitioning technique.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 51% of the total text.

Page 1 of 2

A Practical Algorithm for Frequent Substring Pattern Mining with Ternary Partitioning

   Disclosed is a novel algorithm which enumerates all substrings appearing more frequently than some threshold in a given string. The new approach is faster and requires less working memory than existing algorithms. In addition, this algorithm generates a data structure that is useful for browsing frequent substrings and their contexts in the original text.

First, string, substring, and other key terms are defined as follows in this disclosure. A string s is an ordered list of characters written consecutively in this disclosure. The set of the characters is denoted asΣ. In particular, $ is used for the termination character where $ is not in Σ. For any string s, s[i&Thorn;ichj] is the substring of s that begins at position i and ends at position j of s. In particular, s[i&Thorn;ichn] is the suffix of string s that begins at position i where n denotes the length of string s. For example, kuras is the substring (s[3...
7]) of the string s = sakurasaku$, and kurasaku is suffix(s[3&Thorn;10]) of s. In addition, a larger substring subi that includes the substring subj is called the super substring pattern of subj. Next, the frequent substring pattern mining problem is stated as follows. Let count(p) be the number of occurrence of a substring p in a string s. Given a positive ξ as the minimum support threshold, a substring p is a frequent substring pattern of s if count(p) is greater or equal thanξ. The frequent substring pattern mining problem is the enumeration of all frequent substring patterns in s. For instance, if the string s = sakurasaku$ and the thresholdξ = 2 then all frequent substring patterns are a, ak, aku, k, ku, s, sa, sak, saku, and u. The main idea of the proposed algorithm is based on a divide-and-conquer approach, that is, it recursively decomposes the mining task into a set of smaller tasks using the ternary partitioning technique [1]. Let S = [suffix1,&Thorn;, suffixn] be an array of all suffixes (s[1&Thorn;n], s[2&Thorn;ochn], &Thorn;, s[n - 1&Thorn;n], s[n]) in a string s. The S is divided based on the dth character of suffixes into smaller arrays. In other word, given a partition value v, S is divided into 3 smaller arrays, S=, S<, and S> where S= includes suffixes whose dth character equals v, S< includes suffixes whose dth character is lexicographically smaller than v, and S> includes suffixes whose dth character is larger than v. Then, if the size of S, n=, is greater than or equal to a minimum support threshold ξ (i.e. n= is greater than or equal to ξ), it is a frequent substring pattern that begins at position 1 and ends at position d of suffixi in S=. This process is recursively invoked for S<, S> based on the dth character and for S= based on the d+1th character. The recursion will stop if the size of the partitioned arrays (n<, n...