# One Pass Discovery of Repeated Substrings of a String

Original Publication Date: 1972-Mar-01

Included in the Prior Art Database: 2005-Feb-24

## Publishing Venue

IBM

## Related People

## Abstract

These two variants of an algorithm, given positive integers A and L, will find (1) all length L substrings of an input string S over an alphabet of A symbols; and (2) for each such substring found, the number of occurrences of that substring in S and the positions in S at which that substring begins. The algorithm requires a number of steps which is linear in the length of S (i number of positions in S), independent of the parameter L, providing that each of the following is counted as a primitive step: (1) Executing a PUSH on a pushdown list; and either (2) Accessing randomly any of (A+1).A/L/ locations; or (3) (i) Accessing randomly any of AL locations, and (ii) computing A.a+b (mod A/L/) for any aEpsilon{0,..., A/L/-1} and bepsilon(0,..., A/L/-1) and bepsilon (0,.... A-1).

**This text was extracted from a PDF file.**

**At least one non-text object (such as an image or picture) has been suppressed.**

**This is the abbreviated version, containing approximately 51% of the total text.**

__Page 1 of 4__**One Pass Discovery of Repeated Substrings of a String **

These two variants of an algorithm, given positive
integers A and L, will find (1) all length L substrings of an input
string S over an alphabet of A symbols; and (2) for each such
substring found, the number of occurrences of that substring in S and
the positions in S at which that substring begins. The algorithm
requires a number of steps which is linear in the length of S (i
number of positions in S), independent of the parameter L, providing
that each of the following is counted as a primitive step:

(1) Executing a PUSH on a pushdown list;

and either

(2) Accessing randomly any of (A+1).A/L/ locations;

or

(3) (i) Accessing randomly any of AL locations, and (ii)
computing A.a+b (mod A/L/) for any aEpsilon{0,..., A/L/-1}
and bepsilon(0,..., A/L/-1) and bepsilon (0,.... A-1).

In view of the rate of growth of A/L/ as a function of L, the proposed algorithm is probably most practical when L is not too big.

The input/output behavior of the proposed algorithm is illustrated by the following example. A = 2 Parameters: L = 3 Input: S = 0100001101010 Output: Strings Positions 000 3,4 001 5 010 1,9,11 011 6 100 2 101 8,10 110 7.

An additional pass over the output will suffice to filter out overlapping repetitions, if such is desired.

The basic algorithm is most easily described in terms of the following labeled directed graph. First, the parameters A and L are Vertex Set: V = {0,1,2,..., A/L/- 1}.

Edge Set: For each vertex vEpsilonV and each integer nEpsilon{0,..., A-1}, there is a directed edge labeled n from node v to node A-v+n(mod A/L/).

Referring to the above example (1), where A = 2, L = 3, the graph appears as follows:

(Image Omitted)

Assume that each node in the graph has associated with it a bucket capable of holding an arbitrary finite set of integers.

Now index the (cardinality A) alphabet over which our input strings will be encoded as (omega(0),..., omega(A-1)). Associate with each symbol omega its index absolute value of omega(i) = i.

Given any input string S, let S(k) denote the k/th/ symbol of S, so that absolute value of S(k) is the index of this symbol.

In summary, the steps of the algorithm are: STEP 1

1

__Page 2 of 4__(1) Read the first L symbols of S. Interpret them as the base

A representation of an integer c as follows:

(Image Omitted)

(2) Set the "position counter" P to L+1.

(3) Go to STEP 2.

STEP 2

(1) Drop the current value (P-L) into the bucket of node c,

using the current value of c.

(2) Go to STEP 3.

STEP 3

(1) If P exceeds the length of S, then EXIT.

(2) Else, respecify c to A.c+ absolute value of S(P) (mod A/L/).

This is equivalent to respecifying c by following
the directed edge labeled absolute value of S(P) from node c.

(3) Respecify P to P+1.

(4) Go to STEP 2.

After EXITing, if one associates with each node c in the graph the (unique) length L base A representation of c, then the bucket of c contains all positions of S at which that string begins.

Some c...