# A SIMILARITY MEASURE BETWEEN TWO STRINGS

Original Publication Date: 1978-Dec-31

Included in the Prior Art Database: 2005-Sep-16

## Publishing Venue

Software Patent Institute

## Related People

Nicholas V. Findler: AUTHOR [+3]

## Abstract

A plausible similarity measure is presented for quantita- tively comparing two strings, i.e. two linearly ordered sets of elements. The strings can be of different lengths, the elements come from a single alphabet and an element may appear any number of times. The limiting values of the similarity measure are 0, when two completely different strings are com- pared, and 1, when the two strings are identical. Applications of such a measure are numerous in non-numeri-cal computations, such as in heuristic search procedures in associative networks, pattern recognition and classification, game playing programs, and in music and text analysis.

**This text was extracted from a PDF file.**

**This is the abbreviated version, containing approximately 51% of the total text.**

__Page 1 of 4__THIS DOCUMENT IS AN APPROXIMATE REPRESENTATION OF THE ORIGINAL.

**A SIMILARITY MEASURE BETWEEN TWO STRINGS **

Nicholas V. Findler

December, 1976

Technical Report Number 117

**ABSTRACT **

A plausible similarity measure is presented for quantita- tively comparing two strings, i.e. two linearly ordered sets of elements. The strings can be of different lengths, the elements come from a single alphabet and an element may appear any number of times. The limiting values of the similarity measure are 0, when two completely different strings are com- pared, and 1, when the two strings are identical.

Applications of such a measure are numerous in non-numeri-cal computations, such as in heuristic search procedures in associative networks, pattern recognition and classification, game playing programs, and in music and text analysis.

**The definition of a similarity measure **

Let there be two strings, s 1 1 and s 2f consisting m and n ordered elements,, respectively. The elements come from the same alphabet which can be considered either finite or infinite. Each element may occur in a string an arbitrary number of times, up to the total length of the string in question. Without any loss of generality, we assume that m

(Equation Omitted)

which starts with the j-th element from the left and is

(Equation Omitted)

elements long. Let this substring occur p(jj) times (counting overlapping ones, too) in s and q(j,t)
times in s 2* (Note that the argument j in q(j,-L) is an index in string s 1 and not in s 2* In other
words, the substring which'starts with the j-th element of s 1 may occur in any position within s 2
and it occurs q times there.) The range of j is j=1,2 .... m and a substring starting with the j-th

element may have a length -f=1,2 .... (m-j+l). So, s 1 of length m has m substrings of length 1:
(m-1) substrings of length

(Equation Omitted)

two substrings of length

(Equation Omitted)

State University of New York at Buffalo Page 1 Dec 31, 1978

__Page 2 of 4__A SIMILARITY MEASURE BETWEEN TWO STRINGS

and one substring of length m: /a 11a2l ... a M/-

We can now define the similarity measure between s 1 and s 2 as

(Equation Omitted)

Here, the normalizing factor, (R(s 1'si ).R(s 21 s 2)] 2 makes

sure that the similarity measure between two identical strings equals 1. One can write, for example, for s:

(Equation Omitted)

An explanation of (1) is in order here. The first summation takes the contributions of substrings of different lengths, from 1 to m. The second summation considers sub-strings of s 1 starting at all possible element positions,

from I to (m-f+1). The symbol 1* means that we must consider a substring in s only at its leftmost occurrence; multiple occurrences are taken care of by the frequency

indicator p(j,-f).

The summand can be written as

(Equation Omitted)

where

(Equation Omitted)

is a weighting factor. It assumes the value 0, when s does not contain a particular substring of s 1 (then q=O), and the value 1, when s 1 and S' 2 contain the substri...