Dismiss
InnovationQ will be updated on Sunday, Oct. 22, from 10am ET - noon. You may experience brief service interruptions during that time.
Browse Prior Art Database

A SIMILARITY MEASURE BETWEEN TWO STRINGS

IP.com Disclosure Number: IPCOM000128594D
Original Publication Date: 1978-Dec-31
Included in the Prior Art Database: 2005-Sep-16
Document File: 4 page(s) / 16K

Publishing Venue

Software Patent Institute

Related People

Nicholas V. Findler: AUTHOR [+3]

Abstract

A plausible similarity measure is presented for quantita- tively comparing two strings, i.e. two linearly ordered sets of elements. The strings can be of different lengths, the elements come from a single alphabet and an element may appear any number of times. The limiting values of the similarity measure are 0, when two completely different strings are com- pared, and 1, when the two strings are identical. Applications of such a measure are numerous in non-numeri-cal computations, such as in heuristic search procedures in associative networks, pattern recognition and classification, game playing programs, and in music and text analysis.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 51% of the total text.

Page 1 of 4

THIS DOCUMENT IS AN APPROXIMATE REPRESENTATION OF THE ORIGINAL.

A SIMILARITY MEASURE BETWEEN TWO STRINGS

Nicholas V. Findler

December, 1976

Technical Report Number 117

ABSTRACT

A plausible similarity measure is presented for quantita- tively comparing two strings, i.e. two linearly ordered sets of elements. The strings can be of different lengths, the elements come from a single alphabet and an element may appear any number of times. The limiting values of the similarity measure are 0, when two completely different strings are com- pared, and 1, when the two strings are identical.

Applications of such a measure are numerous in non-numeri-cal computations, such as in heuristic search procedures in associative networks, pattern recognition and classification, game playing programs, and in music and text analysis.

The definition of a similarity measure

Let there be two strings, s 1 1 and s 2f consisting m and n ordered elements,, respectively. The elements come from the same alphabet which can be considered either finite or infinite. Each element may occur in a string an arbitrary number of times, up to the total length of the string in question. Without any loss of generality, we assume that m

(Equation Omitted)

which starts with the j-th element from the left and is

(Equation Omitted)

elements long. Let this substring occur p(jj) times (counting overlapping ones, too) in s and q(j,t) times in s 2* (Note that the argument j in q(j,-L) is an index in string s 1 and not in s 2* In other words, the substring which'starts with the j-th element of s 1 may occur in any position within s 2 and it occurs q times there.) The range of j is j=1,2 .... m and a substring starting with the j-th
element may have a length -f=1,2 .... (m-j+l). So, s 1 of length m has m substrings of length 1: (m-1) substrings of length

(Equation Omitted)

two substrings of length

(Equation Omitted)

State University of New York at Buffalo Page 1 Dec 31, 1978

Page 2 of 4

A SIMILARITY MEASURE BETWEEN TWO STRINGS

and one substring of length m: /a 11a2l ... a M/-

We can now define the similarity measure between s 1 and s 2 as

(Equation Omitted)

Here, the normalizing factor, (R(s 1'si ).R(s 21 s 2)] 2 makes

sure that the similarity measure between two identical strings equals 1. One can write, for example, for s:

(Equation Omitted)

An explanation of (1) is in order here. The first summation takes the contributions of substrings of different lengths, from 1 to m. The second summation considers sub-strings of s 1 starting at all possible element positions,

from I to (m-f+1). The symbol 1* means that we must consider a substring in s only at its leftmost occurrence; multiple occurrences are taken care of by the frequency

indicator p(j,-f).

The summand can be written as

(Equation Omitted)

where

(Equation Omitted)

is a weighting factor. It assumes the value 0, when s does not contain a particular substring of s 1 (then q=O), and the value 1, when s 1 and S' 2 contain the substri...