Automatic Pattern Extraction from Corpus

IP.com Disclosure Number: IPCOM000109794D
Original Publication Date: 1992-Sep-01
Included in the Prior Art Database: 2005-Mar-24
Document File: 3 page(s) / 101K

Publishing Venue

IBM

Related People

Maruyama, H: AUTHOR [+2]

Abstract

Disclosed is a device to extract patterns automatically from a corpus. The extracted patterns are frequently used phrases or commonly used expressions in the corpus. These patterns are used to improve grammar rules or to associate their corresponding translations for tailoring a machine translation system to a user's particular domain.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 52% of the total text.

       Disclosed is a device to extract patterns automatically
from a corpus.  The extracted patterns are frequently used phrases or
commonly used expressions in the corpus.  These patterns are used to
improve grammar rules or to associate their corresponding
translations for tailoring a machine translation system to a user's
particular domain.

      A pattern is defined in terms of subsequences and longest
common subsequences.  A character string Z = z(1) z(2) ... z(k),
where z(i) is a character for each subscript i, is said to be a
subsequence of the character string X = x(1) x(2) ... x(m) if and
only if there exists a strictly increasing sequence of k
subscripts i(1), i(2), ..., i(k) of X such that x(i(j)) = z(j) for
every j = 1, 2, ..., k.  For any two character strings X and Y, a
character string Z is said to be a common subsequence of X and Y if
and only if Z is a subsequence of both X and Y.  A common subsequence
Z is a longest common subsequence of X and Y if they have no common
subsequence that is longer than Z.  These notions extend
straightforwardly to a set of n character strings.  Finally, a
pattern is a longest common subsequence of two or more sentences
(character strings) in a corpus, padded with a special character @
wherever one or more contiguous characters of X or Y are missing
from Z.  By treating the variable @ as a single character that is
distinct from every other character, the longest common subsequence
can also be defined for patterns.
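
      The definitions above can be illustrated with a minimal Python
sketch.  It is not the disclosed device: the function names
(is_subsequence, lcs_alignment, pattern) and the dynamic-programming
approach are assumptions made for illustration, and the sketch handles
only two strings at a time.

```python
def is_subsequence(z: str, x: str) -> bool:
    """True if z can be obtained from x by deleting zero or more characters."""
    remaining = iter(x)
    # "ch in remaining" consumes the iterator up to the first match,
    # so the matched positions in x are strictly increasing.
    return all(ch in remaining for ch in z)


def lcs_alignment(x: str, y: str) -> list[tuple[int, int]]:
    """Index pairs (i, j) with x[i] == y[j] that spell one longest common subsequence."""
    m, n = len(x), len(y)
    # table[i][j] = length of a longest common subsequence of x[i:] and y[j:]
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if x[i] == y[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    pairs = []
    i = j = 0
    while i < m and j < n:
        if x[i] == y[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def pattern(x: str, y: str, gap: str = "@") -> str:
    """Longest common subsequence of x and y, with `gap` marking skipped characters."""
    out = []
    prev_i = prev_j = -1
    for i, j in lcs_alignment(x, y):
        # Characters were skipped in x or in y since the previous match.
        if i - prev_i > 1 or j - prev_j > 1:
            out.append(gap)
        out.append(x[i])
        prev_i, prev_j = i, j
    # Unmatched characters left over at the end of either string.
    if prev_i < len(x) - 1 or prev_j < len(y) - 1:
        out.append(gap)
    return "".join(out)


print(pattern("open the file", "open a file"))  # open @ file
```

      Because @ is compared like any other character, the same
function can be applied to two patterns.  Whether adjacent @ marks
should then be merged is not stated in the text, so the sketch leaves
them as they are.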

      The disclosed device calculates a set of patterns for every
pair of sentences, and then repeats the process of finding the longest
common subsequences of patterns until the set of patterns includes a
reasonably common set of subsequences.  This process consists of the
following steps.  Let D be a corpus, namely the set of initial
patterns, each associated with its number of occurrences.  Let C be a
given threshold value such that any pattern with C or more occurrences
is considered "common" enough, and let K be the number of
patterns to be extracted.
      1.   Let D' be a subset of D such that every pattern with C or
more occurrences is in D'.
      2.   For every pair of distinct patterns p1 and p2 whose
numbers of occurrences are both less than C, calculate the longest
common subsequence q.  If the numbers of occurrences of p1 and p2 are
m and n, respectively, let the
oc...
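
      As a rough illustration of step 1 and of the pairing in step 2,
the following Python sketch represents D as a mapping from each
pattern to its number of occurrences.  The names lcs, step1_common and
step2_candidates are hypothetical, the @ padding is omitted for
brevity, and no occurrence count is assigned to q because the rule for
it is not available in this text.

```python
from itertools import combinations


def lcs(x: str, y: str) -> str:
    """One longest common subsequence of x and y (plain, without @ padding)."""
    # prev[j] holds an LCS of the part of x processed so far and y[:j].
    prev = [""] * (len(y) + 1)
    for a in x:
        cur = [""]
        for j, b in enumerate(y):
            if a == b:
                cur.append(prev[j] + a)
            else:
                cur.append(max(prev[j + 1], cur[j], key=len))
        prev = cur
    return prev[-1]


def step1_common(d: dict[str, int], c: int) -> dict[str, int]:
    """Step 1: D', the patterns of D that already have C or more occurrences."""
    return {p: count for p, count in d.items() if count >= c}


def step2_candidates(d: dict[str, int], c: int) -> dict[tuple[str, str], str]:
    """Step 2 (pairing only): q for every pair of patterns with fewer than C occurrences."""
    rare = sorted(p for p, count in d.items() if count < c)
    return {(p1, p2): lcs(p1, p2) for p1, p2 in combinations(rare, 2)}


# Hypothetical toy corpus: pattern -> number of occurrences.
corpus = {"open the file": 3, "close the file": 1, "close a file": 1}
print(step1_common(corpus, c=2))
print(step2_candidates(corpus, c=2))
```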