Probabilistic Method of Aligning Sentences with their Translations using Word Cognates

IP.com Disclosure Number: IPCOM000111445D
Original Publication Date: 1994-Feb-01
Included in the Prior Art Database: 2005-Mar-26
Document File: 2 page(s) / 92K

Publishing Venue

IBM

Related People

Brown, PF: AUTHOR [+4]

Abstract

Disclosed is a method for aligning sentences in parallel texts from two different natural languages. Although the method is described for a specific pair of languages, viz. French and English, any other pair could be substituted.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 50% of the total text.

      Recently, Brown et al. [1] demonstrated that it is possible to obtain
reasonably accurate alignments of sentences in bilingual corpora
using a statistical method that depends only on the number of words
in the sentences.  The same method can be applied to the number of
characters in the sentences, as shown in [2].  The method disclosed
herein combines information on the lengths of the sentences with
information about cognates in the two languages to obtain better
alignments than have been obtained previously.

      A set of cognates for a pair of languages is a pairing of some
of the words in one of the languages with some of the words in the
other language.  A set of cognates for French and English might, for
example, include pairs like (oui, yes), (non, no), (Smith, Smith),
(1991, 1991), etc.  Finding cognates may in general require some
knowledge of the two languages being considered, but, as the last two
examples in the previous sentence show, there are many cognates in the
form of personal names and sequences of digits that can be found
without any knowledge of the languages in question.  For the purposes
of the method described here, it is not necessary that the list of
cognates be exhaustive.  Even if the list of cognates is empty, the
method will perform no worse than the method of Brown et al. [1]
referred to above.
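
      As a minimal sketch of how language-independent cognates such as
(Smith, Smith) and (1991, 1991) might be collected, the following
Python fragment keeps tokens that appear identically in both texts and
are either digit sequences or capitalized words.  The function name
candidate_cognates and the capitalization heuristic are assumptions
made for illustration; the disclosure does not specify how the list of
cognates is built.

```python
import re

# Assumption: a token is a run of letters or a run of digits.
_TOKEN = re.compile(r"[^\W\d_]+|\d+")


def candidate_cognates(english_text, french_text):
    """Return identical-token pairs found in both texts.

    Hypothetical heuristic: keep only digit sequences and capitalized
    words, which stand in here for numbers and personal names.
    """
    def keep(token):
        return token.isdigit() or token[0].isupper()

    english = {t for t in _TOKEN.findall(english_text) if keep(t)}
    french = {t for t in _TOKEN.findall(french_text) if keep(t)}
    # Each cognate is a pair of words, one from each language.
    return {(t, t) for t in english & french}


pairs = candidate_cognates(
    "Mr. Smith spoke in 1991. Yes.",
    "M. Smith a parlé en 1991. Oui.",
)
```

Pairs such as (oui, yes) are not found this way; they would have to
come from a list prepared with some knowledge of the two languages.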

      The alignment of sentences in parallel French and English
corpora in terms of a sequence of beads was defined in [1].  Each bead
accounts for either 1) a single English sentence, 2) a single French
sentence, 3) an English sentence and a French sentence, 4) two
English sentences and one French sentence, or 5) one English sentence
and two French sentences.  This inventory of beads can easily be
extended, for example to beads accounting for two English and two
French sentences, but this additional generality does not add to
one's grasp of the ideas involved.  Sentences are said to be aligned
with one another if they fall in the same bead.  The alignment
problem, then, is to find the most probable sequence of beads given a
particular parallel corpus.
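
      The five bead types above can be sketched as pairs (number of
English sentences, number of French sentences).  The Python fragment
below is an illustration only: the names BEAD_TYPES, apply_beads and
count_bead_sequences are hypothetical, and the code shows what an
alignment is and how many candidate alignments exist, not how the most
probable one is chosen.

```python
from functools import lru_cache

# (English sentences, French sentences) accounted for by each bead type.
BEAD_TYPES = ((1, 0), (0, 1), (1, 1), (2, 1), (1, 2))


def apply_beads(english, french, beads):
    """Group sentences according to a sequence of beads.

    Raises ValueError if a bead type is unknown or if the beads do not
    account for every sentence exactly once.
    """
    i = j = 0
    groups = []
    for e, f in beads:
        if (e, f) not in BEAD_TYPES:
            raise ValueError(f"unknown bead type: {(e, f)}")
        groups.append((english[i:i + e], french[j:j + f]))
        i += e
        j += f
    if i != len(english) or j != len(french):
        raise ValueError("beads do not cover the corpus exactly")
    return groups


def count_bead_sequences(m, n):
    """Count bead sequences covering m English and n French sentences."""
    @lru_cache(maxsize=None)
    def count(i, j):
        if i == 0 and j == 0:
            return 1
        # Sum over every bead type that could be the last bead.
        return sum(count(i - e, j - f)
                   for e, f in BEAD_TYPES if i >= e and j >= f)
    return count(m, n)


groups = apply_beads(["e1", "e2", "e3"], ["f1", "f2"], [(2, 1), (1, 1)])
```

Here e1 and e2 are aligned with f1, and e3 with f2, because they fall
in the same bead.  The count grows quickly with corpus size, which is
why the alignment is posed as a search for the most probable sequence
rather than an enumeration.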

      Let $E_1^m = E_1, E_2, \ldots, E_m$ be the sequence of English
sentences in a parallel corpus, and let $F_1^n = F_1, F_2, \ldots, F_n$
be the corresponding sequence of French sentences.  Let $C$ be the set
of cognates for the corpus.  Assign to each pair of cognates $c_i$ a
randomly chosen integer $h_i$ between 1 and $h$, where $h$ is some
small value like 32.  The integer $h_i$ is called the hash of the
cognates in $c_i$.  Usually, many diffe...
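
      The hash assignment described above can be sketched in Python
as follows.  This covers only the assignment step; how the hashes are
used afterwards is not shown in the abbreviated text and is not
assumed here.  The function name assign_hashes and the seed parameter
are hypothetical and are included only to make the example
repeatable.

```python
import random


def assign_hashes(cognates, h=32, seed=None):
    """Map each cognate pair to a random integer between 1 and h.

    `seed` is an illustrative addition so a run can be reproduced.
    """
    rng = random.Random(seed)
    # randint includes both endpoints, matching "between 1 and h".
    return {pair: rng.randint(1, h) for pair in sorted(cognates)}


cognates = {("oui", "yes"), ("non", "no"),
            ("Smith", "Smith"), ("1991", "1991")}
hashes = assign_hashes(cognates, h=32, seed=0)
```

Whenever the set of cognates contains more than h pairs, some pairs
necessarily receive the same hash value.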