
Discovering Multiword Clumps for Machine Translation

IP.com Disclosure Number: IPCOM000105286D
Original Publication Date: 1993-Jul-01
Included in the Prior Art Database: 2005-Mar-19
Document File: 4 page(s) / 83K

Publishing Venue

IBM

Related People

Brown, PF: AUTHOR [+4]

Abstract

In producing a machine translation system, it is important to make use of a list of word pairs which do not translate compositionally. An example of such a pair in English is hot dog, which one would not want to translate into French as chaud chien. In this paper, we describe a method of determining such word clumps from a large bilingual corpus of text in which words in one language have been aligned with their translations in another language. [*] In this paper, we imagine that individual words in a source language have been aligned with sets of words in a target language.

This is the abbreviated version, containing approximately 52% of the total text.

Discovering Multiword Clumps for Machine Translation

      In producing a machine translation system, it is important to
make use of a list of word pairs which do not translate
compositionally.  An example of such a pair in English is hot dog,
which one would not want to translate into French as chaud chien.
In this paper, we describe a method of determining such word clumps
from a large bilingual corpus of text in which words in one
language have been aligned with their translations in another
language.  [*]  In this paper, we imagine that individual words in a
source language have been aligned with sets of words in a target
language.
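The aligned data assumed above can be pictured concretely. The following Python sketch is illustrative only; the names and the example sentence pair are assumptions, not taken from the disclosure:

```python
# Minimal sketch of the assumed input: in each sentence pair, every
# source-language word is aligned with a set of target-language words.
# All names and the example pair are illustrative, not from the paper.

aligned_pair = {
    "source": ["the", "potato"],               # source-language tokens
    "target": ["la", "pomme", "de", "terre"],  # target-language tokens
    # source position -> set of target positions it is aligned with
    "alignment": {0: {0}, 1: {1, 2, 3}},
}

def target_words_for(pair, source_pos):
    """Return the set of target words aligned with one source word."""
    return {pair["target"][j] for j in pair["alignment"][source_pos]}
```

Note that a single source word may be aligned with several target words (or with none), which is exactly the situation that motivates clumping the target side.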

      Consider clumping together two target words t_1 and t_2.
They are clumped together if generating them as a unit leads to a
significant increase in the log probability, f, of the generation of
the target sentences from the source sentences.  Suppose there is a
large number of aligned sentences.  The following definitions are made:
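The definitions themselves are omitted from this abbreviated extract. As a substitute, the sketch below uses a generic likelihood-ratio surrogate for the clumping criterion: clump t_1 and t_2 if generating them as a single unit raises the log probability by more than a threshold. The counts, names, threshold, and the formula itself are assumptions, not the paper's exact equations:

```python
import math

def delta_f(pair_count, count1, count2, total_tokens):
    """Illustrative surrogate for the change in log probability, f, from
    generating t_1 t_2 as a single unit rather than independently.

    pair_count   -- times t_1 and t_2 occur together as a candidate unit
    count1/2     -- individual occurrence counts of t_1 and t_2
    total_tokens -- total number of target tokens in the corpus
    """
    # An independent-generation model assigns the pair this probability ...
    p_independent = (count1 / total_tokens) * (count2 / total_tokens)
    # ... while a clumped model generates the unit with its own probability.
    p_unit = pair_count / total_tokens
    # Each of the pair_count occurrences gains log(p_unit / p_independent).
    return pair_count * math.log(p_unit / p_independent)

def should_clump(pair_count, count1, count2, total_tokens, threshold=10.0):
    """Clump when the gain in log probability is significant."""
    return delta_f(pair_count, count1, count2, total_tokens) > threshold
```

For a pair like hot dog, which co-occurs far more often than independent generation of hot and dog would predict, delta_f is large and positive; for an incidental word pair it is small or negative.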

      Given the counts from the alignments, the new probabilities
were estimated in the following fashion:
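The estimation equation is also missing from this extract. One standard choice consistent with "given the counts from the alignments" is relative-frequency (maximum-likelihood) estimation, sketched here with hypothetical names:

```python
from collections import defaultdict

def estimate_translation_probs(align_counts):
    """align_counts[(s, t)]: how often source word s aligned to target t.
    Returns MLE translation probabilities p(t | s) = c(s, t) / c(s).
    An illustrative assumption, not necessarily the paper's estimator."""
    source_totals = defaultdict(float)
    for (s, _t), c in align_counts.items():
        source_totals[s] += c
    return {(s, t): c / source_totals[s]
            for (s, t), c in align_counts.items()}
```

After clumping, the same estimation is repeated with the clumped unit treated as a single target word, which is what makes the change in log probability computable from counts alone.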

      In computing the change, Δf, in the log probability of the
target sentences given the source sentences, only the effect of the
translation output probabilities was considered; the effect of the
n-probabilities and the distortion probabilities was ignored.  Δf
can be broken into three parts:

      If instead of the change to the log probability of the...