Statistical Property of Japanese Phrase to Phrase Modifications

IP.com Disclosure Number: IPCOM000122526D
Original Publication Date: 1991-Dec-01
Included in the Prior Art Database: 2005-Apr-04
Document File: 4 page(s) / 142K

Publishing Venue

IBM

Related People

Maruyama, H: AUTHOR [+2]

Abstract

Japanese phrases tend to attach to constituents that immediately follow them. Therefore, Japanese parsers are usually designed to generate right-branching trees unless syntactic and semantic information suggest otherwise [1]. This report describes our attempt to obtain statistical support for the above heuristic rule. We have found that the distribution of the modification distance follows the generalized formula of Zipf's law [2].

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 52% of the total text.

      Japanese phrases tend to attach to constituents that
immediately follow them. Therefore, Japanese parsers are usually
designed to generate right-branching trees unless syntactic and
semantic information suggest otherwise [1]. This report describes our
attempt to obtain statistical support for the above heuristic rule.
We have found that the distribution of the modification distance
follows the generalized formula of Zipf's law [2].
Range and Distance

      We assume that a Japanese sentence is a sequence of bunsetsu,
which can be roughly translated into English as phrases.  Every
phrase other than the final one modifies another phrase on its right,
so the candidates for modification by the i-th phrase are i+1, i+2,
..., n, where n is the number of phrases in the sentence.  When the
i-th phrase modifies the (i+k)-th phrase, we call k the distance of
this modification (1<=k<=n-i). The maximum value of the distance for
the phrase (i.e., n-i) is called its range.  In the following sentence,
both the distance and the range of the first phrase ("watashi-wa")
are 3.
watashi-wa   sakana-wo   isoide     tabe-ta.
 (I+subj)   (fish+dobj) (quickly)  (eat+past)
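
The definitions above can be illustrated with a minimal Python sketch. The function name `distance_and_range` and the list encoding of a parse are hypothetical, chosen only for illustration. The text states only that "watashi-wa" modifies the fourth phrase; the attachments of "sakana-wo" and "isoide" to "tabe-ta" are assumed here.

```python
from typing import List, Tuple


def distance_and_range(heads: List[int]) -> List[Tuple[int, int]]:
    """Return (distance, range) for every non-final phrase of one sentence.

    heads[i-1] is the 1-based position of the phrase modified by phrase i.
    The final phrase modifies nothing, so it has no entry in heads.
    """
    n = len(heads) + 1  # number of phrases in the sentence
    result = []
    for i, head in enumerate(heads, start=1):
        # A phrase may only modify a phrase to its right: i < head <= n.
        if not i < head <= n:
            raise ValueError(f"phrase {i} cannot modify phrase {head}")
        result.append((head - i, n - i))  # (distance k, range n-i)
    return result


# watashi-wa  sakana-wo  isoide  tabe-ta
# Assumed parse: all three non-final phrases modify phrase 4 (tabe-ta).
example = distance_and_range([4, 4, 4])
print(example)
```

Under this assumed parse, the first entry of `example` is the pair (3, 3), matching the distance and range given for "watashi-wa".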
Distance and Frequency

      The sample data that we used consists of a corpus of 1,091
sentences taken from newspaper articles.  We have manually created a
correct parse tree (modifier-modified relationships) for each
sentence.  The total number of phrases was 9,443, and the total
number of modifications was 8,352.  The average number of phrases in
a sentence was 8.7.
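
The corpus figures can be cross-checked with a few lines of Python. This is only a consistency check, assuming (as stated in the previous section) that every phrase except the final one in each sentence takes part in exactly one modification as a modifier; the variable names are illustrative.

```python
sentences = 1091
phrases = 9443

# Each sentence contributes one final phrase that modifies nothing,
# so the number of modifications is phrases minus sentences.
modifications = phrases - sentences

# Average sentence length in phrases.
average_length = phrases / sentences

print(modifications, round(average_length, 1))
```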

      If the range of a phrase is equal to one, its distance is also
one. Therefore, the distribution of the modification distance is
affected by the range.  To keep the range from distorting the
distance distribution, we counted the number of modification
occurrences as a function of both distance and range.  Fig. 1 shows
the distribution of the distance when the range is from 2 to 12.
Confirming our intuition, a large portion of the phrases modify their
immediate neighbor (that is, distance=1).  In general, the
probability of modification decreases as the distance becomes
greater, but the last phrase (usually a predicate) has a strong
tendency to be modified by a phrase (that is, distance=range).
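
The counting described above can be sketched as follows. This is a minimal illustration, not the procedure actually used for the report: the names `tabulate` and `proportions`, the list encoding of parses, and the three-sentence toy corpus are all hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List


def tabulate(corpus: Iterable[List[int]]) -> Dict[int, Counter]:
    """Count modifications as a function of range and distance.

    Each sentence is a list of heads: heads[i-1] is the 1-based position
    of the phrase modified by phrase i (the final phrase has no entry).
    The result maps range -> Counter(distance -> number of occurrences).
    """
    counts: Dict[int, Counter] = defaultdict(Counter)
    for heads in corpus:
        n = len(heads) + 1
        for i, head in enumerate(heads, start=1):
            counts[n - i][head - i] += 1
    return counts


def proportions(counts: Dict[int, Counter]) -> Dict[int, Dict[int, float]]:
    """Convert the counts for each range into proportions per distance."""
    result = {}
    for rng, row in counts.items():
        total = sum(row.values())
        result[rng] = {d: c / total for d, c in sorted(row.items())}
    return result


# Hypothetical toy corpus of three parsed sentences (not the report's data).
toy_corpus = [[4, 4, 4], [2, 4, 4], [2, 3]]
toy_counts = tabulate(toy_corpus)
toy_props = proportions(toy_counts)
print(toy_props)
```

Grouping by range first means that phrases with range 1, whose distance is necessarily 1, do not inflate the share of distance-1 modifications for phrases with larger ranges.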

      A closer examination of the figures showed that the proportion
of modifications at a given distance is roughly identical regardless
of the range.  For instance, for distance=1, the percentages of
modifications are 63.1, 54.8, 56.9, 54.3, 57.6, 52.8, ..., so they
are around 56%.  For distance=2, the
perce...