
Gap Recurrence: A Lexicostatistical Measure

IP.com Disclosure Number: IPCOM000128535D
Original Publication Date: 1975-Dec-31
Included in the Prior Art Database: 2005-Sep-16
Document File: 9 page(s) / 32K

Publishing Venue

Software Patent Institute

Related People

Jay Leavitt: AUTHOR

Abstract

For the past year we have been at work on a lexicostatistical study which we believe may prove to be of considerable significance. Traditional distributional studies have relied upon two measures of the relative richness of vocabulary of a given corpus: (i) type-token ratios (TTR) and (ii) Yule's K (characteristic). But investigation by Wachal and Spreen (1970) has shown that only mean segmental TTR is at all reliable in projecting from a sample to a population; and while Yule's K is said to be independent of the length of the text being studied, it has been the subject of attack (Ross, 1950) and of redefinition (Herdan, 1955). We have developed a number of related alternative measures, each of which complements TTRs and Yule's K.

Preliminary results suggest that these measures will have the following advantages: (1) they perform essentially the same function as the TTR; (2) they extrapolate from the sample to the whole consistently; (3) they contain other information, including a rate of richness measure; and (4) they may prove to be objective measures of one aspect of what we mean by "style" in language.

The study of what we call Gap Recurrence has been undertaken before on a small scale, especially for the study of alliterative patterns. But what we are undertaking is both a horizontal (diachronic) and vertical (synchronic) investigation of the 'clustering' characteristics of natural-language phenomena. We are not, however, interested in the way that different words habitually cluster or 'collocate', but in the way in which the same word 'clusters' or does not in a given corpus: information unavailable through any of the standard measures heretofore described.

GAP RECURRENCE: A LEXICOSTATISTICAL MEASURE

Since the summer of 1974 we have been at work on the development of a lexicostatistical measure, which our research to date suggests may be of considerable sensitivity. Traditional distributional studies have relied heavily upon two measures of the relative 'richness' of the vocabulary of a given corpus: (i) type-token ratio (TTR) and (ii) Yule's characteristic (K). Because the raw TTR is sample-size dependent, a number of alternatives have been developed, each of which offers different advantages. Most recently, investigation by Wachal and Spreen (1970) has shown that only mean segmental type-token ratio (MSTTR), an average of the TTRs in consecutive samples of the same size, is at all reliable in projecting from a sample to a population. And Yule's K, though not sample-size dependent, has been the subject of attack (Ross, 1950) and of redefinition (Herdan, 1955, 1956).

Our family of measures complements Yule's K and the variations on TTR insofar as: (1) it is also a measure of distributional richness; (2) it contains other information, including a rate of richness measure, which gives additional characteristics of the distributional profile; (3) it appears that, with some adjustment for overlapping samples, the extrapolation from the sample to the whole is relatively consistent; (4) it may prove to be an objective measure of one aspect of what we mean by 'style' in language. For example, preliminary work suggests that we may be able to quantify native-speaker perception of relative alliteration in prose and poetry (see Hacker and Leavitt, 1975); and we may even be able to discriminate among possible properties of language as such, e.g. the relative distribution of prepositions and determiners, versus possible corpus-specific properties, e.g. the distribution and manner of distribution of nouns and verbs.

The study of what we call GAP RECURRENCE has been undertaken before on a small scale, especially for the study of alliterative patterns (see Skinner, 1939, 1941; Spang-Hanssen, 1956; Wright, 1974). But since Bailey (1971) has, not without reason, expressed grave doubts about the viability of a number of lexical measures, including gap distribution, we should probably mention at this point that our gap measure is, as far as we can judge, rather different from all previous versions of gapping, a fact which should become clear in the course of our presentation. Ultimately our goal is to find both horizontal (diachronic) and vertical (synchronic) 'clustering' characteristics of natural-language phenomena. We are not, however, interested in the way that different words habitually cluster or 'collocate' (see Firth, 1957; Berry-Rogghe, 1973, 1974), but in the way in which the same word 'clusters' or does not in a given corpus: information unavailable through any of the standard measures heretofore described.
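The measures named above can be illustrated with a minimal Python sketch. The TTR and MSTTR functions follow the definitions given in the text; Yule's K uses the commonly cited form K = 10^4 (sum of squared word frequencies - N) / N^2, which is assumed here rather than taken from this excerpt. The function recurrence_gaps is a hypothetical helper that only records the distance between successive occurrences of the same word; the authors' own gap measure is not defined in this excerpt and may well differ. All function names, and the choice to drop a trailing partial segment in MSTTR, are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def ttr(tokens):
    """Raw type-token ratio: number of distinct words / number of words."""
    if not tokens:
        raise ValueError("empty token list")
    return len(set(tokens)) / len(tokens)


def msttr(tokens, segment_size):
    """Mean segmental TTR: the average TTR over consecutive,
    non-overlapping segments of equal size.
    Assumption: a trailing partial segment is dropped."""
    n_segments = len(tokens) // segment_size
    if n_segments == 0:
        raise ValueError("text is shorter than one segment")
    ratios = [
        ttr(tokens[i * segment_size:(i + 1) * segment_size])
        for i in range(n_segments)
    ]
    return sum(ratios) / n_segments


def yules_k(tokens):
    """Yule's characteristic K in its commonly cited form:
    K = 10^4 * (sum of squared word frequencies - N) / N^2."""
    n = len(tokens)
    if n == 0:
        raise ValueError("empty token list")
    sum_sq = sum(f * f for f in Counter(tokens).values())
    return 1e4 * (sum_sq - n) / (n * n)


def recurrence_gaps(tokens):
    """Hypothetical illustration only: for each word that recurs, list the
    distances (in tokens) between its successive occurrences."""
    last_seen = {}
    gaps = defaultdict(list)
    for position, token in enumerate(tokens):
        if token in last_seen:
            gaps[token].append(position - last_seen[token])
        last_seen[token] = position
    return dict(gaps)


sample = "the cat sat on the mat the end".split()
print(ttr(sample))              # 0.75  (6 types / 8 tokens)
print(msttr(sample, 4))         # 0.875 (mean of 1.0 and 0.75)
print(yules_k(sample))          # 937.5
print(recurrence_gaps(sample))  # {'the': [4, 2]}
```

In this toy sample only "the" recurs, so it is the only word with gaps. Short gaps for a word would indicate that its occurrences cluster, which is the kind of same-word information that TTR and K alone do not expose.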

This is the abbreviated version, containing approximately 13% of the total text.


Gap Recurrence: A Lexicostatistical Measure

by

Jay Leavitt and Larry Mitchell

Department of Computer Science, Institute of Technology, University of Minnesota, Minneapolis, Minnesota 55455

Technical Report 75-8, April 1975

Cover courtesy of Ruth and Jay Leavitt

Jay Leavitt, Computer Science Department; Larry Mitchell, English Department; University of Minnesota, Minneapolis, Minnesota
