Gap Recurrence: A Lexicostatistical Measure

IP.com Disclosure Number: IPCOM000128535D
Original Publication Date: 1975-Dec-31
Included in the Prior Art Database: 2005-Sep-16
Document File: 9 page(s) / 32K

Publishing Venue

Software Patent Institute

Related People

Jay Leavitt: AUTHOR

Abstract

For the past year we have been at work on a lexicostatistical study which we believe may prove to be of considerable significance. Traditional distributional studies have relied upon two measures of the relative richness of vocabulary of a given corpus: (i) type-token ratios (TTR) and (ii) Yule's K (characteristic). But investigation by Wachal and Spreen (1970) has shown that only mean segmental TTR is at all reliable in projecting from a sample to a population; and while Yule's K is said to be independent of the length of the text being studied, it has been the subject of attack (Ross, 1950) and of redefinition (Herdan, 1955). We have developed a number of related alternative measures, each of which complements TTRs and Yule's K. Preliminary results suggest that these measures will have the following advantages: (1) they perform essentially the same function as the TTR; (2) they extrapolate from the sample to the whole consistently; (3) they contain other information, including a rate of richness measure; and (4) they may prove to be objective measures of one aspect of what we mean by "style" in language. The study of what we call Gap Recurrence has been undertaken before on a small scale, especially for the study of alliterative patterns. But what we are undertaking is both a horizontal (diachronic) and vertical (synchronic) investigation of the 'clustering' characteristics of natural-language phenomena. We are not, however, interested in the way that different words habitually cluster or 'collocate', but in the way in which the same word 'clusters' or does not in a given corpus: information unavailable through any of the standard measures heretofore described.

This is the abbreviated version, containing approximately 13% of the total text.

Gap Recurrence: A Lexicostatistical Measure

by

Jay Leavitt and Larry Mitchell

Department of Computer Science, Institute of Technology, University of Minnesota, Minneapolis, Minnesota 55455

Technical Report 75-8, April 1975. Cover courtesy of Ruth and Jay Leavitt.

Jay Leavitt, Computer Science Department, and Larry Mitchell, English Department, University of Minnesota, Minneapolis, Minnesota

GAP RECURRENCE: A LEXICOSTATISTICAL MEASURE

Since the summer of 1974 we have been at work on the development of a lexicostatistical measure, which our research to date suggests may be of considerable sensitivity. Traditional distributional studies have relied heavily upon two measures of the relative 'richness' of the vocabulary of a given corpus: (i) type-token ratio (TTR) and (ii) Yule's characteristic (K). Because the raw TTR is sample-size dependent, a number of alternatives have been developed, each of which offers different advantages. Most recently, investigation by Wachal and Spreen (1970) has shown that only mean segmental type-token ratio (MSTTR), an average of the TTRs in consecutive samples of the same size, is at all reliable in projecti...
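
The measures named above can be illustrated with a short Python sketch. It is not taken from the report: the function names, the whitespace tokenization, and the choice to discard a trailing partial segment in MSTTR are all assumptions made for illustration. Yule's K is computed with the commonly cited form K = 10^4 (sum of m^2 V_m - N) / N^2, where V_m is the number of word types occurring exactly m times and N is the number of tokens. The last function, same_word_gaps, is hypothetical: this excerpt does not define the gap recurrence measure itself, so the function only shows the raw quantity the abstract points to, namely the distances between successive occurrences of the same word.

```python
from collections import Counter, defaultdict


def ttr(tokens):
    """Raw type-token ratio: distinct words (types) / running words (tokens)."""
    return len(set(tokens)) / len(tokens)


def msttr(tokens, segment_size):
    """Mean segmental TTR: average TTR over consecutive equal-sized segments.

    Assumption of this sketch: a trailing partial segment is discarded.
    """
    n_segments = len(tokens) // segment_size
    if n_segments == 0:
        raise ValueError("text is shorter than one segment")
    ratios = [
        ttr(tokens[i * segment_size:(i + 1) * segment_size])
        for i in range(n_segments)
    ]
    return sum(ratios) / n_segments


def yules_k(tokens):
    """Yule's K = 10^4 * (sum_m m^2 * V_m - N) / N^2."""
    n = len(tokens)
    freq = Counter(tokens)             # word -> number of occurrences
    spectrum = Counter(freq.values())  # m -> V_m (types occurring m times)
    s2 = sum(m * m * v_m for m, v_m in spectrum.items())
    return 1e4 * (s2 - n) / (n * n)


def same_word_gaps(tokens):
    """Hypothetical helper: for each repeated word, the distances (in tokens)
    between its successive occurrences. Words occurring once have no gaps."""
    last_seen = {}
    gaps = defaultdict(list)
    for pos, word in enumerate(tokens):
        if word in last_seen:
            gaps[word].append(pos - last_seen[word])
        last_seen[word] = pos
    return dict(gaps)


# Toy corpus, tokenized on whitespace (an assumption; real studies would
# need a more careful definition of "word").
tokens = "the cat saw the dog and the dog saw the cat".split()

print("TTR:   ", ttr(tokens))
print("MSTTR: ", msttr(tokens, segment_size=5))
print("Yule K:", yules_k(tokens))
print("Gaps:  ", same_word_gaps(tokens))
```

On this toy corpus the raw TTR is computed over all 11 tokens, while MSTTR with a segment size of 5 averages two full segments and ignores the final leftover token. The gap lists make visible what the frequency-based measures cannot: two words with the same frequency can recur at very different spacings.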