Minimum Redundancy Parts-Of-Speech Data Storage Technique

IP.com Disclosure Number: IPCOM000042320D
Original Publication Date: 1984-May-01
Included in the Prior Art Database: 2005-Feb-03
Document File: 1 page(s) / 11K

Publishing Venue

IBM

Related People

Carlgren, RG: AUTHOR

Abstract

This technique minimizes the storage required to represent the basic parts of speech of the words in a dictionary word list. Eight primary parts of speech are used to characterize words in European languages: "noun", "verb", "adjective", "adverb", "preposition", "conjunction", "pronoun", and "interjection". Many words can have multiple parts of speech. The number of distinct combinations of parts of speech that occur among words in the English language prevents the use of fewer than eight bits for a fixed-length representation of the parts of speech of a word. This technique represents parts of speech in an average of much less than eight bits per stored word.

This is the abbreviated version, containing approximately 89% of the total text.

The technique exploits the frequency distribution of the valid combinations of parts of speech that occur in European languages. In English, most words can, statistically, have one or more of the following parts of speech: "noun", "verb", "adjective", and "adverb". The various combinations of these four parts of speech can be represented in four bits. Since a word having all four or none of these parts of speech is highly unlikely, a four-bit mask with all bits on or all bits off can be used as a flag to indicate that the actual parts of speech are encoded in the following eight bits. Hence, a bit-mask representation of all valid parts of speech for a word is either four bits long or twelve bits long. It is the s...
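
The four-bit/twelve-bit encoding described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the disclosure's own implementation: the bit ordering, the choice to always emit the all-off mask as the flag while accepting either all-off or all-on when decoding, and the names PARTS, encode, and decode are all hypothetical.

```python
# Assumed bit order: the first four entries are the common parts of speech
# and occupy the high four bits of the full eight-bit mask.
PARTS = ["noun", "verb", "adjective", "adverb",
         "preposition", "conjunction", "pronoun", "interjection"]

ESCAPE_FLAGS = ("0000", "1111")  # all bits off or all bits on


def encode(tags):
    """Return a 4-character or 12-character bit string for a set of tags."""
    tags = set(tags)
    unknown = tags - set(PARTS)
    if unknown:
        raise ValueError(f"unknown parts of speech: {sorted(unknown)}")
    mask = 0
    for i, part in enumerate(PARTS):
        if part in tags:
            mask |= 1 << (7 - i)
    short = format(mask >> 4, "04b")
    # Short form: only common parts of speech are set, and the four-bit
    # mask does not collide with an escape flag.
    if mask & 0x0F == 0 and short not in ESCAPE_FLAGS:
        return short
    # Long form: flag (assumed here to be all bits off) plus full mask.
    return "0000" + format(mask, "08b")


def decode(bits, pos=0):
    """Decode one entry starting at pos; return (tags, next position)."""
    short = bits[pos:pos + 4]
    if len(short) < 4:
        raise ValueError("truncated entry")
    if short in ESCAPE_FLAGS:
        body = bits[pos + 4:pos + 12]
        if len(body) < 8:
            raise ValueError("truncated long-form entry")
        mask = int(body, 2)
        pos += 12
    else:
        mask = int(short, 2) << 4
        pos += 4
    tags = {p for i, p in enumerate(PARTS) if mask & (1 << (7 - i))}
    return tags, pos


if __name__ == "__main__":
    stream = encode({"noun"}) + encode({"preposition"}) + encode({"noun", "verb"})
    pos = 0
    while pos < len(stream):
        tags, pos = decode(stream, pos)
        print(sorted(tags))
```

If a fraction p of the stored words fits the four-bit form, the average cost under this scheme is 4p + 12(1 - p) bits per word, which falls below eight bits whenever more than half of the words take the short form.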