Italian Suffix Table for Extracting Word Stems

IP.com Disclosure Number: IPCOM000118704D
Original Publication Date: 1997-May-01
Included in the Prior Art Database: 2005-Apr-01
Document File: 2 page(s) / 65K

Publishing Venue

IBM

Related People

Porter, TW: AUTHOR

Abstract

Disclosed is a suffix rules file specific to the Italian language which is to be used in conjunction with the Paice stemming algorithm.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 52% of the total text.

      Stemming words is one approach to generating a search and
retrieval index that is smaller and more efficient than a full-text
index.  One published stemming algorithm is the Paice Stemmer.  It
requires a list of suffixes, together with rules to apply when
removing each suffix, to generate a word stem.  A suffix rules file
has been published for English but not for other languages.  This
disclosure provides a listing of an Italian suffix rules file.
Note that the application of the algorithm is not case sensitive.

      The Paice algorithm works by parsing a "token" or "word" from
the input stream.  It then reverses the order of the characters in
the token and compares the result to the suffix rules file.  The
suffix rules file is checked in the order given, from the first line
to the last line.  If the first token of a rule in the suffix rules
file exactly matches the first <n> characters of the reversed input
token, where <n> is the length of that rule token, then the rule is
applied.  The rule consists of the remaining tokens on that line of
the suffix rules file.
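
      The matching step described above can be sketched in Python as
follows.  This is a minimal illustration, not the disclosed
implementation.  It assumes that the tokens on a rules file line are
separated by whitespace.  The function name find_matching_rules, the
two sample rule lines, and the placeholder rule tokens RULE-A and
RULE-B are hypothetical and are not taken from the disclosed Italian
table.  The sketch only reports which rules match the original token;
it does not apply them.

```python
def find_matching_rules(token, rule_lines):
    """Return (suffix, rule_tokens) for every rule line that matches.

    Assumption: each rule line is whitespace-separated, and its first
    token is a suffix already written in reversed character order.
    """
    # Matching is not case sensitive, so fold case before reversing.
    reversed_token = token.lower()[::-1]
    matches = []
    for line in rule_lines:              # checked in file order
        fields = line.split()
        if not fields:
            continue                     # ignore blank lines
        suffix, rule_tokens = fields[0].lower(), fields[1:]
        # The rule matches when its first token equals the first
        # len(suffix) characters of the reversed input token.
        if reversed_token.startswith(suffix):
            matches.append((suffix, rule_tokens))
    return matches


# Hypothetical rule lines, for illustration only.
sample_rules = ["enoiz RULE-A", "era RULE-B"]
print(find_matching_rules("Nazione", sample_rules))
print(find_matching_rules("parlare", sample_rules))
```

      Here "Nazione" is lowercased and reversed to "enoizan", which
begins with "enoiz", so only the first sample rule matches.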

      The optional second token in a suffix rules file line may be an
asterisk (*).  This indicates that the rule is to be applied only if
it is the first rule to match the input token.  The next token in
the suffix rules file line is...
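
      The handling of the optional asterisk can be sketched in Python
as follows.  This is an illustration under stated assumptions, not
the disclosed implementation.  It assumes whitespace-separated tokens
and a caller that keeps a count of the rules already matched for the
current input token.  The names parse_rule_line and rule_may_apply,
and the placeholder token REST, are hypothetical.  The tokens that
follow the asterisk are returned uninterpreted, because their meaning
is not covered in the text above.

```python
def parse_rule_line(line):
    """Split a rule line into (suffix, first_match_only, remaining).

    Assumption: tokens are whitespace-separated, the first token is
    the reversed suffix, and an optional second token "*" marks a
    rule that applies only as the first match.
    """
    fields = line.split()
    suffix = fields[0].lower()
    first_match_only = len(fields) > 1 and fields[1] == "*"
    # Remaining tokens are kept as-is; they are not interpreted here.
    remaining = fields[2:] if first_match_only else fields[1:]
    return suffix, first_match_only, remaining


def rule_may_apply(reversed_token, parsed_rule, rules_matched_so_far):
    """Decide whether a parsed rule may be applied to the token."""
    suffix, first_match_only, _ = parsed_rule
    if not reversed_token.startswith(suffix):
        return False
    # An asterisk rule is allowed only when no earlier rule matched.
    return not (first_match_only and rules_matched_so_far > 0)


rule = parse_rule_line("enoiz * REST")   # hypothetical rule line
print(rule)
print(rule_may_apply("enoizan", rule, 0))
print(rule_may_apply("enoizan", rule, 1))
```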