Method and System for Determining Relatedness of Words

IP.com Disclosure Number: IPCOM000200457D
Publication Date: 2010-Oct-14
Document File: 3 page(s) / 32K

Publishing Venue

The IP.com Prior Art Database

Related People

Radhanikanth GVR: INVENTOR [+2]

Abstract

A method and system for determining relatedness of words in order to improve a search experience is disclosed.

This text was extracted from a Microsoft Word document.
This is the abbreviated version, containing approximately 49% of the total text.

Description

Disclosed is a method and system for determining relatedness of words by using word co-occurrence patterns in order to improve a search experience.

Typically, a web index owned by a search company is organized as a word-document matrix in the form of an augmented inverted index. For every word in the vocabulary, the augmented inverted index contains an entry for each document in which the word occurs, along with additional information such as the positions of the word in the document, the number of occurrences, the HTML type of the word (e.g., title or heading), and the font size and type. These documents are sorted in order of their static rank. The static rank of each document is determined using the inherent link structure of the web.
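
The index layout described above can be pictured with a small in-memory sketch. This is only an illustration under stated assumptions, not the structure of any real search index: the names Posting, AugmentedInvertedIndex, add and top_docs are hypothetical, as are the field defaults and the sample words and ranks.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Posting:
    """One word-in-document entry of the (hypothetical) augmented inverted index."""
    doc_id: int
    static_rank: float        # link-based rank of the document; higher is better
    positions: List[int]      # token offsets of the word within the document
    html_type: str = "body"   # e.g. "title" or "heading"
    font_size: int = 12       # assumed default, for illustration only
    font_type: str = "regular"

    @property
    def occurrences(self) -> int:
        # The number of occurrences follows from the stored positions.
        return len(self.positions)


class AugmentedInvertedIndex:
    def __init__(self) -> None:
        self._postings: Dict[str, List[Posting]] = {}

    def add(self, word: str, posting: Posting) -> None:
        plist = self._postings.setdefault(word, [])
        plist.append(posting)
        # Keep each posting list sorted by decreasing static rank.
        plist.sort(key=lambda p: p.static_rank, reverse=True)

    def top_docs(self, word: str, n: int) -> List[int]:
        """Return the IDs of the n highest-ranked documents containing the word."""
        return [p.doc_id for p in self._postings.get(word, [])[:n]]


index = AugmentedInvertedIndex()
index.add("jaguar", Posting(doc_id=7, static_rank=0.2, positions=[3, 18]))
index.add("jaguar", Posting(doc_id=2, static_rank=0.9, positions=[0], html_type="title"))
index.add("jaguar", Posting(doc_id=5, static_rank=0.5, positions=[11]))
top_two = index.top_docs("jaguar", 2)  # document IDs in decreasing static rank
```

Re-sorting on every insertion keeps the sketch short; a production index would more plausibly be built in bulk and sorted once.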

The method and system disclosed herein use this web index to extract information regarding the relatedness of words. The method involves obtaining a list of the top N document IDs (results) for each word. Since the document IDs are sorted by decreasing static rank, the best pages that include the word are obtained. However, these may not necessarily be the most relevant documents. Thereafter, for each document ID in the list, the forward index (if present) or a document store is used to extract the closest 'm' non-trivial words, which are stored in a list belonging to the word under consideration. Here, 'm' may be tuned to admit as little noise as possible. A non-trivial word is one that is neither a stop word nor a high-frequency word in the language.
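
The per-document extraction step can be sketched as follows, assuming the document has already been tokenized into lowercase words. The function name closest_nontrivial_words and the tiny STOP_WORDS set are hypothetical; the set stands in for both stop words and high-frequency words, and the disclosure does not say how either list is obtained or how ties in distance are broken.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical stand-in for the stop-word and high-frequency word lists.
STOP_WORDS: Set[str] = {"the", "a", "an", "of", "in", "is", "and", "to", "on"}


def closest_nontrivial_words(
    tokens: List[str],
    target: str,
    m: int,
    stop_words: Set[str] = STOP_WORDS,
) -> List[Tuple[str, int]]:
    """Return up to m (word, distance) pairs nearest to any occurrence of target."""
    target_positions = [i for i, tok in enumerate(tokens) if tok == target]
    if not target_positions:
        return []

    best: Dict[str, int] = {}
    for i, tok in enumerate(tokens):
        if tok == target or tok in stop_words:
            continue  # skip the target itself and trivial words
        # Word distance to the nearest occurrence of the target.
        dist = min(abs(i - p) for p in target_positions)
        if tok not in best or dist < best[tok]:
            best[tok] = dist

    # Nearest first; ties broken alphabetically (an assumption of this sketch).
    ranked = sorted(best.items(), key=lambda kv: (kv[1], kv[0]))
    return ranked[:m]


tokens = "the jaguar is a big cat found in the amazon rainforest".split()
nearest = closest_nontrivial_words(tokens, "jaguar", 3)
```

Here distance is counted in tokens, including the skipped trivial words, which is one possible reading of "word distance".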

Each entry in the list consists of a word, its proximity (word distance) to the word under consideration, and the part-of-speech (POS) tags of the word. A word without a POS tag may be assumed to be a proper noun. This information can be stored in a posting list in which, instead of documents, the individual words 'related' to the main word under consideration are stored along with their meta information.
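
One way to picture such an entry is the sketch below. The class name RelatedWordEntry, the property effective_pos and the tag label "PROPN" are assumptions made for illustration; the disclosure does not name a tag set or a storage format.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

PROPER_NOUN = "PROPN"  # hypothetical label for the proper-noun fallback


@dataclass(frozen=True)
class RelatedWordEntry:
    word: str
    distance: int                  # word distance to the word under consideration
    pos_tag: Optional[str] = None  # None when no POS tag is available

    @property
    def effective_pos(self) -> str:
        # A word without a POS tag is assumed to be a proper noun.
        return self.pos_tag if self.pos_tag is not None else PROPER_NOUN


# Posting list keyed by the main word; values are related words, not documents.
related_postings: Dict[str, List[RelatedWordEntry]] = {
    "jaguar": [
        RelatedWordEntry("cat", 4, "NOUN"),
        RelatedWordEntry("amazon", 8),
    ]
}
pos_tags = [entry.effective_pos for entry in related_postings["jaguar"]]
```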

Thereafter, the results obtained across document IDs are merged by computing the count of each related word and its average distance from the word under consideration, while retaining the POS tags. The merged list is then sorted in descending order of frequency of occurrence. From the sorted list, the 'x' most frequent words and the 'y' most frequent proper nouns are selected. Subsequently, this list of 'x+y' words is stored as the related words or concepts of the original word under consideration, retaining the average distance to the word and the POS information. This process is repeated for all the words in the vocabulary of the index.
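
The merge-and-select step might look like the following minimal sketch. It assumes that the 'x' words are drawn from the entries that are not proper nouns, which is one reading of the text; the function name merge_related, the tuple layout, the "PROPN" label and the alphabetical tie-break are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

Entry = Tuple[str, int, Optional[str]]   # (word, distance, POS tag or None)
Merged = Tuple[str, int, float, str]     # (word, count, average distance, POS tag)
PROPER_NOUN = "PROPN"                    # hypothetical label for untagged words


def merge_related(per_doc_entries: Iterable[List[Entry]], x: int, y: int) -> List[Merged]:
    """Merge per-document lists, then keep the x most frequent words and y proper nouns."""
    counts: Dict[str, int] = defaultdict(int)
    dist_sums: Dict[str, int] = defaultdict(int)
    pos: Dict[str, str] = {}

    for entries in per_doc_entries:
        for word, distance, tag in entries:
            counts[word] += 1
            dist_sums[word] += distance
            # Retain the first POS tag seen; untagged words count as proper nouns.
            pos.setdefault(word, tag if tag is not None else PROPER_NOUN)

    merged = [(w, counts[w], dist_sums[w] / counts[w], pos[w]) for w in counts]
    # Descending frequency; ties broken alphabetically (an assumption).
    merged.sort(key=lambda e: (-e[1], e[0]))

    common = [e for e in merged if e[3] != PROPER_NOUN][:x]
    proper = [e for e in merged if e[3] == PROPER_NOUN][:y]
    return common + proper


docs = [
    [("cat", 2, "NOUN"), ("amazon", 5, None)],
    [("cat", 4, "NOUN"), ("car", 1, "NOUN"), ("amazon", 3, None)],
    [("car", 3, "NOUN"), ("cat", 3, "NOUN"), ("london", 6, None)],
]
related = merge_related(docs, x=1, y=1)
```

Running the same function once per vocabulary word, over the lists gathered from that word's top N documents, would correspond to the repetition described above.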

In addition, the above process may be augmented by extracting popular bigrams and trigrams and using them as base words to find words related to the bigram or trigram as a whole. The bigrams may be generated based on the word index structure...