
Datamodel for fast access of context of textual data

IP.com Disclosure Number: IPCOM000015323D
Original Publication Date: 2001-Nov-10
Included in the Prior Art Database: 2003-Jun-20
Document File: 2 page(s) / 54K

Publishing Venue

IBM

Abstract

Disclosed is a datamodel that allows fast access to the context of textual data. One application of this datamodel is search. Searching a textual collection of documents is widespread, yet many searches yield unsatisfying results. This is partly because queries are short and therefore provide little context. It is widely accepted that the relevance feedback method improves results. Briefly, this method examines the first n result documents of the query (where n is system specified), picks appropriate words from these documents, and adds them to the original query. The expanded query is submitted to the corpus again and a new result list is produced. There are many variations of this algorithm in the art.

The difficult question is which words from the original result list to add to the query. Too many words lead to long queries that take longer to execute, and they can also introduce a lot of noise into the results. Furthermore, different applications will have different requirements for which additional words to select. It would therefore be prudent to annotate (a subset of) the words in the collection with sufficient metadata, so that the words used to expand a query can be selected quickly based on the metadata values. Clearly, the metadata would be precomputed.

We propose the following schema. Each document is tokenized into words, and each word is stemmed using a stemmer that can also be applied to the query itself. Associate with each word an integer giving its sequence number in the document, along with its part-of-speech tag and its paragraph and sentence numbers. Note that each part-of-speech tag is assigned a unique integer.
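The per-word annotation described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the `stem` function, the `POS_CODES` table, and the paragraph/sentence splitting rules are placeholder assumptions; a real system would use a proper stemmer and part-of-speech tagger.

```python
import re
from dataclasses import dataclass

@dataclass
class WordRow:
    doc_id: int  # document identifier
    seq: int     # sequence number of the word within the document
    stem: str    # stemmed form of the word
    pos: int     # unique integer code for the part-of-speech tag
    para: int    # paragraph number
    sent: int    # sentence number

def stem(word: str) -> str:
    # Crude placeholder stemmer (strips trailing "s"); a real system
    # would use a proper stemmer applied identically to the query.
    return word.lower().rstrip("s")

# Placeholder mapping of part-of-speech tags to unique integers.
POS_CODES = {"NOUN": 1, "VERB": 2, "OTHER": 0}

def annotate(doc_id: int, text: str) -> list[WordRow]:
    """Tokenize a document into annotated word rows."""
    rows = []
    seq = 0
    for para_no, para in enumerate(text.split("\n\n")):
        for sent_no, sentence in enumerate(re.split(r"(?<=[.!?])\s+", para)):
            for token in re.findall(r"[A-Za-z]+", sentence):
                rows.append(WordRow(doc_id, seq, stem(token),
                                    POS_CODES["OTHER"], para_no, sent_no))
                seq += 1
    return rows
```

Each resulting row carries only integers (plus the stem), which is what makes the vicinity computations described next cheap.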
All this information can be loaded into a database, each row holding a word together with its document id and the integers just described. It has been shown in the art that words in the result documents that lie in the vicinity of the query words are good words with which to augment the query. Using the novel approach of sequence numbers, the vicinity is easily computed with fast integer arithmetic: first the sequence number sN of a query term is determined, and then the terms whose sequence numbers fall within a range of sN are retrieved. This fast query can be further refined. Here are some examples:
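The vicinity lookup can be sketched with an in-memory database. The table layout, the sample rows, and the window size k are illustrative assumptions, not the disclosed schema; the point is that once sN is known, the neighbourhood reduces to an integer range test (seq BETWEEN sN - k AND sN + k).

```python
import sqlite3

# One row per word: document id plus the precomputed integers.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE words
                (doc_id INTEGER, seq INTEGER, stem TEXT,
                 pos INTEGER, para INTEGER, sent INTEGER)""")

# Hypothetical annotated rows for a single short document.
rows = [(1, 0, "relevance", 1, 0, 0),
        (1, 1, "feedback",  1, 0, 0),
        (1, 2, "improves",  2, 0, 0),
        (1, 3, "search",    1, 0, 0)]
conn.executemany("INSERT INTO words VALUES (?, ?, ?, ?, ?, ?)", rows)

def vicinity(conn, doc_id, term, k):
    """Return stems whose sequence number lies within k of the term's."""
    (sN,) = conn.execute(
        "SELECT seq FROM words WHERE doc_id = ? AND stem = ?",
        (doc_id, term)).fetchone()
    return [s for (s,) in conn.execute(
        "SELECT stem FROM words WHERE doc_id = ? "
        "AND seq BETWEEN ? AND ? AND stem != ?",
        (doc_id, sN - k, sN + k, term))]
```

Refinements such as restricting to the same sentence or paragraph, or to particular part-of-speech codes, are further integer comparisons on the same row.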

This is the abbreviated version, containing approximately 52% of the total text.


