
A System for Segment Retrieval in Web Content
Disclosure Number: IPCOM000198573D
Publication Date: 2010-Aug-09
Document File: 4 page(s) / 144K

Publishing Venue

The Prior Art Database


Disclosed is a system and method for querying text documents available on the Internet and retrieving relevant segments of content. The system includes a user-friendly interface with flexible querying and retrieval options.

This text was extracted from a PDF file.
At least one non-text object (such as an image or picture) has been suppressed.
This is the abbreviated version, containing approximately 40% of the total text.

A System for Segment Retrieval in Web Content

The evolving Internet has given rise to several new technologies that give users novel ways to post and retrieve information. Web-based communities, social-networking sites, wikis, blogs, mashups and folksonomies serve as interactive platforms where users can exchange information. A large portion of this content is in the form of text passages of arbitrary length, discussing a variety of topics.

When target documents are large, a user may have difficulty running a query and successfully finding and retrieving the segments within these documents that are relevant to a particular topic.

Existing approaches to this problem focus on indexing pieces of text that represent meaningful text entities (e.g., paragraphs, chapters, or even entire documents) [1, 2]. At retrieval time, the document retrieval system returns these a-priori defined entities to the user. For example, the result of a keyword query can be an entire book chapter, while only a small fragment of the chapter may be truly relevant to the user's interests. In addition, the right choice of the entity to be indexed is not straightforward and may vary with the type of text. Furthermore, existing methods index these entities using a bag-of-words (BOW) approach; they only consider the appearance or frequency of a term (or phrase), ignoring the document's semantic continuity and topical structure. As a result, entities with a relatively small overall number of occurrences of the query terms might be overlooked, even if these occurrences are close to each other in the actual text, forming locally-cohesive segments.
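The limitation of position-blind counting can be illustrated with a minimal sketch. The function names, the sample passages, and the "span" measure below are hypothetical and chosen only for illustration; they are not taken from the disclosure. Both passages contain the same number of query-term occurrences, so a bag-of-words count cannot tell them apart, while the span covered by those occurrences differs widely.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def bow_count(tokens, query_terms):
    """Bag-of-words score: total query-term occurrences, ignoring position."""
    return sum(1 for t in tokens if t in query_terms)

def occurrence_span(tokens, query_terms):
    """Number of tokens from the first to the last query-term occurrence.

    Returns None when no query term occurs. This is an illustrative
    locality measure, not a scoring function from the disclosure.
    """
    positions = [i for i, t in enumerate(tokens) if t in query_terms]
    if not positions:
        return None
    return positions[-1] - positions[0] + 1

query = {"solar", "panel"}

clustered = tokenize(
    "The town hall met on Monday. Solar panel costs fell and solar panel "
    "output rose. Parking was discussed at length afterwards."
)
scattered = tokenize(
    "Solar energy was mentioned briefly. The town hall met on Monday and "
    "talked about parking. One panel member left early. Later a solar lamp "
    "broke near the door panel."
)

for name, tokens in (("clustered", clustered), ("scattered", scattered)):
    print(name, bow_count(tokens, query), occurrence_span(tokens, query))
```

Under a pure count the two passages tie, although only the first contains a locally cohesive run of text about the query topic.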

Finally, the majority of existing information-retrieval mechanisms rely on standard keyword queries. A major problem with such queries is that they often fail to capture the user's actual intent. Forming an effective query is often a challenging task for users, especially when doing so requires domain-specific knowledge.

The solution described here is an efficient framework for finding and retrieving parts of documents that are relevant to a user query. Within this disclosure, parts of documents are referred to as document segments or simply segments. The solution provides:

• A flexible representation for queries that can accurately capture the user's search intent

• Simple and intuitive scoring functions for segments of text, able to accurately capture the relevance between a segment and a given query

• A linear-time, parameter-free algorithm that identifies and ranks relevant segments

• A system, Seren, a user-friendly tool written in Java* that efficiently implements the methodology
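This excerpt does not give the scoring functions or the algorithm themselves. As a rough sketch of how a single linear pass can locate a locally cohesive segment, the Python below substitutes a standard maximum-sum contiguous-run scan (Kadane's algorithm) with an assumed per-token score of +1 for a query term and -1 for any other token. The function name `best_segment` and the scoring rule are assumptions for illustration, not the method implemented in Seren.

```python
def best_segment(tokens, query_terms):
    """Return (start, end, score) of the highest-scoring contiguous token run.

    `end` is exclusive. Each token scores +1 if it is a query term and -1
    otherwise (an assumed scoring rule, not the disclosure's). Returns None
    when no token matches. Runs in one pass over the tokens.
    """
    best = None
    run_start, run_score = 0, 0
    for i, tok in enumerate(tokens):
        # A run whose score has dropped to zero or below cannot help any
        # later segment, so start a fresh run at the current token.
        if run_score <= 0:
            run_start, run_score = i, 0
        run_score += 1 if tok in query_terms else -1
        if run_score > 0 and (best is None or run_score > best[2]):
            best = (run_start, i + 1, run_score)
    return best

tokens = "we met monday solar panel costs solar panel solar output rose".split()
result = best_segment(tokens, {"solar", "panel"})
if result is not None:
    start, end, score = result
    print(tokens[start:end], score)
```

The scan needs no index built in advance and no window-length setting; the segment boundaries fall out of where the running score peaks.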

The invention, as a retrieval mechanism, does not rely on any index structure. Instead of focusing on a-priori defined text entities, the method dynamically retrieves locally cohesive pieces of text that are highly relevant to a user's query. During the retrieval process, the system a...