Text Indexing Using Relational Database Management System (RDBMS)

IP.com Disclosure Number: IPCOM000012739D
Original Publication Date: 2003-May-23
Included in the Prior Art Database: 2003-May-23
Document File: 4 page(s) / 49K

Publishing Venue

IBM

Abstract

Described is a process for performing general text searches for collections of hyperlinked documents.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 53% of the total text.

General text search for collections of hyperlinked documents is a major problem. Its key requirements are a fast response and a ranking that places the most relevant documents first. This invention solves the problem using a relational database with a simple two-table schema and equally simple queries.

The Schema:

create database novo;

create table page_tab (
    pid int,        -- relevance-ranked document identifier
    url char(200)   -- handle to document
);

create table ref_tab (
    wid int,        -- word identifier produced by hash
    lid int,        -- ranked location of word in document
    pid int,        -- document identifier
    minlid int      -- most salient location of word in document
);

create index fact on ref_tab (wid, lid, pid, minlid);
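As a rough way to try the schema out, the sketch below loads it into an in-memory SQLite database through Python's standard `sqlite3` module. SQLite is an assumption made only for illustration (the disclosure does not name a particular RDBMS), and since SQLite has no `create database` statement, the connection itself stands in for the `novo` database. The function name `create_novo` is hypothetical.

```python
import sqlite3

# The two-table schema from the text, with standard SQL "--" comments.
SCHEMA = """
create table page_tab (
    pid int,        -- relevance-ranked document identifier
    url char(200)   -- handle to document
);
create table ref_tab (
    wid int,        -- word identifier produced by hash
    lid int,        -- ranked location of word in document
    pid int,        -- document identifier
    minlid int      -- most salient location of word in document
);
create index fact on ref_tab (wid, lid, pid, minlid);
"""


def create_novo(path=":memory:"):
    """Open a database (in memory by default) and install the schema."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


conn = create_novo()
# List the tables and the index that were just created.
names = sorted(
    row[0]
    for row in conn.execute(
        "select name from sqlite_master where type in ('table', 'index')"
    )
)
print(names)
```

The index covers all four columns of ref_tab, so a lookup by `wid` could in principle be answered from the index alone; whether a given RDBMS actually does so depends on its query planner.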

Parsing Documents for Indexing:

Documents are collections of words structured into components. Each component contains some of the words of the document. Components are assigned location ids, which rank the salience of the words located in that component. An example collection of documents is a set of HTML documents. In HTML documents, words can be defined as alphanumeric strings. We can define an arbitrary salience structure for HTML documents as follows:

1. words in the URL
2. words in the title
3. words in the description or meta keywords
4. words in the headlines
5. words in the body of the text
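The salience structure above and the "alphanumeric string" definition of a word can be sketched in a few lines of Python. The names `SALIENCE`, `words`, and `strip_tags` are hypothetical, lowercasing is an assumption (the disclosure does not say how case is handled), and the tag stripping is deliberately crude.

```python
import re

# Location ids (lid) for the example salience structure; 1 is most salient.
SALIENCE = {
    "url": 1,
    "title": 2,
    "meta": 3,      # description or meta keywords
    "headline": 4,
    "body": 5,
}

WORD_RE = re.compile(r"[A-Za-z0-9]+")


def words(text):
    """Split text into alphanumeric words (lowercased by assumption)."""
    return [w.lower() for w in WORD_RE.findall(text)]


def strip_tags(html):
    """Crude tag removal, adequate only for this illustration."""
    return re.sub(r"<[^>]*>", " ", html)


print(words("www.dog.com"))
print(words(strip_tags("<html> Dogs are nice. </html>")))
```

A real parser would need a proper HTML library to assign each word to its component; the regular expressions here only show the word definition at work.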

Documents are parsed into a sequence of words and word pairs. Each word appears once at its most salient location and is cascaded to each lower-salience location. Consider the following simple HTML document, without title or meta keywords:

www.dog.com: <html> Dogs are nice. </html>

When parsed, this document produces the following ref_tab entries:

word     lid  pid  minlid
----     ---  ---  ------
www      1    1    1
www_dog  1    1    1
dog      1    1    1
dog_com  1    1    1
com      1    1    1

www      2    1    1
www_dog  2    1    1
dog      2    1    1
dog_com  2    1    1
com      2    1    1

www      3    1    1
www_dog  3    1    1
dog      3    1    1
dog_com  3    1    1
co...
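The cascading rule can be illustrated with a short Python sketch that reproduces the rows listed above for the URL words. It rests on several assumptions not confirmed by the abbreviated text: the body words enter at lid 5 with minlid 5, words are lowercased, and the word text is kept in place of the hashed `wid` because the hash function is not specified. The names `terms`, `ref_rows`, and `MAX_LID` are hypothetical.

```python
import re

MAX_LID = 5  # body of the text, the least salient location in the example


def terms(text):
    """Return words and adjacent word pairs, interleaved in document order."""
    ws = [w.lower() for w in re.findall(r"[A-Za-z0-9]+", text)]
    out = []
    for i, w in enumerate(ws):
        out.append(w)
        if i + 1 < len(ws):
            out.append(w + "_" + ws[i + 1])
    return out


def ref_rows(pid, components):
    """Build (word, lid, pid, minlid) rows for one document.

    components maps a location id to the text found at that location.
    """
    minlid = {}   # term -> most salient location where it occurs
    order = []    # terms in order of first appearance
    for lid in sorted(components):
        for t in terms(components[lid]):
            if t not in minlid:
                minlid[t] = lid
                order.append(t)
    rows = []
    # Cascade: each term is emitted at its minlid and at every higher lid.
    for lid in range(1, MAX_LID + 1):
        for t in order:
            if minlid[t] <= lid:
                rows.append((t, lid, pid, minlid[t]))
    return rows


rows = ref_rows(1, {1: "www.dog.com", 5: "Dogs are nice."})
for row in rows[:5]:
    print(row)
```

Under these assumptions the first fifteen rows match the entries shown for lid 1 through lid 3; what the disclosure emits for the body words cannot be checked against the truncated listing.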