
System and Technique for Automatically Detecting a Distributed Component Structure of Web Based Documents using a Cluster Crawler Analyzer

IP.com Disclosure Number: IPCOM000014891D
Original Publication Date: 2002-Jun-01
Included in the Prior Art Database: 2003-Jun-20
Document File: 8 page(s) / 128K

Publishing Venue

IBM

Abstract

The system relates to the area of information retrieval technologies in the context of distributed documents and document/information classification techniques. We will describe: 1. Problem Statement, 2. Proposed Solution, 3. Advantages and Benefits.

    The system relates to the area of information retrieval technologies in the context of distributed documents and document/information classification techniques. We will describe:

1. Problem Statement
2. Proposed Solution
3. Advantages and Benefits

1. Problem Statement

    Typically, most documents have well-defined structures. A book, for instance, is composed of chapters, each of which may contain one or more sections. Each section, in turn, consists of paragraphs, which are formed from sentences. Similarly, more specialized document types such as resumes also adhere to a well-defined structure: virtually every resume has sections corresponding to "Objective," "Education," "Employment History," "Skills," etc. Conventionally, most such structured documents have been presented in a unified, monolithic, block format. With the rapid surge in the popularity of the Internet and hyperlinks, however, this paradigm has begun to shift, and many structured documents are now presented in a distributed fashion. For instance, many people have begun posting their resumes on their Web pages in a distributed manner, with each section corresponding to a different URL and linked to the other sections through hyperlinks.

    The problem we seek to address is that automated Web crawling tools are not well suited to crawling and retrieving such distributed documents. Web crawlers work by matching the name of a page or its URL against some pre-specified set of keywords and, based upon the resultant match, concluding that the page contains the document or information being sought. Clearly, this does not hold for distributed documents since, by definition, the document is spread across multiple pages, each of which may correspond to a different and unique URL. As such, it is desirable to develop a technique by which tools such as Web crawlers can infer the scope of a document and exploit contextual information to retrieve it in its entirety.
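    To make this limitation concrete, the following sketch (in Python; the keyword set, URLs, and the helper name is_relevant are illustrative assumptions, not part of this disclosure) shows how a crawler that matches only the URL or page title against pre-specified keywords accepts a monolithic resume page but misses the individual sections of a distributed resume:

from urllib.parse import urlparse

# Hypothetical keyword set a conventional crawler might be configured with.
KEYWORDS = {"resume", "cv"}

def is_relevant(url: str, page_title: str) -> bool:
    """Judge relevance solely from the URL path and the page title."""
    path = urlparse(url).path.lower()
    title = page_title.lower()
    return any(kw in path or kw in title for kw in KEYWORDS)

# The monolithic page matches ...
print(is_relevant("http://example.com/~alice/resume.html", "Alice's Resume"))    # True
# ... but the distributed sections of the same resume are missed, because
# their URLs and titles carry no matching keyword.
print(is_relevant("http://example.com/~alice/education.html", "Education"))      # False
print(is_relevant("http://example.com/~alice/jobs.html", "Employment History"))  # False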

    For instance, a Web crawling tool could be told about the expected, standard structure of the document it is being used to retrieve. Further, the crawler could be given some heuristics by which these structural components may be (probabilistically) identified and delineated. Such heuristics can include, for instance, a list of content keywords and/or a list of possible sub-structural components that the crawler should look for in a given "candidate." Given the extent of the match with such heuristics, the crawler can readily determine a "match probability." If the determined "match probability" is higher than some predefined threshold, the candidate page ought to be accepted. Subsequently, the crawler should compile a listing of a...
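    One way such heuristic matching could be realized is sketched below (Python; the expected component names, keyword lists, and the 0.5 threshold are assumptions chosen for illustration, not values prescribed above):

# Expected sub-structural components of a resume and content keywords that
# hint at each one (illustrative values only).
EXPECTED_COMPONENTS = {
    "objective":  {"objective", "career goal"},
    "education":  {"education", "university", "degree"},
    "employment": {"employment", "experience", "employer"},
    "skills":     {"skills", "proficient", "tools"},
}
THRESHOLD = 0.5  # assumed predefined acceptance threshold

def match_probability(page_text: str) -> float:
    """Fraction of expected components whose keywords appear in the page text."""
    text = page_text.lower()
    hits = sum(
        1
        for keywords in EXPECTED_COMPONENTS.values()
        if any(kw in text for kw in keywords)
    )
    return hits / len(EXPECTED_COMPONENTS)

def accept_candidate(page_text: str) -> bool:
    """Accept the candidate page when its match probability exceeds the threshold."""
    return match_probability(page_text) > THRESHOLD

page = "Education: B.S. in CS. Skills: Java, SQL. Experience: five years at Acme."
print(match_probability(page))   # 0.75
print(accept_candidate(page))    # True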