User annotations to facilitate collaborative web crawling and indexing

IP.com Disclosure Number: IPCOM000152211D
Original Publication Date: 2007-Apr-26
Included in the Prior Art Database: 2007-Apr-26
Document File: 3 page(s) / 37K

Publishing Venue

IBM

Abstract

Inventor - Gautham B Pai IBM

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 50% of the total text.

Prior work has addressed how to obtain user annotations and how this information is stored and presented to the user. This publication describes how these annotations can be used in the crawling and indexing process of a search engine to make the indexed information more accurate and more relevant to users.

The steps involved in this procedure are as follows:

1. Users annotate the content.

2. Search engine crawlers come across pages as part of the crawling process.

3. Search engines create the index, taking into account the annotation information provided by the users.

4. Users perform searches based on special tags.

5. Users rate the content.

6. Search engines use the ratings to refine the ranking of the results.

User annotation is not a new concept. Earlier work has described how interfaces can be provided for users to create annotations of web resources, and how these annotations can be stored and presented back to the users [1].

Users come across information either through a search engine or by some other means, and may want to annotate it in different ways. An annotation can be as simple as a set of keywords (tags) attached to the information, or it can state facts about the information in the form of triples (for example, RDF), or take other similar forms. The user interface may be a browser plug-in that interacts with the web crawler asynchronously, or a page hosted by the search engine provider where the required information can be entered. The basic interface requires the user to provide the following information:

The URL pattern of the pages where the information is present.

The XPath expression that locates the specific content in the page.

The annotation that the user wants to associate with the content present at the specified section of the page.

The URL patterns can be supplied as regular expressions so that a single annotation covers a larger range of pages. Each resource that matches the regular expression has a specific section that holds the information of importance, and this section is identified by an XPath. The XPath expressions may themselves contain regular expressions. Finally, the information is annotated using labels (tags), facts in the form of triples (the information 'x' at this particular XPath is talking about 'y'), or other means.

There has been some work on the use of user information in the indexing process. For example, Morris et al. [2] describe an indexing process that uses information from clients' browsers. This publication differs, however, in that users contribute directly to the indexing process.
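
As an illustration of the basic interface described earlier, the following minimal sketch shows how a crawler might apply one user annotation to a fetched page: the URL pattern selects the page, the XPath selects the section, and the tags are recorded in the index against the extracted content. The names `Annotation`, `apply_annotation` and `tag_index`, the sample URL and the sample markup are hypothetical and do not come from this publication. The sketch relies on the limited XPath subset of Python's standard `xml.etree.ElementTree` and assumes well-formed markup.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Annotation:
    """One user-supplied annotation (all field names are hypothetical)."""
    url_pattern: str   # regular expression over page URLs
    xpath: str         # path to the annotated section of the page
    tags: tuple        # labels the user attaches to that section


def apply_annotation(annotation, url, page_source, tag_index):
    """Record the annotated content of one crawled page under each tag.

    Returns the extracted text fragments (an empty list if the URL does
    not match the pattern or the XPath selects nothing).
    """
    if not re.fullmatch(annotation.url_pattern, url):
        return []
    root = ET.fromstring(page_source)  # assumes well-formed XHTML
    fragments = [
        "".join(node.itertext()).strip()
        for node in root.findall(annotation.xpath)
    ]
    for text in fragments:
        for tag in annotation.tags:
            tag_index[tag].append((url, text))
    return fragments


# Hypothetical annotation and page, for illustration only.
annotation = Annotation(
    url_pattern=r"http://example\.org/articles/\d+",
    xpath=".//div[@class='summary']",
    tags=("summary",),
)
page = (
    "<html><body>"
    "<div class='summary'>Short abstract of the article.</div>"
    "<div class='body'>Full text.</div>"
    "</body></html>"
)
tag_index = defaultdict(list)
apply_annotation(annotation, "http://example.org/articles/42", page, tag_index)

# A tag-based search (step 4) then becomes a lookup in the index.
print(tag_index["summary"])
```

Keeping the annotation separate from the page means one rule can be reused for every URL that matches the pattern, which is the reason the publication suggests regular expressions for the URL pattern.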

Example

Consider a website that offers information about various cities. The site provides information such as the time offset from GMT or UTC, the latitude and longitude of each city, the dialing code for the city etc....
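
One way such a site could be annotated with facts, rather than tags, is sketched below: each fact names a predicate and the XPath of the page section that holds its value, and every matching page yields a set of triples. This is only a sketch under stated assumptions. The URL pattern, the XPaths, the predicate names, the function `extract_triples` and the sample page are all hypothetical, and the values in the sample page are placeholders rather than real data.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical fact annotations for a city-information site: each entry maps
# a predicate name to the XPath of the page section that holds its value.
CITY_URL_PATTERN = r"http://cities\.example\.org/city/(?P<city>[a-z-]+)"
FACT_XPATHS = {
    "utcOffset": ".//span[@id='utc-offset']",
    "latitude": ".//span[@id='lat']",
    "longitude": ".//span[@id='lon']",
    "dialingCode": ".//span[@id='dial']",
}


def extract_triples(url, page_source):
    """Return (subject, predicate, object) triples for one matching page."""
    match = re.fullmatch(CITY_URL_PATTERN, url)
    if match is None:
        return []
    root = ET.fromstring(page_source)  # assumes well-formed markup
    triples = []
    for predicate, xpath in FACT_XPATHS.items():
        node = root.find(xpath)
        if node is not None and node.text:
            triples.append((match.group("city"), predicate, node.text.strip()))
    return triples


# Made-up sample page; the values are placeholders, not real data.
sample_page = (
    "<html><body>"
    "<h1>Sampleville</h1>"
    "<span id='utc-offset'>+01:00</span>"
    "<span id='lat'>10.0</span>"
    "<span id='lon'>20.0</span>"
    "<span id='dial'>0123</span>"
    "</body></html>"
)

sample_url = "http://cities.example.org/city/sampleville"
for triple in extract_triples(sample_url, sample_page):
    print(triple)
```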