User annotations to facilitate collaborative web crawling and indexing

IP.com Disclosure Number: IPCOM000152211D
Original Publication Date: 2007-Apr-26
Included in the Prior Art Database: 2007-Apr-26
Document File: 3 page(s) / 37K

Publishing Venue

IBM

Abstract

Inventor - Gautham B Pai, IBM

This publication relates to the field of search engines and to how user annotations can improve the accuracy of the content indexed by a search engine and, consequently, the relevance of the search results. Three roles are involved: the search engine provider, who defines how the crawler works; the content providers, who supply the data that is indexed; and the end users, who perform searches and use the information returned by the search engine. In existing implementations, end users have no say in how the crawling happens or in the kind of information that is indexed. Yet it is the end users who can make better sense of the information than an automated crawler can. Search engine providers have algorithms that try to extract and present the best information to the end users, and most of the time the indexing happens on unstructured or semi-structured data. Content providers may supply meta-information to the search engines by using meta tags or specific formats such as Resource Description Framework (RDF) (http://www.w3.org/RDF/), Web Ontology Language (OWL) (http://www.w3.org/TR/owl-features/), or Creative Commons (http://creativecommons.org/). Information provided in this way has the following problems:

* It is not contextual.
* It may not be accurate.
* The crawling/ranking process may be affected by various Search Engine Optimization techniques used by content providers.

This publication provides a mechanism by which end users can define the kind of information that is provided by content providers and can also rank the content. The key features of this methodology are:

1. End users annotate content found on the World Wide Web.
2. Multiple parts of a page, and multiple such pages, can be identified simultaneously and annotated with relevant information.
3. The annotated information is used in the crawling process to make the index semantically richer and more relevant than with automated crawling and indexing alone.
4. The annotations can take the form of 'facts' (or triples, in RDF terminology). This can make the index and the search more accurate, because human cognition has better judgment of relevance than an automated tool.
5. The rank of a page can be affected by information available via user annotations.
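The last feature, re-ranking with user input, could be realised as a simple score blend. The function below is an illustrative sketch only; the weighting scheme, the rating scale in [0, 1], and all names are assumptions, not something specified by this disclosure:

```python
def rerank(results, ratings, weight=0.5):
    """Blend a crawler's base relevance score with average user ratings.

    results: list of (url, base_score) pairs from the automated ranker
    ratings: dict mapping url -> list of user ratings in [0, 1]
    weight:  how strongly user ratings influence the final order
             (an assumed tuning parameter)
    """
    def score(item):
        url, base = item
        user = ratings.get(url, [])
        avg = sum(user) / len(user) if user else 0.0
        return (1 - weight) * base + weight * avg
    return sorted(results, key=score, reverse=True)

results = [("a.com", 0.9), ("b.com", 0.6)]
ratings = {"b.com": [1.0, 0.9]}  # users rated b.com highly
ranked = rerank(results, ratings)  # b.com now outranks a.com
```

A page that scores lower with the automated ranker can thus rise in the results when end users consistently rate its content highly.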

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 50% of the total text.

There has been prior work on how to obtain user annotations and on how this information is stored and presented to the user. This publication describes how these annotations can be used in the crawling and indexing process of a search engine to make the indexed information more accurate and relevant to the users.

The steps involved in this procedure are as follows:

1. Users annotate the content.
2. Search engine crawlers come across pages as part of the crawling process.
3. Search engines create the index, taking into account the annotation information provided by the users.
4. Users perform searches based on special tags.
5. Users rate the content.
6. Search engines use the ratings to refine the ranking of the results.

User annotations are not a new concept. It has been described how interfaces can be provided for users to create annotations of web resources and how these annotations can be stored and presented back to the users. [1]

Users come across some information either by using a search engine or by some other means, and may want to annotate this information in different ways. An annotation can be as simple as attaching a set of keywords (tags) to the information, or it can provide facts about the information in the form of triples (e.g. RDF) or in other similar ways. The user interface may be a browser plug-in that interacts with the web crawler asynchronously, or a page hosted by the search engine provider where the required information can be entered. The basic interface requires the user to provide the following information:

* The URL pattern of the pages where the information is present.
* The XPath expression identifying the specific content in the page.
* The annotation information that the user wants to associate with the content present at the specified section of the page.

The URL patterns can be supplied as regular expressions to cover a larger range of pages. Each resource that matches such an expression has a specific section containing the information of importance; this section is identified by an XPath to the information, and the XPath expressions may themselves contain regular expressions. Finally, the information is annotated using either labels (tags), facts in the form of triples (the information 'x' at this particular XPath is talking about 'y'), or other ways.

There has been some work on the use of user information in the indexing process. For example, Morris et al. [2] describe an indexing process that uses information from clients' browsers. This publication differs, however, in that users contribute directly to the indexing process.
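The annotation record described above, a URL pattern, an XPath, and the annotation itself, can be sketched as a small data structure. The field names, the sample pattern, and the matching helper below are illustrative assumptions, not part of the disclosure:

```python
import re
from dataclasses import dataclass

@dataclass
class Annotation:
    """A user-supplied annotation record (illustrative field names)."""
    url_pattern: str   # regular expression over page URLs
    xpath: str         # XPath to the annotated section of the page
    triple: tuple      # (subject, predicate, object) fact about that section

def annotations_for(url, annotations):
    """Return the annotations whose URL pattern matches this page.

    A crawler would call this for each fetched page, then apply each
    matching annotation's XPath to locate the annotated content."""
    return [a for a in annotations if re.fullmatch(a.url_pattern, url)]

# A hypothetical annotation covering every page under /cities/.
store = [
    Annotation(r"http://example\.com/cities/\w+",
               "//td[@class='latitude']",
               ("page", "describes", "a city's latitude")),
]
matches = annotations_for("http://example.com/cities/bangalore", store)
```

Because the URL pattern is a regular expression, a single annotation can cover every page of a site that shares a common template, which is what lets one user annotation enrich many pages at once.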

Example

Consider a website offering information about various cities. This site provides information such as the time offset from GMT or UTC, the latitude and longitude of the city, the dialing code for the city, and so on.
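For a city page like this, a user annotation could yield fact triples that the crawler folds into a searchable index. The triples, predicate names, and index shape below are assumptions made for illustration:

```python
from collections import defaultdict

# Hypothetical triples a user's annotations attach to a city page.
triples = [
    ("Bangalore", "utcOffset", "+05:30"),
    ("Bangalore", "latitude", "12.97"),
    ("Bangalore", "dialingCode", "080"),
]

# Index keyed by predicate, so a search on a special tag such as
# "utcOffset" returns exact facts rather than keyword matches.
index = defaultdict(list)
for subject, predicate, obj in triples:
    index[predicate].append((subject, obj))

def search(tag):
    """Look up facts by predicate (the 'special tags' of step 4)."""
    return index.get(tag, [])
```

Searching on the special tag then answers a structured query directly, which is the semantic enrichment of the index that automated crawling alone would not produce.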