User annotations to facilitate collaborative web crawling and indexing
Original Publication Date: 2007-Apr-26
Included in the Prior Art Database: 2007-Apr-26
Inventor - Gautham B Pai IBM

This publication relates to the field of search engines and to how user annotations can be used to improve the accuracy of the content indexed by a search engine and, in turn, the relevance of the search results.

There are three roles involved: the search engine provider, who defines how the crawler works; content providers, who provide the data that is indexed; and end users, who perform searches and use the information returned by the search engine.

In existing implementations, end users have no say in how crawling happens or in what kind of information is indexed. However, end users can make better sense of the information than an automated crawler can. Search engine providers use algorithms that try to extract and provide the best information to end users. Most of the time, indexing happens on unstructured or semi-structured data. Content providers may supply meta-information to search engines by using meta tags or specific formats such as Resource Description Framework (RDF) (http://www.w3.org/RDF/), Web Ontology Language (OWL) (http://www.w3.org/TR/owl-features/), or Creative Commons (http://creativecommons.org/).

Information provided in this way has the following problems:
* It is not contextual.
* It may not be accurate.
* The crawling/ranking process may be affected by various Search Engine Optimization techniques used by content providers.

This publication provides a mechanism by which end users can define the kind of information that is provided by content providers and also rank the content. The key features of this methodology are:
1. End users annotate content found on the World Wide Web.
2. The ability to identify multiple parts of pages, and multiple such pages, simultaneously and then annotate them with relevant information.
3. The ability to use this annotated information in the crawling process to make the index semantically richer and more relevant than with automated crawling and indexing.
4. Annotations can take the form of "facts" (triples, in RDF terminology). These can make the index and search more accurate, because human judgment of relevance is better than that of an automated tool.
5. The rank of a page can be affected by information available via user annotations.
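
The publication does not specify a data model or a ranking formula, so the following Python sketch is only one possible reading of the features listed above. Every name in it (Triple, Annotation, AnnotatedIndex, the targets and rating fields) is hypothetical, and the scoring rule (number of matching facts plus the sum of user ratings) is an assumption chosen for brevity, not the author's method.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import NamedTuple


class Triple(NamedTuple):
    """A user-asserted fact in subject/predicate/object form (RDF-style)."""
    subject: str
    predicate: str
    obj: str


@dataclass(frozen=True)
class Annotation:
    """One user's annotation. All field names are hypothetical."""
    user: str
    # (page URL, page-part selector) pairs; one annotation may cover
    # several parts of a page and several pages at once.
    targets: tuple[tuple[str, str], ...]
    facts: tuple[Triple, ...]
    # Assumed convention: +1 relevant, -1 not relevant, 0 no opinion.
    rating: int = 0


class AnnotatedIndex:
    """Toy index built from user annotations instead of raw page text."""

    def __init__(self) -> None:
        self._facts: dict[str, set[Triple]] = defaultdict(set)
        self._votes: dict[str, int] = defaultdict(int)

    def add(self, annotation: Annotation) -> None:
        # Selectors are ignored here; each page is counted once per
        # annotation even if several of its parts are targeted.
        urls = {url for url, _selector in annotation.targets}
        for url in urls:
            self._facts[url].update(annotation.facts)
            self._votes[url] += annotation.rating

    def search(self, term: str) -> list[tuple[str, int]]:
        """Return (url, score) pairs, highest score first.

        Assumed score: matching facts on the page plus summed ratings.
        """
        needle = term.lower()
        results = []
        for url, facts in self._facts.items():
            matches = sum(
                1 for fact in facts
                if any(needle in part.lower() for part in fact)
            )
            if matches:
                results.append((url, matches + self._votes[url]))
        return sorted(results, key=lambda item: (-item[1], item[0]))
```

In this sketch, a single annotation that targets two pages attaches the same facts to both, and a later negative rating from another user lowers one page's score without removing its facts. A real system would also need to weigh annotator trust and guard against abuse, which the sketch leaves out.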