Method and System for Assigning Documents to Search Engines Located at Multiple Sites

IP.com Disclosure Number: IPCOM000210362D
Publication Date: 2011-Aug-31
Document File: 4 page(s) / 74K

Publishing Venue

The IP.com Prior Art Database

Related People

Flavio Junqueira: INVENTOR [+3]

Abstract

A method and system for assigning documents to search engines located at multiple sites is disclosed. A machine-learned document assignment strategy is used wherein the locality of document views in search results is used to decide upon assignments of documents to search engines located at multiple sites.

This is the abbreviated version, containing approximately 37% of the total text.

Description

Disclosed is a method and system for assigning documents to search engines located at multiple sites. A machine-learned document classification approach is used to make these assignments.

Initially, one or more features are disclosed that may correlate with the occurrence count of documents in search results. One such feature, “region” (F-region), represents the geographical region of the website from which a document is crawled. To obtain this information accurately, commercial search engines use a document classifier to predict the geographical regions of documents. This classifier combines various features, such as the Uniform Resource Locator (URL) domain, to assign a region to every document. The “region” feature carries valuable information because a high fraction of user queries are regional queries that seek documents in the geographical neighborhood of the user.

Another content-based feature, “language” (F-language), represents the language of the document. A text classifier may be used to determine the language of a document. The “language” feature also provides valuable information because queries are much more likely to match documents in the same language. Yet another feature, “document quality”, represents the quality of the document. The quality of a document may be computed in a variety of ways, using information extracted from its content (e.g., spam classification) or link information in the web graph (e.g., PageRank). Two quality metrics, F-linkQuality and F-hostQuality, are evaluated. F-linkQuality uses the incoming links of the document to compute a quality value, whereas F-hostQuality computes the quality based on the host of the document.

Nonetheless, in certain cases, the only information available about the document is its URL. Therefore, four “URL” features may be extracted from the URL: F-length, F-port, F-query, and F-depth. F-length represents the length of the URL, whereas F-port represents the port number of the HTTP server from which the document is fetched. F-query is a categorical feature that takes the value 1 or 0 depending on the query component of the URL. Finally, F-depth indicates the depth of the document in the storage hierarchy, i.e., the number of slashes in the path component of the URL. Similarly, “size” features such as F-htmlSize and F...
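
The four URL features named above (F-length, F-port, F-query, F-depth) may be illustrated with a minimal sketch in Python. The function name url_features and the example URL are hypothetical. The sketch also makes two assumptions that the disclosure does not state: F-query is taken to be 1 when the URL has a non-empty query component and 0 otherwise, and when the URL names no port, the scheme's default port is used.

```python
from urllib.parse import urlsplit

# Assumption: fall back to the scheme's default port when the URL names none.
DEFAULT_PORTS = {"http": 80, "https": 443}


def url_features(url: str) -> dict:
    """Compute the four URL-derived features for a single document URL."""
    parts = urlsplit(url)
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(parts.scheme)
    return {
        # Length of the full URL string, in characters.
        "F-length": len(url),
        # Port of the HTTP server the document is fetched from.
        "F-port": port,
        # Assumption: 1 if the URL has a non-empty query component, else 0.
        "F-query": 1 if parts.query else 0,
        # Number of slashes in the path component of the URL.
        "F-depth": parts.path.count("/"),
    }


example_url = "http://example.com:8080/a/b/page.html?id=7"
features = url_features(example_url)
```

Because all four values depend only on the URL string, they can be computed before a document is fetched, which matches the case described above where the URL is the only information available.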