Extraction and Transformation of Dates from Web Content to Improve Relevancy and Date Sorting for Search Results

IP.com Disclosure Number: IPCOM000201020D
Publication Date: 2010-Nov-04
Document File: 3 page(s) / 25K

Publishing Venue

The IP.com Prior Art Database

Abstract

Disclosed is a method to improve search results for users employing web-based search engines. The invention focuses on having content authors provide publish/update dates as a meta tag, which is invisible to users on web pages. Search engine crawlers identify the meta tags, and use the information to present the end user with the most recently updated web content ranked higher in the results list.
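The crawler-side step described above might look roughly like the following minimal sketch. The meta tag names `publish-date` and `update-date`, the ISO `YYYY-MM-DD` format, and the helper names are assumptions made for illustration; the disclosure does not specify them.

```python
from datetime import date
from html.parser import HTMLParser

# Hypothetical meta tag names; the disclosure does not name the tags.
DATE_META_NAMES = {"publish-date", "update-date"}


class DateMetaParser(HTMLParser):
    """Collects author-supplied dates from <meta name=... content=...> tags."""

    def __init__(self):
        super().__init__()
        self.dates = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        content = (attr.get("content") or "").strip()
        if name in DATE_META_NAMES:
            try:
                # Assumes ISO dates such as 2010-11-04.
                self.dates[name] = date.fromisoformat(content)
            except ValueError:
                pass  # ignore malformed dates rather than guessing


def extract_author_dates(html):
    """Return a dict mapping meta tag name to the date found in the page."""
    parser = DateMetaParser()
    parser.feed(html)
    return parser.dates


def sort_most_recent_first(pages):
    """pages is a list of (url, html); pages without a usable date sort last."""
    def latest_date(page):
        dates = extract_author_dates(page[1])
        return dates.get("update-date") or dates.get("publish-date") or date.min
    return sorted(pages, key=latest_date, reverse=True)
```

Calling `sort_most_recent_first` on a list of `(url, html)` pairs orders them by the author-supplied update date, falling back to the publish date. A real engine would presumably blend this date with other relevance signals rather than sort on it alone.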

This is the abbreviated version, containing approximately 37% of the total text.

The goal of any web-based search tool is to return the best possible results based upon user-entered criteria such as keywords. Search engines return results with the items most relevant to the search terms high in the results list. This is a vital attribute, especially if the results are not relevant

The ways in which search engines determine relevance are varied and complex. One often-overlooked variable is how recent the material is. How much recency contributes to relevance can vary with the type of query. For some types of queries, the latest pages rank higher in the results list. Date trend analyses can also drive relevance determination. Furthermore, most search engines let the user sort or filter by date, which serves users who know they want the most recent material on a topic.

Given that the date of materials is important to search engines both explicitly (as controls offered to users) and implicitly (as a contributor to determining relevance), one would expect these engines to aim for extreme accuracy in how the publish dates of pages are recorded. In reality, they are lacking in this regard.

Briefly consider how search engines work. In simple terms, search engines find pages and note the date on which they find them. They also re-crawl pages already found and note whether or not they have changed. What is reported to the user, and employed behind the scenes to help determine relevance, is the date the search engine found the page or found that a change had been made. Thus the "crawl date" (the date the search engine found the page or the change) is used, from the search engine's perspective, as both the "published on" date and the "updated" date.
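The bookkeeping just described can be sketched as follows. This is a simplified model under stated assumptions, not how any particular engine works: the `CrawlRecord` type, the `record_crawl` function, and the use of a SHA-256 hash of the raw HTML as the change detector are all hypothetical.

```python
import hashlib
from dataclasses import dataclass
from datetime import date


@dataclass
class CrawlRecord:
    fingerprint: str    # hash of the raw HTML at the last detected change
    published_on: date  # really the date the crawler first found the page
    updated: date       # really the date the crawler last saw a change


def record_crawl(index, url, raw_html, crawl_date):
    """Update the index (a dict of url -> CrawlRecord) after crawling one page."""
    fingerprint = hashlib.sha256(raw_html.encode("utf-8")).hexdigest()
    record = index.get(url)
    if record is None:
        # First sighting: the crawl date stands in for the publish date.
        index[url] = CrawlRecord(fingerprint, crawl_date, crawl_date)
    elif record.fingerprint != fingerprint:
        # Any byte-level difference counts as an update in this model.
        record.fingerprint = fingerprint
        record.updated = crawl_date
    return index[url]
```

In this model, a page written on 2010-10-01 but first crawled on 2010-11-04 is recorded with a "published on" date of 2010-11-04, because the crawler has no other date to use.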

There are a number of ways the crawl date can be misleading, especially from a user's perspective:
1. Crawlers don't always find pages immediately. Suppose an author updates a web page today. If the search engine crawler doesn't reach the page until next month, the engine will report to users that the page was updated a month from now. Until that crawl happens, the engine still shows the older crawl date, so the user concludes that the information on the page is much more out-of-date than it actually is.

2. Users and web crawlers may not have the same definition of an update. A page author could, for example, update a couple of meta tags (say, by adding a few items to the keywords meta tag). Suppose the crawler visits the page today. Having detected the meta tag change, the crawler would set the crawl date to today. However, meta tags are invisible to users, so a user familiar with the page would be confused as to why the search engine thinks it was recently updated.

3. Deployment changes can lead to a new crawl date. When a site gets moved to a new server, for example, the crawler will see all the pages as changed. This means that very old content can be given a very recent crawl date, and fro...