Web page tracking system
Original Publication Date: 2005-Dec-05
Included in the Prior Art Database: 2005-Dec-05
A system to uniquely identify a document
On the Internet, Web pages change regularly, causing confusion and frustration for users.
In one scenario, a user visits a Web page (e.g. http://www.corp.com) and views information of interest. The user then decides to revisit the Web page later. When the user returns, they may find that the information is no longer present, i.e. the Web page has been updated. Typically, a user is not informed of such updates because, for example, there is no publish-subscribe mechanism available. In addition, the Internet does not support any form of standardised version control that would allow a user to track back through the updates made to the Web page and select the appropriate version.
One solution to this problem is an archive service, which can take snapshots of a Web page. However, in order to retrieve the information, the user relies on a snapshot having been taken before the Web page was modified. Furthermore, although an archive service can capture information, the archived Web page may not look the same as the Web page previously viewed by the user.
Thus, a user is faced with a number of problems when a Web page is updated. For example, if a Web page (and/or information within a Web page) is updated (e.g. a Web page is moved) without notification to a user, the user may not be able to access the original Web page (e.g. since they are not aware of the new location of the Web page).
In order for a user to view a Web page that is exactly the same as the Web page viewed on a previous visit, an additional comparison step is required.
The article details a system and method for uniquely identifying documents (e.g. Web pages) that are not maintained within a single document management system.
In order to keep track of individual versions of Web pages, extra data is associated with a link to that page, for example a hash value computed over the content of the Web page. To determine whether a Web page has been updated, the hash value associated with the original Web page is compared with a newly computed hash value for the Web page as it currently stands. If the hash values match (see below for a brief discussion on false positives), it is reasonable to assume that the Web page has not changed. If the hash values do not match, the Web page has changed in some way. It should be understood that this article does not attempt to describe a system that determines how a Web page has changed; it provides a system and method that determines that a change has been made.
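The comparison described above can be illustrated with a minimal sketch. It assumes SHA-256 as the hash function and hashes the raw bytes of the page; the article does not prescribe either choice. The function names `content_digest` and `has_changed` and the sample page contents are hypothetical and used only for illustration.

```python
import hashlib


def content_digest(content: bytes) -> str:
    """Return the hex SHA-256 digest of a page's raw content.

    SHA-256 is an assumption; the article only requires "a hash value".
    """
    return hashlib.sha256(content).hexdigest()


def has_changed(stored_digest: str, current_content: bytes) -> bool:
    """Return True if the current content no longer matches the stored digest."""
    return content_digest(current_content) != stored_digest


# Hypothetical page contents standing in for two fetches of the same URL.
original = b"<html><body>Quarterly results</body></html>"
updated = b"<html><body>Quarterly results (revised)</body></html>"

# Digest recorded alongside the link at the time the link is created.
link_digest = content_digest(original)

print(has_changed(link_digest, original))  # False: content is unchanged
print(has_changed(link_digest, updated))   # True: content differs
```

As in the text, the sketch reports only that a change occurred, not what changed. A match is strong evidence, rather than proof, that the content is identical.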
Below is a description of how a Web page that has been moved can be located (in other words, how a user can find the original Web page).
1. When a link to a Web page is created, the link comprises extra data (e.g. a message digest value) in the text of the link, wherein the actual HTTP link will remain unchan...