
Description of a web site map and other hypertext networks using an XML grammar

IP.com Disclosure Number: IPCOM000125915D
Original Publication Date: 2005-Jun-22
Included in the Prior Art Database: 2005-Jun-22
Document File: 5 page(s) / 44K

Publishing Venue

IBM

Abstract

Disclosed is a method to describe the structure of a hypertext network using grammars, in particular XML-related grammars such as the DTD format.





Johnny Accot


Main Idea

Introduction

Most web sites provide a "site map", that is, a summary of all the pages offered to the client. Site maps are very useful when one wants to find a particular page in a complex site and a search is not desired or appropriate. They are also very useful for web crawlers, whose aim is precisely to find and index all the pages of a site.

Today, there is no specific standard for describing a site map. First, in terms of format, a site map is usually implemented as an HTML file linked from the site's home page. Second, in terms of location, there is no standard place where a site-map file is to be located and queried. In other words, site maps are completely site-specific and follow no precise rules.

The lack of a standard format and location for site maps has negative consequences for the different web components. A first consequence is that a client program can neither automatically find the site-map file nor parse it. The only real consumer of current site maps is thus a human user, who has the cognitive capability to find and understand the map; it is almost impractical to design an algorithm that could retrieve and analyze the map describing a given site. The second consequence is that, if an algorithm needs a representation of a site map, since it cannot download one from a single file, it has to query all the pages of the web site recursively and build its own internal representation of the map. Furthermore, if it needs to save the resulting site map, it has to use a custom format, given the lack of a standard, which limits portability and data sharing. Finally, the recursive retrieval of an entire web site in order to build a site map is a very costly operation for both the client and the web server, in terms of both network bandwidth and CPU. The client has to query every page, with all its content, just to extract the link topology; the server, in turn, has to send all of its data each time a new client needs a complete site map.

Given the limitations described above, it seems necessary to provide a method for querying site maps in an efficient manner. In particular, it should be possible to describe an entire site, with all its internal links and possibly its external links, in a single file, located in a standard location and encoded in a standard format. A client that needs the site map can then simply retrieve this file, obtaining in a few kilobytes the information it would otherwise have taken minutes or hours to gather by crawling the entire web site.
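As a concrete illustration, the following is a minimal sketch of what such a site-map file might look like: an XML document carrying its own grammar as an internal DTD subset. The element and attribute names used here (sitemap, page, link, url, title, href, external) are illustrative assumptions only; this extract does not reproduce the actual vocabulary defined by the disclosure.

<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical example; names are illustrative, not the disclosed grammar. -->
<!DOCTYPE sitemap [
  <!-- A site map is a set of pages; each page lists the hyperlinks
       it contains, so the link topology of the whole site can be
       recovered from this single file, without crawling. -->
  <!ELEMENT sitemap (page+)>
  <!ELEMENT page (link*)>
  <!ATTLIST page url   CDATA #REQUIRED
                 title CDATA #IMPLIED>
  <!ELEMENT link EMPTY>
  <!ATTLIST link href     CDATA #REQUIRED
                 external (yes|no) "no">
]>
<sitemap>
  <page url="/index.html" title="Home">
    <link href="/products.html"/>
    <link href="/contact.html"/>
    <link href="http://example.org/partner" external="yes"/>
  </page>
  <page url="/products.html" title="Products">
    <link href="/index.html"/>
  </page>
  <page url="/contact.html" title="Contact"/>
</sitemap>

With a file of this kind, a client or crawler could validate the document against its DTD and reconstruct the complete link topology of the site from a single request, rather than fetching every page.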

The method described below aims to describe the site map, that is, the network of hypertext...