
A Method for Enabling Web Crawlers to Find Web Pages that are not Pointed to by Existing Hyperlinks but whose Meta Data Resides in a Data Store

IP.com Disclosure Number: IPCOM000131134D
Original Publication Date: 2005-Nov-07
Included in the Prior Art Database: 2005-Nov-07
Document File: 3 page(s) / 99K

Publishing Venue

IBM

Abstract

Web pages can exist in several file formats, such as HTML and PDF. A Web page is normally reached by hyperlink from other Web pages. Often, however, knowledge that a Web page exists is held only in a data store. The data store does not contain the file itself, only knowledge of it. This "knowledge" is called "meta data". A way is needed to extract meta data from a data store so that Web crawlers can crawl and index the meta data and follow its link to the content on the Web.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 51% of the total text.


Web pages can exist in several file formats, such as HTML and PDF. A Web page is normally reached by hyperlink from other Web pages. Often, however, knowledge that a Web page exists is held only in a data store. The data store does not contain the file itself, only knowledge of it. This "knowledge" is called "meta data".

Meta data about a Web page can be as simple as the uniform resource locator (URL) of the file, but more often it also includes items such as the title of the page and a brief description of its content. These additional pieces of meta data enable the user to query the data store, or "meta data store", to locate Web pages of interest.

The result of querying a meta data store can be a list of Web pages or other Web content that satisfies the query criteria. The list of files can be in HTML format so that the user can click on a hyperlink to access any of the files.
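As a rough illustration of such a query, the sketch below uses an in-memory SQLite database as a stand-in for the meta data store and renders the matching rows as an HTML list of hyperlinks. The table name `pages`, its columns, the example URLs, and the function names are all assumptions made for this example; the disclosure does not specify a schema or a query language.

```python
import sqlite3
from html import escape


def build_demo_store():
    """Create an in-memory stand-in for the meta data store (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pages (url TEXT, title TEXT, description TEXT)")
    conn.executemany(
        "INSERT INTO pages VALUES (?, ?, ?)",
        [
            ("http://example.com/docs/install.pdf", "Install Guide",
             "How to install the product"),
            ("http://example.com/docs/tuning.html", "Tuning Tips",
             "Performance tuning advice"),
        ],
    )
    return conn


def query_store(conn, keyword):
    """Return (url, title, description) rows whose title or description contains keyword."""
    pattern = f"%{keyword}%"
    return conn.execute(
        "SELECT url, title, description FROM pages "
        "WHERE title LIKE ? OR description LIKE ? ORDER BY title",
        (pattern, pattern),
    ).fetchall()


def render_results(rows):
    """Render query results as an HTML list in which each item links to its file."""
    items = "\n".join(
        f'<li><a href="{escape(url)}">{escape(title)}</a>: {escape(desc)}</li>'
        for url, title, desc in rows
    )
    return f"<ul>\n{items}\n</ul>"


if __name__ == "__main__":
    store = build_demo_store()
    print(render_results(query_store(store, "install")))
```

A user would click one of the generated links to open the file; a crawler never sees this list unless it can issue the query itself.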

Querying a meta data store may be the only way to reach Web pages that are not accessible via hyperlink from other Web pages. Web crawlers cannot find such pages, because crawlers "crawl" the Web by following hyperlinks in Web pages.
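To make the limitation concrete, here is a minimal sketch of link-following discovery over a tiny in-memory "site". The `SITE` dictionary, its paths, and the `crawl` function are hypothetical; a real crawler fetches pages over HTTP and resolves relative URLs, which this sketch omits.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(site, start):
    """Breadth-first discovery: return every page reachable from start via hyperlinks."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        parser = LinkExtractor()
        parser.feed(site.get(page, ""))
        for link in parser.links:
            if link in site and link not in seen:
                seen.add(link)
                queue.append(link)
    return seen


# Hypothetical site: /orphan.html exists but no other page links to it.
SITE = {
    "/index.html": '<a href="/about.html">About</a>',
    "/about.html": '<a href="/index.html">Home</a>',
    "/orphan.html": "<p>Known only to the meta data store.</p>",
}

if __name__ == "__main__":
    print(sorted(crawl(SITE, "/index.html")))
```

Starting from `/index.html`, the crawl reaches `/about.html` but never `/orphan.html`, which is the situation the disclosure sets out to address.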

A way is needed for crawlers to find a Web page that is not hyperlinked from any other Web page but whose meta data exists in a meta data store. You can programmatically extract the meta data about a Web page from the meta data store and build a small HTML file that contains the meta data, including a hyperlink to the target Web page. This HTML file can be called a "mini-page" since it is a miniature representation of its target Web page's content. The mini-page is an HTML page that includes meta data from the meta data store in HTML <meta> tags in the <head> section of the file, as well as meta data formatted as readable text in the <body> section of the HTML.
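One possible shape for a mini-page generator is sketched below. It assumes the meta data consists of a URL, a title, and a description, and that the description goes into a `<meta name="description">` tag in the head while the title, description, and hyperlink appear as readable text in the body. The function name `build_mini_page` and the exact markup are illustrative choices, not taken from the disclosure.

```python
from html import escape


def build_mini_page(url, title, description):
    """Build a small HTML page that carries a target page's meta data and links to it."""
    t = escape(title)
    d = escape(description)  # escape() also escapes quotes, so d is safe in attributes
    u = escape(url)
    return (
        "<html>\n<head>\n"
        f"<title>{t}</title>\n"
        f'<meta name="description" content="{d}">\n'
        "</head>\n<body>\n"
        f"<h1>{t}</h1>\n"
        f"<p>{d}</p>\n"
        # The hyperlink is what lets a crawler continue on to the target page.
        f'<p><a href="{u}">{t}</a></p>\n'
        "</body>\n</html>\n"
    )


if __name__ == "__main__":
    print(build_mini_page(
        "http://example.com/docs/install.pdf",
        "Install Guide",
        "How to install the product",
    ))
```

Escaping every value before it is placed in the markup keeps characters such as `<` or `"` in the stored meta data from breaking the generated page.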

This alone is still insufficient for a crawler to find the mini-page, because the mini-page is itself disconnected: no other Web page hyperlinks to it. You can therefore also programmatically create an HTML file that lists all of the mini-pages created from the meta data store. Each list item is a hyperlink to its corresponding mini-page.
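The list page could be generated along the following lines. The sketch assumes each mini-page has already been written to a file and is described by a (href, title) pair; the function name `build_list_page`, the page title, and the example file names are hypothetical.

```python
from html import escape


def build_list_page(mini_pages):
    """Build an HTML page whose list items each link to one mini-page.

    mini_pages is an iterable of (href, title) pairs.
    """
    items = "\n".join(
        f'<li><a href="{escape(href)}">{escape(title)}</a></li>'
        for href, title in mini_pages
    )
    return (
        "<html>\n<head><title>Mini-page list</title></head>\n<body>\n"
        f"<ul>\n{items}\n</ul>\n"
        "</body>\n</html>\n"
    )


if __name__ == "__main__":
    print(build_list_page([
        ("mini/install.html", "Install Guide"),
        ("mini/tuning.html", "Tuning Tips"),
    ]))
```

A crawler that reaches this one page can then follow each list item to a mini-page, and from each mini-page to its target content.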

However, this is still insufficient for a crawler to find any of the mini-pages and, ultimately, their target content pages. Next, you must create a hyperlink to this list of mini-page hyperlinks from a Web page on a Web site. Because this link exists only to enable crawlers to find Web pages whose meta data is buried in a data store, you probably do not want it to be visible. You could create an "invisible" hyperlink in a Web page on a Web site by adding a transparent one-pixel-square graphical image that is contained in a hyperlink to the Web page of hyperlink...
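The invisible hyperlink described above might be produced as in the following sketch: an anchor that wraps a one-pixel image and is inserted into an existing page just before its closing body tag. The file names `/mini/list.html` and `/images/clear.gif`, both function names, and the specific image attributes are assumptions for illustration only.

```python
from html import escape


def invisible_link(list_page_url, pixel_image_url):
    """Return an anchor wrapping a 1x1 image that points at the list page.

    pixel_image_url is assumed to name a transparent one-pixel image.
    """
    return (
        f'<a href="{escape(list_page_url)}">'
        f'<img src="{escape(pixel_image_url)}" width="1" height="1" alt="" border="0">'
        "</a>"
    )


def add_link_to_page(page_html, link_html):
    """Insert link_html just before the last closing </body> tag of page_html."""
    head, sep, tail = page_html.rpartition("</body>")
    if not sep:
        # No closing body tag found: append the link at the end instead.
        return page_html + link_html
    return head + link_html + "\n" + sep + tail


if __name__ == "__main__":
    link = invisible_link("/mini/list.html", "/images/clear.gif")
    print(add_link_to_page("<html><body><p>Home</p></body></html>", link))
```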