Dismiss
InnovationQ will be updated on Sunday, Oct. 22, from 10am ET - noon. You may experience brief service interruptions during that time.
Browse Prior Art Database

Clustering Tree Algorithm for Web Page Classification

IP.com Disclosure Number: IPCOM000035209D
Original Publication Date: 2005-Jan-20
Included in the Prior Art Database: 2005-Jan-20
Document File: 1 page(s) / 36K

Publishing Venue

IBM

Abstract

The algorithm presented enables fast and efficient (~O(log N)) insertions, updates and searches to a tree whose nodes all have the property of being able to specify whether a specific phrase or phrases may appear in the leaves below them.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 80% of the total text.

Page 1 of 1

Clustering Tree Algorithm for Web Page Classification

The problem is clustering of web-pages for fast searching based on the phrases present in the document. Techniques exist for searching, often based on word-frequency tables, for documents that contain certain words, but the problem of finding highly similar documents - in fact searching a large document space for documents similar to one of the documents - requires a different approach.

    The tree is used for searching for pages with similar "content-phrase fingerprints", as described in other disclosures. Given the nature of the problem, searches should be extremely fast even when operating over a document spaces of the order of 8 billion documents (the Internet), distributable over multiple machines, and - since spiderings are iterative rather than batch operations in the ideal case - able to be built and maintained iteratively rather than needing to be built as a batch job.

    The algorithm presented enables fast (~log(N) time) insertions, updates and searches to a tree whose nodes all have the property of being able to specify whether a specific phrase or phrases may appear in the leaves beneath them.

    An iterative algorithm is presented suited to the task of clustering web-pages based on the overlap of their content phrases. The nodes of the tree use "fuzzy hash sets" which store only the hashed values of their contents and not the contents themselves, and accept a certain degree of collision (within a given pr...