
Method to Detect Index Web Page

IP.com Disclosure Number: IPCOM000247573D
Publication Date: 2016-Sep-18
Document File: 7 page(s) / 152K

Publishing Venue

The IP.com Prior Art Database

Abstract

A method is proposed to detect index web pages with a complex classifier that uses features based on link partitions and text partitions, including the density of link partitions. The method extracts features from each link and text partition, such as location, number of rows, and number of characters. It also computes the distance between each pair of adjacent link partitions, and then the average and variance of these distances. These features are then modeled with any machine learning method to classify whether a web page is an index page or not.
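The density features named above can be sketched as follows. This is only a minimal illustration: it assumes each link partition is represented by its starting row number in the page text, which this excerpt does not specify. The function name `link_density_features`, the use of population variance, and the zero defaults for pages with fewer than two link partitions are likewise assumptions.

```python
from statistics import mean, pvariance


def link_density_features(link_rows: list[int]) -> dict:
    """Summarize the spacing of link partitions (hypothetical helper).

    link_rows holds the starting row of each link partition; this
    representation is an assumption, not taken from the disclosure.
    """
    rows = sorted(link_rows)
    # Distance between each pair of adjacent link partitions.
    gaps = [b - a for a, b in zip(rows, rows[1:])]
    if not gaps:
        # Fewer than two link partitions: no distances to summarize.
        return {"num_link_partitions": len(rows), "mean_gap": 0.0, "var_gap": 0.0}
    return {
        "num_link_partitions": len(rows),
        "mean_gap": mean(gaps),
        "var_gap": pvariance(gaps),  # population variance (assumed choice)
    }
```

Evenly spaced link partitions, as in a typical list of headlines, give a small variance, while links scattered through a long article tend to give a larger one. The resulting dictionary can be passed to any classifier alongside the per-partition features.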

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 59% of the total text.


Background
An index web page, or a similar page such as a homepage, contains many links, sometimes with a short summary below each link, and has no single main content or topic. By contrast, a content web page has at least one main body of content that describes a news item, a technology, or some other topic.


Example Index Web Page


Example Content Web Page

Problem
Detecting index web pages is useful for retrieving real-time web pages and for extracting text from web pages. It is important in areas such as:

News search
Monitoring public opinion
Alerting on negative news about a company or product

Index web page detection can be applied to retrieving real-time web pages as follows:

Crawl all the web pages of the targeted web sites
Automatically detect the index web pages
Crawl real-time web pages using the index web pages and a scheduling policy

Prior art

General rules to filter index web pages, such as a page consisting entirely of links or containing few or no full stops
Classification with limited features such as URL (Uniform Resource Locator) length, URL depth, and number of stops
Methods used to extract the main content of a web page

These methods have problems with special cases, such as index web pages that have a content-like summary below each link.
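To make the limited prior-art feature set concrete, the sketch below computes URL length, URL depth, and a stop count for a page. The function name `baseline_features` is hypothetical. Defining depth as the number of non-empty path segments and reading "stops" as full-stop characters are both assumptions, since the disclosure defines neither.

```python
from urllib.parse import urlparse


def baseline_features(url: str, text: str) -> dict:
    """Prior-art style features for one page (hypothetical helper)."""
    path = urlparse(url).path
    # Assumed definition of depth: count of non-empty path segments.
    segments = [s for s in path.split("/") if s]
    return {
        "url_length": len(url),
        "url_depth": len(segments),
        # Assumes "stops" means full-stop characters in the page text.
        "num_full_stops": text.count("."),
    }
```

Features of this kind carry no information about where links and text sit on the page, so an index page whose links are each followed by a summary sentence can look much like a content page.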


Solution

A method to detect index web pages with a complex classifier using features based on link and text partitions, including the following steps:

Link classification: a line occupied entirely by links, or by a link with a time or a list number, is called a link partitio...