
A Generic Method for Crawling Dynamic Web Pages

IP.com Disclosure Number: IPCOM000206840D
Publication Date: 2011-May-10
Document File: 3 page(s) / 86K

Publishing Venue

The IP.com Prior Art Database

Abstract

Crawling is a fundamental task in Search Technologies. In this paper, a new crawler is proposed which analyzes a page in two steps: first it finds all click points and sub-menus by hovering the mouse pointer over them, then it clicks each click point to find all possible links.


A Generic Method for Crawling Dynamic Web Pages

   Crawling is a fundamental task in Search Technologies. By extracting the links in a web page, a crawler can quickly access a large number of web pages. But this task is becoming more and more difficult with the widespread use of JavaScript and Flash.

   In JavaScript web pages, some links are dynamically generated by JavaScript based on mouse or keyboard input from the user. It's almost impossible for a crawler to acquire these links by analyzing the HTML content.

   Adobe Flash makes crawling even more difficult. It's impossible for a crawler to understand the internal logic of a Flash file, so web pages based on Flash can only be regarded as a black box.

   This invention proposes a general method that can extract links from any web page, including pages composed with JavaScript or Flash and pages that only support a specific browser.

   For any web page which can't be processed by other crawlers, our crawler will automatically load it in a web browser. The web browser is configured to use an HTTP proxy which is controlled by our crawler. Anything sent out or received by the web browser is visible to our crawler (except secured websites).
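   The disclosure does not name a particular browser-automation tool or proxy, so the following is only one possible illustration of loading a page through a crawler-controlled proxy. It assumes Python with Selenium and Chrome, and a proxy already listening at 127.0.0.1:8080 that records every request for the crawler; all of these specifics are assumptions, not part of the original text.

from selenium import webdriver

PROXY_ADDRESS = "127.0.0.1:8080"  # hypothetical address of the crawler-controlled proxy

# Point the browser at the proxy so that everything it sends or receives is
# visible to the crawler (the proxy that records this traffic is not shown).
options = webdriver.ChromeOptions()
options.add_argument("--proxy-server=http://" + PROXY_ADDRESS)

driver = webdriver.Chrome(options=options)
driver.get("http://example.com/")  # load the dynamic page through the proxy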


Our crawler uses two steps to analyze a page. First it will find all click points and sub-menus by hovering the mouse pointer over them. Then it will click each click point and find all possible links.
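   As an illustration of the first (hover) step, here is a minimal sketch that hovers over candidate elements to reveal JavaScript-generated sub-menus and records any links that appear. It continues the Selenium-based assumptions above, and the choice of candidate elements is likewise an assumption, not part of the disclosure; section 3.1 below describes this analysis in the disclosure's own terms.

from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By

def links_revealed_by_hovering(driver):
    # Hover over candidate elements (list items, spans and divs are assumed
    # candidates here) and record the links that become visible afterwards.
    found = set()
    for element in driver.find_elements(By.CSS_SELECTOR, "li, span, div"):
        try:
            ActionChains(driver).move_to_element(element).perform()
        except Exception:
            continue  # element may be invisible or detached; skip it
        # Sub-menus revealed by the hover are ordinary <a> elements in the
        # DOM by now, so collect their targets.
        for anchor in driver.find_elements(By.TAG_NAME, "a"):
            href = anchor.get_attribute("href")
            if href:
                found.add(href)
    return found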
3.1 Analyze page by hovering


Our crawler will do the following steps to analyze the links in a web page:

   Our crawler will take a screenshot of the web page and analyze the content of the page. For any area on the web page that looks like a link, our crawler will click it. If the backend HTTP proxy receives an HTTP request after the click, i...
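   The extracted text is cut off at this point, but the click-and-observe idea in the last paragraph can be sketched as follows: click a candidate point on the page and treat it as a link if the backend proxy saw new HTTP requests afterwards. The screenshot analysis that finds link-like areas and the proxy's request log are only stubbed here (the find_link_like_areas() helper and the proxy_log handle are hypothetical); this is an illustration under those assumptions, not the disclosed implementation.

import time

def find_link_like_areas(screenshot_png):
    # Hypothetical placeholder: the disclosure analyzes a screenshot to find
    # areas that "look like a link"; that image analysis is not reproduced here.
    return []  # would return a list of (x, y) viewport coordinates

def links_revealed_by_clicking(driver, proxy_log):
    # proxy_log is a hypothetical handle to the crawler-controlled HTTP proxy;
    # proxy_log.requests is assumed to be the list of requests recorded so far.
    discovered = []
    for (x, y) in find_link_like_areas(driver.get_screenshot_as_png()):
        seen_before = len(proxy_log.requests)
        # Click whatever element is currently rendered at this point.
        driver.execute_script(
            "var el = document.elementFromPoint(arguments[0], arguments[1]);"
            " if (el) { el.click(); }", x, y)
        time.sleep(1)  # give the page a moment to send any resulting request
        new_requests = proxy_log.requests[seen_before:]
        if new_requests:
            # The click produced real HTTP traffic, so record the requested
            # URLs as links discovered on this page.
            discovered.extend(request.url for request in new_requests)
    return discovered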