Method and System for Extracting Content from Web-pages for a Search Index

IP.com Disclosure Number: IPCOM000197956D
Publication Date: 2010-Jul-23
Document File: 2 page(s) / 28K

Publishing Venue

The IP.com Prior Art Database

Related People

Jayant Shekhar: INVENTOR

Abstract

Disclosed is a method and system for indexing content from web-pages for a search index.

This text was extracted from a Microsoft Word document.
This is the abbreviated version, containing approximately 54% of the total text.

Description

A method and system for indexing content from web-pages for a search index is disclosed. 

A web crawler generally traverses web-pages and extracts content from them. The extracted content is then used for a search index. What is extracted is determined by logic built into the web crawler. As a result, content publishers cannot define custom logic for extracting content from the web-pages they create.

The method and system disclosed herein enable content publishers to submit a configuration file together with a list of URLs. Based on the configuration file, the method and system extract relevant content from the web-pages corresponding to the list of URLs. The extracted content may then be used for a search index. Content publishers can thus define custom extraction logic by submitting the configuration file. In one instance of the method and system, the configuration file may be a script. The list of URLs may be provided as a set of regular expressions.
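As a minimal sketch of how a list of URLs supplied as regular expressions might select the web-pages to process, the following Python fragment checks a URL against a set of patterns. The patterns, the domain shop.example.com, and the function name is_covered are hypothetical; the disclosure does not specify a pattern syntax or a matching rule, so whole-URL matching is an assumption made here.

```python
import re

# Hypothetical URL patterns a content publisher might submit.
# The disclosure only says the list "may be provided as a set of
# regular expressions"; the exact syntax is an assumption.
URL_PATTERNS = [
    r"https?://shop\.example\.com/product/\d+",
    r"https?://shop\.example\.com/item/[a-z0-9-]+",
]

# Compile once so repeated checks during a crawl stay cheap.
COMPILED_PATTERNS = [re.compile(pattern) for pattern in URL_PATTERNS]


def is_covered(url: str) -> bool:
    """Return True if the entire URL matches any submitted pattern."""
    return any(pattern.fullmatch(url) for pattern in COMPILED_PATTERNS)
```

Under this sketch, only URLs for which is_covered returns True would be handed to the publisher's extraction logic; all other pages would fall back to the crawler's default behavior.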

For example, a web-page on a shopping portal may have structured content as illustrated below:

- Title of a Product

- Rating

- Price

- Product Details

- Product Description
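The fields listed above could be pulled from such a page with a publisher-supplied configuration. The following Python sketch is illustrative only: it assumes the configuration maps each field name to a CSS class that marks the field in the page's HTML. The class names, the CONFIG dictionary, the FieldExtractor class, and the sample page are all hypothetical; the disclosure does not define a configuration format.

```python
from html.parser import HTMLParser

# Hypothetical configuration: field name -> CSS class marking it on the page.
CONFIG = {
    "title": "product-title",
    "rating": "product-rating",
    "price": "product-price",
    "details": "product-details",
    "description": "product-description",
}

# Elements that never have a closing tag; skipped when tracking nesting.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class FieldExtractor(HTMLParser):
    """Collects the text inside elements whose class appears in the config."""

    def __init__(self, config):
        super().__init__()
        self.class_to_field = {cls: field for field, cls in config.items()}
        self.fields = {}
        self.current = None  # field currently being captured, if any
        self.depth = 0       # nesting depth inside the captured element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.current is not None:
            self.depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        for cls in classes:
            if cls in self.class_to_field:
                self.current = self.class_to_field[cls]
                self.depth = 1
                self.fields.setdefault(self.current, "")
                break

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.current is None:
            return
        self.depth -= 1
        if self.depth == 0:
            # Collapse runs of whitespace once the element is closed.
            text = self.fields[self.current]
            self.fields[self.current] = " ".join(text.split())
            self.current = None

    def handle_data(self, data):
        if self.current is not None:
            self.fields[self.current] += data


def extract(html, config=CONFIG):
    """Return a dict of field name -> extracted text for one page."""
    parser = FieldExtractor(config)
    parser.feed(html)
    parser.close()
    return parser.fields


# Made-up sample page, for illustration only.
PAGE = """
<html><body>
<h1 class="product-title">Acme Kettle</h1>
<span class="product-rating">4.5</span>
<span class="product-price">$29.99</span>
<div class="product-details">1.7 L, stainless steel</div>
<p class="product-description">Boils water <b>fast</b>.</p>
</body></html>
"""

if __name__ == "__main__":
    print(extract(PAGE))
```

Because the mapping lives in the configuration rather than in the crawler, a publisher could change which parts of a page are indexed without any change to the crawler itself.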

Most of the web-pages in the shopping portal will...