Browse Prior Art Database

SCORING TEXT SEGMENTS IN A DOCUMENT BY INTENT

IP.com Disclosure Number: IPCOM000234673D
Publication Date: 2014-Jan-28
Document File: 5 page(s) / 154K

Publishing Venue

The IP.com Prior Art Database

Abstract

Different sections of a particular web page are composed by its author with varying levels of intent. While an article may contain many different segments of text and pictures, located at different places on the page, each of them has been created by the author with a particular level of intent in mind. The author's main intent is the primary content of the web page (most likely the body or the main article), while other segments of the same page carry lesser levels of intent from the author's perspective. We propose a technique to determine this intent level for every segment of a web page and, ultimately, to assign a score that quantifies the author's intent.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 30% of the total text.

The online web is made up of billions of web pages, each with its own template design. For example, the main page of the TimesofIndia (http://www.timesofindia.com) may be laid out one way,

[Screenshot of the TimesofIndia home page omitted from this text extraction.]

while the main page of another news site, viz. the NewYorkTimes (www.nytimes.com), may look quite different:

[Screenshot of the NewYorkTimes home page omitted from this text extraction.]

Likewise, each web page is designed in its own manner, with the intent of publishing different pieces of content at different places in the HTML template. At the same time, the author intends to publish these content items with varying degrees of interest. For example, the author's main intent on the TimesofIndia page would be to publish the news headlines with their snippets. In addition, the author intends to provide space below the articles for comments, where applicable, as well as space for various kinds of advertisements, icons, images, other breaking-news headlines, and so on. Thus, when an HTML page is published with different content items inside its template, each item carries a different measure of importance for that particular page. An information-consuming entity may not be interested in treating all these various pieces of content with the same degree of importance.
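
To make this notion concrete, the relationship between segments and intent scores could be represented as follows. This is a minimal illustrative sketch; the segment labels and score values are hypothetical, not taken from the disclosure:

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """A piece of content from an HTML template, with an intent score."""
    label: str
    text: str
    intent_score: float  # higher = closer to the author's main intent


# Hypothetical scores for the segments of a news front page.
page_segments = [
    Segment("headline", "Main news headline and snippet", 0.95),
    Segment("comments", "Reader comments below the article", 0.40),
    Segment("advertisement", "Banner advertisement", 0.10),
]

# A consuming system processes segments in decreasing order of intent.
for seg in sorted(page_segments, key=lambda s: s.intent_score, reverse=True):
    print(f"{seg.intent_score:.2f}  {seg.label}")
```

An information-consuming entity would then simply walk this ranked list, handling the main article first and the low-scoring clutter last (or not at all).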

Any system that processes information published in HTML templates would want to identify the main content for which the template was designed, and only then the other pieces of content, because the system would want to process the main content before any other content item on the page.

We propose a technique to automatically identify the different sections inside an HTML page, regardless of the page's source, and then assign each section a score according to its importance to the page. The proposed technique takes in an HTML page, along with any markup and styling information, and leverages natural language processing capabilities to identify text segments and score them according to how important each segment is perceived to be within that particular HTML template.
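
As a rough illustration of such a pipeline (the parser and the scoring heuristic below are our own minimal sketch under assumed heuristics, not the disclosed implementation), a segment extractor can walk the markup, record each text run together with its tag path and link context, and then rank the runs:

```python
from html.parser import HTMLParser


class SegmentExtractor(HTMLParser):
    """Collect text segments from an HTML page, tracking the open-tag
    path and whether each text run sits inside a hyperlink."""

    def __init__(self):
        super().__init__()
        self.path = []       # stack of currently open tags
        self.segments = []   # (tag_path, text, in_link) tuples

    def handle_starttag(self, tag, attrs):
        self.path.append(tag)

    def handle_endtag(self, tag):
        if tag in self.path:
            # Pop up to and including the matching tag (tolerates bad nesting).
            while self.path and self.path.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.segments.append((tuple(self.path), text, "a" in self.path))


def score(segment):
    """Hypothetical intent heuristic: long, link-free text is most
    likely the main article body; anchor text is probably navigation."""
    path, text, in_link = segment
    length_score = min(len(text.split()) / 50.0, 1.0)  # cap at 50 words
    link_penalty = 0.2 if in_link else 1.0
    return length_score * link_penalty


html_doc = """<html><body>
<h1>Headline</h1>
<div id="main"><p>""" + "word " * 60 + """</p></div>
<div id="nav"><a href="/x">Home</a> <a href="/y">Sports</a></div>
</body></html>"""

extractor = SegmentExtractor()
extractor.feed(html_doc)
ranked = sorted(extractor.segments, key=score, reverse=True)
```

With this toy input, the 60-word paragraph in the main `div` outranks both the short headline and the link-dominated navigation block. A real system would, as the disclosure suggests, combine such structural cues with natural language processing rather than rely on length and link density alone.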


Currently, various software tools are capable of breaking up an HTML page into text segments. They understand the HTML markup and segment the template into different sections, which are identified from the structural cues available in the markup and styling information of the page. The identified segments are then presented to the end user. A few of the existing tools are as follows:


1. Boilerpipe: http://code.google.com/p/boilerpipe/

Description: Boilerpipe is an algorithm library that can be used to remove the clutter (such as headers, footers and other text) around the main article...
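
Boilerpipe itself is a Java library; the sketch below is a stdlib Python illustration of the kind of shallow-feature heuristic (word count and link density) that such boilerplate-removal tools rely on. It is not Boilerpipe's actual API, and the thresholds are assumed values:

```python
import re


def link_density(block_html):
    """Fraction of a block's words that sit inside <a> tags."""
    anchor_text = " ".join(
        re.findall(r"<a\b[^>]*>(.*?)</a>", block_html, flags=re.S | re.I))
    all_text = re.sub(r"<[^>]+>", " ", block_html)
    total = len(all_text.split())
    return len(anchor_text.split()) / total if total else 1.0


def is_boilerplate(block_html, max_link_density=0.33, min_words=10):
    """Shallow-feature heuristic: very short blocks and link-heavy
    blocks (navigation, footers) are likely boilerplate."""
    words = len(re.sub(r"<[^>]+>", " ", block_html).split())
    return words < min_words or link_density(block_html) > max_link_density


article = "<p>" + "Substantial article sentence with many words. " * 5 + "</p>"
nav = '<p><a href="/">Home</a> <a href="/news">News</a> <a href="/tv">TV</a></p>'
```

Here the long article paragraph is kept while the navigation block, which is both short and entirely anchor text, is flagged as clutter.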