Method and System for Extracting Structured Records from Web Pages

IP.com Disclosure Number: IPCOM000199393D
Publication Date: 2010-Sep-01
Document File: 7 page(s) / 109K

Publishing Venue

The IP.com Prior Art Database

Related People

Sundararajan Sellamanickam: INVENTOR [+5]

Abstract

Disclosed is a method and system for extracting structured records from noisy semi-structured web pages using a Markov Logic Network (MLN).

This text was extracted from a Microsoft Word document.
At least one non-text object (such as an image or picture) has been suppressed.
This is the abbreviated version, containing approximately 26% of the total text.

Description

A method and system for extracting structured records from noisy semi-structured web pages using a Markov Logic Network (MLN) is disclosed.  The method assigns an attribute label to every leaf node of the DOM-tree representation of a web page.  An MLN model is first trained on a set of labeled examples: a sample set of web pages is collected, and the leaf nodes of each page are manually annotated with the attributes specified in the schema.  The MLN model is then specified as a set of pairs (Fi, wi), where Fi is a first-order formula and wi is its corresponding weight.  The weight of a formula measures its importance, and the formulas are defined over a set of application-specific predicates.  The predicates are divided into evidence (or observed) predicates and query (or hidden) predicates, and the formulas capture the relationships between them.
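As a rough illustration of the annotation step, the sketch below collects the text leaves of a page together with their tag paths and attaches one label to each.  It uses only Python's standard html.parser module; the class name LeafTextCollector, the sample page, and the label names are hypothetical and are not taken from the disclosure.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag and therefore never contain text.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class LeafTextCollector(HTMLParser):
    """Collect the text leaves of an HTML page together with their tag path."""

    def __init__(self):
        super().__init__()
        self.path = []    # stack of currently open tags
        self.leaves = []  # (tag_path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.path.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed inner tags.
        if tag in self.path:
            while self.path.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.leaves.append(("/".join(self.path), text))


def leaf_nodes(html):
    collector = LeafTextCollector()
    collector.feed(html)
    collector.close()
    return collector.leaves


# Hypothetical sample page.
page = (
    "<html><body><div>"
    "<h1>Blue Door Cafe</h1><p>12 Main St</p><span>(408) 555-0100</span>"
    "</div></body></html>"
)
leaves = leaf_nodes(page)

# Hypothetical manual annotation: one schema attribute per leaf, in order.
labels = dict(zip((text for _, text in leaves), ["Name", "Address", "Phone"]))
```

Each (tag path, text) pair stands in for one leaf node; a real system would presumably keep richer structural information than a tag path.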

For example, consider a set of web sites W belonging to a specific domain such as Restaurant or Book.  Each such domain has a well-defined schema that specifies the information to be extracted.  The Restaurant schema includes attributes like Name, Address, Price and Phone, whereas the Book schema contains attributes like Name, ISBNCODE10 and NumberOfPages.  The set of attributes to be extracted from the web sites is denoted by A.  In addition to these schema attributes, A includes the special attribute Noise, which marks the noisy information contained in web sites.

Here, the query predicates are the attribute labels assigned to page nodes, such as IsName(n) and IsAddress(n), and the evidence predicates are the observed content and structural features, such as Has5Digits(n), FirstLetterCapital(n) and Close(n1, n2).  Using such predicates, formulas like ∀n Has5Digits(n) ⇒ IsZipCode(n) and ∀n1, n2 IsName(n1) ∧ IsAddress(n2) ⇒ Close(n1, n2) are formed.
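The predicates and formulas above can be sketched in Python as follows.  This is only an illustration: the Node class, the exact definitions of the evidence predicates (for example, that Has5Digits means the text is exactly five digits and that Close means a small gap in document order), and the sample nodes are assumptions, not details from the disclosure.

```python
import re
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Node:
    text: str
    position: int  # index of the leaf in document order (assumed feature)


# Evidence predicates: computed directly from the page content and layout.
def has_5_digits(n):
    # Assumption: the node text consists of exactly five digits.
    return re.fullmatch(r"\d{5}", n.text) is not None


def first_letter_capital(n):
    return n.text[:1].isupper()


def close(n1, n2, max_gap=2):
    # Assumption: two distinct leaves are "close" if at most max_gap apart.
    return n1 != n2 and abs(n1.position - n2.position) <= max_gap


def implies(a, b):
    return (not a) or b


def ground_zip_formula(nodes, is_zip):
    """Truth of Has5Digits(n) => IsZipCode(n) for every grounding n."""
    return [implies(has_5_digits(n), n in is_zip) for n in nodes]


def ground_name_address_formula(nodes, is_name, is_address):
    """Truth of IsName(n1) and IsAddress(n2) => Close(n1, n2) per pair."""
    return [
        implies(n1 in is_name and n2 in is_address, close(n1, n2))
        for n1, n2 in product(nodes, repeat=2)
    ]


# Hypothetical leaf nodes; the last one sits far from the others.
nodes = [
    Node("Blue Door Cafe", 0),
    Node("12 Main St", 1),
    Node("95014", 2),
    Node("Sign in", 9),
]
name, address, zipcode, other = nodes

# Query predicates are given here as the sets of nodes for which they hold.
zip_truth = ground_zip_formula(nodes, is_zip={zipcode})
pair_truth = ground_name_address_formula(nodes, is_name={name}, is_address={other})
```

With the distant node labeled as the address, exactly one grounding of the second formula (the pair of the name node and that node) is false; every other grounding is true.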

Now, for a web site W, if x is the set of evidence predicates that are true for pages in W, then the probability that the set of query predicates q is true is given by:
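The equation itself is missing from this abbreviated text.  Assuming the standard MLN log-linear form, which is consistent with the definitions of Gi, g(q) and Z given next, it can be written as:

```latex
P(q \mid x) \;=\; \frac{1}{Z} \exp\!\left( \sum_{i} w_i \sum_{g \in G_i} g(q) \right)
```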

where Gi is the set of groundings of Fi, g(q) equals 1 if the grounded formula g is true for the predicate set q and 0 otherwise, and Z is a normalization constant.  Groundings of a formula are obtained by instantiating its variables with web page nodes.
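To make the roles of the groundings, the weights and Z concrete, the following sketch computes the probability by brute-force enumeration on a toy page.  It assumes the usual MLN log-linear form, in which the unnormalized probability of an assignment q is the exponential of the weighted count of true groundings.  The nodes, the single formula, and the weight 2.0 are made up for illustration, and enumeration is feasible only for very small examples.

```python
import math
from itertools import product

# Evidence x: one observed feature per leaf node (toy data).
nodes = ["Blue Door Cafe", "12 Main St", "95014"]
has_5_digits = {n: len(n) == 5 and n.isdigit() for n in nodes}

# Query atoms: one IsZipCode(n) truth value per node.
atoms = [("IsZipCode", n) for n in nodes]


def zip_grounding(n):
    """Grounding of Has5Digits(n) => IsZipCode(n) for one node n."""
    return lambda q: (not has_5_digits[n]) or q[("IsZipCode", n)]


# The MLN as a list of (groundings G_i, weight w_i); the weight is made up.
mln = [([zip_grounding(n) for n in nodes], 2.0)]


def score(q):
    """sum_i w_i * (number of groundings of F_i that are true under q)."""
    return sum(w * sum(g(q) for g in groundings) for groundings, w in mln)


def all_assignments():
    """Every truth assignment to the query atoms."""
    for values in product([False, True], repeat=len(atoms)):
        yield dict(zip(atoms, values))


# Normalization constant: sum of exp(score) over all assignments.
Z = sum(math.exp(score(q)) for q in all_assignments())


def probability(q):
    return math.exp(score(q)) / Z


# Marginal probability that the 5-digit node is labeled as a zip code.
p_zip = sum(
    probability(q) for q in all_assignments() if q[("IsZipCode", "95014")]
)
```

In this toy model only the grounding for the five-digit node depends on q, so p_zip works out to 1 / (1 + exp(-2.0)), roughly 0.88; a larger weight pushes it closer to 1.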

Given an MLN model with (formula, weight) pairs (Fi, wi) and a web site W from the domain's set of web sites, with true evidence predicates x, the query predicates q* are computed such that P(q*|x) is maximum.  Finding the assignment q* that maximizes the probability P(q|x) is equivalent to maximizing the sum of weighted formulas given by:
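The expression is missing from this abbreviated text.  Under the assumption that the probability has the standard MLN log-linear form, Z does not depend on q and the exponential is monotonically increasing, so the maximization reduces to:

```latex
q^{*} \;=\; \arg\max_{q} P(q \mid x) \;=\; \arg\max_{q} \sum_{i} w_i \sum_{g \in G_i} g(q)
```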

After specifying the formulas (i.e. the set of Fi), the weights wi for each formul...