Method and System for extracting structured information from Wikipedia*

IP.com Disclosure Number: IPCOM000212320D
Publication Date: 2011-Nov-07
Document File: 2 page(s) / 29K

Publishing Venue

The IP.com Prior Art Database

Related People

Surya Ganesh Veeravalli: INVENTOR [+2]

Abstract

Disclosed herein is a method and system for extracting structured information from Wikipedia and making the structured information accessible and usable.

This text was extracted from a Microsoft Word document.
This is the abbreviated version, containing approximately 59% of the total text.

Description

A method and system for extracting structured information from Wikipedia and making the structured information accessible and usable is disclosed.

The method and system disclosed herein involve extracting structured data that exists in Wikipedia in the form of nested lists, tables, infoboxes and templates, and making that structured information accessible and usable.  To extract nested lists and tables, domain signals are generated from the Wikipedia article, which is in wiki markup format.  The article is then converted to HTML and an HTML parse tree is created.  The domain signals are then used to cluster the records and to extract the records along with the lists.
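
The list-extraction flow described above can be illustrated with a minimal Python sketch.  The disclosure does not define what a domain signal is or how the clustering is performed, so the choices here are assumptions made purely for illustration: a signal is taken to be any title that is wikilinked inside a list line of the markup, and the "clustering" is a simple grouping of list items by nesting depth and by whether they mention a signal.  The names domain_signals, ListItems and cluster_records are hypothetical.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


def domain_signals(wikitext):
    """Assumed signal: titles wikilinked inside list lines ('*' or '#')."""
    signals = set()
    for line in wikitext.splitlines():
        if line.startswith(("*", "#")):
            # Capture the link target, stopping at ']', '|' or '#'.
            for target in re.findall(r"\[\[([^\]|#]+)", line):
                signals.add(target.strip())
    return signals


class ListItems(HTMLParser):
    """Collects [depth, text] for each <li>, tracking <ul>/<ol> nesting."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.items = []   # one [depth, text] entry per <li>
        self.stack = []   # indices of the <li> elements currently open

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self.depth += 1
        elif tag == "li":
            self.items.append([self.depth, ""])
            self.stack.append(len(self.items) - 1)

    def handle_endtag(self, tag):
        if tag in ("ul", "ol"):
            self.depth -= 1
        elif tag == "li" and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Text belongs to the innermost open <li>, if any.
        if self.stack:
            self.items[self.stack[-1]][1] += data


def cluster_records(html, signals):
    """Group list items by (nesting depth, mentions a signal)."""
    parser = ListItems()
    parser.feed(html)
    clusters = defaultdict(list)
    for depth, text in parser.items:
        text = " ".join(text.split())  # normalize whitespace
        matched = any(signal in text for signal in signals)
        clusters[(depth, matched)].append(text)
    return dict(clusters)


wikitext = "* [[Alpha]] first\n** [[Beta]] second\n* plain"
html = ("<ul><li>Alpha first<ul><li>Beta second</li></ul></li>"
        "<li>plain</li></ul>")
signals = domain_signals(wikitext)
clusters = cluster_records(html, signals)
```

In this toy input, items that mention a signal are separated from the plain item, and nested items are kept apart from top-level ones.  A real system would need a proper wiki-markup-to-HTML converter and a more discriminating notion of record similarity.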

To extract the infoboxes, the Wikipedia article is parsed to identify the Infobox template.  Once the Infobox template is identified, its name-value pairs are extracted.  If a value field contains a Wikipedia template, that template is evaluated and the result replaces the value field.  Thereafter, the Infobox is represented in an XML format and stored.
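
The infobox steps above can be sketched in Python as follows.  This is an illustration under stated assumptions, not the disclosed implementation: the function names (find_infobox, split_top_level, evaluate_templates, infobox_to_xml) and the XML element names (infobox, field) are hypothetical, and template evaluation is reduced to a toy expander for a single template, {{birth date|Y|M|D}}.  Evaluating arbitrary Wikipedia templates would require a full template expansion engine.

```python
import re
import xml.etree.ElementTree as ET


def find_infobox(wikitext):
    """Return the body of the first {{Infobox ...}} template, or None."""
    start = wikitext.find("{{Infobox")
    if start == -1:
        return None
    depth, i = 0, start
    while i < len(wikitext) - 1:
        pair = wikitext[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            i += 2
            if depth == 0:
                # Strip the outer braces of the Infobox template.
                return wikitext[start + 2:i - 2]
        else:
            i += 1
    return None  # unbalanced braces


def split_top_level(body):
    """Split on '|' that is not inside a nested {{...}} or [[...]]."""
    parts, current, depth, i = [], [], 0, 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair in ("{{", "[["):
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("}}", "]]"):
            depth -= 1
            current.append(pair)
            i += 2
        elif body[i] == "|" and depth == 0:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    parts.append("".join(current))
    return parts


def evaluate_templates(value):
    """Toy stand-in for template evaluation (assumption): expands only
    {{birth date|Y|M|D}} into an ISO date string."""
    def expand(match):
        year, month, day = (int(x) for x in match.group(1, 2, 3))
        return f"{year:04d}-{month:02d}-{day:02d}"
    pattern = r"\{\{\s*birth date\s*\|(\d+)\|(\d+)\|(\d+)\}\}"
    return re.sub(pattern, expand, value, flags=re.IGNORECASE)


def infobox_to_xml(wikitext):
    """Extract name-value pairs from the Infobox and serialize them as XML."""
    body = find_infobox(wikitext)
    if body is None:
        return None
    parts = split_top_level(body)
    kind = parts[0][len("Infobox"):].strip()
    root = ET.Element("infobox", {"type": kind})
    for part in parts[1:]:
        name, sep, value = part.partition("=")
        if not sep:
            continue  # positional parameter, skipped in this sketch
        field = ET.SubElement(root, "field", {"name": name.strip()})
        field.text = evaluate_templates(value.strip())
    return ET.tostring(root, encoding="unicode")


sample = ("{{Infobox person\n| name = Ada Example\n"
          "| birth_date = {{birth date|1970|1|2}}\n}}\nArticle text.")
xml_text = infobox_to_xml(sample)
```

For the sample article, the resulting XML holds one field element per name-value pair, with the nested template in birth_date replaced by its evaluated value.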

Thus, the structured information obtained from articles on Wikipedia may be used by search engines to provide a list of entities being searched by a user.  For ex...