Method and System using High-Level Language for Web Knowledge Extraction

IP.com Disclosure Number: IPCOM000214913D
Publication Date: 2012-Feb-13
Document File: 6 page(s) / 80K

Publishing Venue

The IP.com Prior Art Database

Related People

Aravindan Raghuveer: INVENTOR (and 13 others)

Abstract

A method and system using High-Level Language for Web Knowledge Extraction.

This text was extracted from a Microsoft Word document.
At least one non-text object (such as an image or picture) has been suppressed.
This is the abbreviated version, containing approximately 37% of the total text.

Description

Disclosed is a method and system for using a High-Level Language (HLL) for Web Knowledge Extraction.

Currently, to extract structured information from the web (for example, review ratings, store hours of operation, store phone numbers, or hotel photos), extractors must run their own custom workflows, expressed in a custom language or in no particular language.  A knowledge extraction workflow developer who has to extract information from all mortgage provider sites for a mortgage loans hub page, or information on all sports personalities from sports sites, needs to build the solution from scratch.  Developers must therefore concentrate on how to orchestrate the whole workflow and how to describe and communicate it to others.  Typically, the workflow itself is captured only in developers' minds or in documents, so it necessarily requires a custom implementation.  The disclosed method uses an HLL for web knowledge extraction that can be run on an application server, as shown in the figure.  The HLL provides a standard workflow application model for these applications.  The workflow application model may be associated with an uber workflow language that covers both the standalone (off-grid) and the distributed (on-grid, Hadoop-style) parts of the workflow.

Figure

In the example shown in the figure, the workflow defined through the HLL includes three components: a Rule Copier Action, an HDFS Surfacer, and an XSLT Extractor.  The Rule Copier Action is a workflow-specific custom component, whereas the HDFS Surfacer and the XSLT Extractor are library components provided by the application server.  The HLL supports both kinds of components.  The HLL allows every workflow node to describe itself through a metadata specification.  In the metadata specification, the component describes its inputs, outputs, transitions, and configuration parameters.  As the names imply, inputs and outputs specify what the component consumes and produces, respectively.  Transitions specify the different state transition signals that the component can emit.  Finally, configuration parameters fine-tune the component's configuration.
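
The self-describing metadata specification described above can be illustrated with a minimal Python sketch. This is not the disclosed HLL or its schema. The class `ComponentMetadata`, the helper `undeclared_inputs`, the `kind` labels, the transition names `success` and `failure`, and the names `copiedRules` and `surfacedPages` are all hypothetical and introduced only for illustration. Only the component names and the `ruleInfoList` property are taken from the disclosure.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ComponentMetadata:
    """Hypothetical self-description of one workflow node."""
    name: str
    kind: str  # "custom" or "library" (assumed labels)
    inputs: list[str] = field(default_factory=list)
    outputs: list[str] = field(default_factory=list)
    transitions: list[str] = field(default_factory=list)
    config: dict[str, str] = field(default_factory=dict)


def undeclared_inputs(components: list[ComponentMetadata],
                      flow_inputs: set[str]) -> dict[str, list[str]]:
    """Return, per component, the inputs that nothing upstream provides.

    Components are checked in list order. An input counts as satisfied if it
    is a flow-level input or an output of a component earlier in the list.
    """
    available = set(flow_inputs)
    missing: dict[str, list[str]] = {}
    for component in components:
        unmet = [name for name in component.inputs if name not in available]
        if unmet:
            missing[component.name] = unmet
        available.update(component.outputs)
    return missing


# Workflow-specific custom component (name taken from the disclosure).
rule_copier = ComponentMetadata(
    name="RuleCopierAction",
    kind="custom",
    inputs=["ruleInfoList"],
    outputs=["copiedRules"],             # hypothetical output name
    transitions=["success", "failure"],  # assumed signal names
)

# Library component provided by the application server.
surfacer = ComponentMetadata(
    name="HDFSSurfacer",
    kind="library",
    inputs=["copiedRules"],              # hypothetical input name
    outputs=["surfacedPages"],           # hypothetical output name
    transitions=["success", "failure"],  # assumed signal names
)

if __name__ == "__main__":
    print(undeclared_inputs([rule_copier, surfacer], {"ruleInfoList"}))
```

Because each node declares what it consumes and produces, a tool could in principle check a workflow for unmet inputs before running it, under the assumptions stated above.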

Further, a workflow specification written in the high-level language may be as follows:

<?xml version="1.0" encoding="UTF-8" ?>
<process name="SDEExtraction" xmlns:kafe="http://www.yahoo.com/kafe">
  <start>
    <transition to="RuleCopierAction" />
  </start>
  <custom class="com.yahoo.sde.rulecopier.RuleCopierAction" name="RuleCopierAction">
    <!-- Input -->
    <property name="ruleInfoList">
      <string value="evar://kafe.flow.ruleInfoList" />
    </property>
    <property name="outputDir">
      <string value="hdfs://$kafe.grid.basedir$/$kafe.solution$/RuleCopierAction.output/$kafe.execution.timestamp$/$kafe.execution.id$" /...