Method and System for Validating Information Extracted from Documents based on Selective User Feedback

IP.com Disclosure Number: IPCOM000202426D
Publication Date: 2010-Dec-15
Document File: 2 page(s) / 73K

Publishing Venue

The IP.com Prior Art Database

Abstract

A method and system for validating information extracted from documents based on selective user feedback is disclosed. An information extraction system uses annotation models, such as a conditional random field (CRF) model, a rules model, and a dictionary model, which are validated based on the user feedback. The models are reconfigured periodically based on the user feedback for validating information extracted from the documents.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 51% of the total text.

Disclosed is a method and system for validating information extracted from documents based on selective user feedback.

An information extraction system uses annotation models, such as a conditional random field (CRF) model, a rules model, and a dictionary model, which are validated based on user feedback. The models are reconfigured periodically based on the user feedback. A confidence score for each annotation is generated using the user feedback. The extracted information is also used in this periodic reconfiguration of the annotation models for validating information extracted from documents.

Pre-defined rules have confidence scores associated with them. For example, a rule that annotates a ten-digit number as a phone number may have a low confidence score. However, if the documents are from India, the same annotation may have a high confidence score. Certain information extracted from the documents may be identified as an exception and sent to the user for feedback. For example, when an applicant's resume is uploaded, the information extraction module extracts the phone number +11234567890 from the resume. Since the +1 prefix does not match the Indian phone number format, the phone number is marked as an exception and sent to the user for validation.
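
The phone number example can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the names `annotate_phone`, `PhoneAnnotation`, and `EXPECTED_PREFIX`, the numeric confidence values, and the use of an ISO country code as document metadata are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical confidence values; the disclosure does not give numbers.
LOW_CONFIDENCE = 0.3
HIGH_CONFIDENCE = 0.9

# Assumed metadata convention: ISO country code -> expected dialing prefix.
EXPECTED_PREFIX = {"IN": "+91"}

# Optional "+<country code>" followed by exactly ten digits.
PHONE_PATTERN = re.compile(r"(?<!\d)(\+\d{1,3})?(\d{10})(?!\d)")


@dataclass
class PhoneAnnotation:
    value: str
    confidence: float
    is_exception: bool = False  # True means: send to the user for feedback


def annotate_phone(text: str, doc_country: str) -> Optional[PhoneAnnotation]:
    """Annotate the first phone-like number and score it against metadata."""
    match = PHONE_PATTERN.search(text)
    if match is None:
        return None
    prefix = match.group(1)
    expected = EXPECTED_PREFIX.get(doc_country)
    if expected is None:
        # No metadata rule for this country: ten digits alone are weak evidence.
        return PhoneAnnotation(match.group(0), LOW_CONFIDENCE)
    if prefix is None or prefix == expected:
        # The number is consistent with the document's country metadata.
        return PhoneAnnotation(match.group(0), HIGH_CONFIDENCE)
    # The prefix contradicts the metadata: mark as an exception for the user.
    return PhoneAnnotation(match.group(0), LOW_CONFIDENCE, is_exception=True)
```

With these assumptions, `annotate_phone("Phone: +11234567890", "IN")` returns an annotation whose `is_exception` flag is set, while a bare ten-digit number in a document tagged `"IN"` receives the high score.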

In the dictionary model, names are annotated based on a dictionary. A name may be assigned a low confidence score if it does not appear in the dictionary. If the user provides positive feedback for the name annotation, the name is added to the dictionary. Thus, the dictionary is updated to recognize such names in the future.
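
The feedback loop for the dictionary model might look like the following sketch. The class name `DictionaryModel`, its methods, the case-insensitive matching, and the two confidence values are assumptions made for illustration; the disclosure does not specify them.

```python
from typing import Iterable

# Hypothetical confidence values; the disclosure does not give numbers.
LOW_CONFIDENCE = 0.2
HIGH_CONFIDENCE = 0.9


class DictionaryModel:
    """Scores name annotations by dictionary lookup and learns from feedback."""

    def __init__(self, names: Iterable[str]) -> None:
        # Case-insensitive matching is an assumption of this sketch.
        self.names = {name.lower() for name in names}

    def score(self, candidate: str) -> float:
        """Return a high score for known names and a low score otherwise."""
        if candidate.lower() in self.names:
            return HIGH_CONFIDENCE
        return LOW_CONFIDENCE

    def apply_feedback(self, candidate: str, accepted: bool) -> None:
        """Add the name to the dictionary only on positive user feedback."""
        if accepted:
            self.names.add(candidate.lower())
```

A name that first scores low is scored high on later documents once the user has confirmed it, which is the update behavior described above. Negative feedback leaves the dictionary unchanged in this sketch.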

The figure illustrates an information extraction system, which includes a rule engine, a validation module, and a model reconfiguration module. The rule engine uses one or more rules for validating extracted information. The rules are entered by an administrator and specify relations between metadata associated with a document and information extracted from the document. The rule engine checks the one or more rules against the metad...