
Extensible framework for automated extraction of information about identities from unstructured information sources

IP.com Disclosure Number: IPCOM000240102D
Publication Date: 2015-Jan-02
Document File: 7 page(s) / 84K

Publishing Venue

The IP.com Prior Art Database

Abstract

Rule formalization is key. The core idea behind this framework approach is that the rules used to extract identities from unstructured information (free-format text) are formalized in a way that supports the following actions:

A) A computer program called an analysis engine can use existing text analysis technology to extract information about specific identities such as persons or companies.

B) Information can be derived from one sentence or across multiple paragraphs or documents and be linked together.

C) Indirect conclusions about relationships between extracted identities can be drawn.

D) A user can update or enhance the formalized rule definitions on the fly.

To enable this flexibility, the rule definitions are stored in a document that is readable and editable by both humans and computers. The formalized rule definitions use standardized regular expressions to search content for identities and to detect relationships between entities. In other words, these rules define which information to search for and how this information should be stored. The analysis engine interprets the rules and applies them to analyze human-readable sentences that are downloaded from digital sources. These rules can be updated dynamically by a user. The extracted information is added to a structured data source called the knowledge base. A simple report generator can then use this structured information to generate a report about all detected identities, including indirect information described via relationships in the knowledge base. For example: Name: John Doe; Employer: IBM; Company main HQ: Armonk; and so on.



The internet contains an enormous and constantly growing amount of unstructured information. Collecting information about identities such as companies, persons, and locations today typically involves either general-purpose search engines like Google® or specialized search engines like the people search engine Yasni®. However, both approaches are restrictive. With general-purpose search engines, the effort is manual and very time consuming; with specialized search engines, the returned information has a fixed scope. In addition, these search engines are usually static in what they can search, and they usually cannot search private information kept in local content management archives, in databases, or on intranet web sites.

The approach described in this paper takes a different route: an extensible framework that structures information extracted from unstructured digital data sources in a formalized, open, and extendable way, enabling even a subject matter expert without IT development skills to enhance or modify the framework.

In addition to recognizing obvious relationships like "Person A works for Company B", the framework also supports the detection of indirect relationships that span multiple paragraphs, like "Person A started his career in a big IT company. This company was called B.". Furthermore, the framework enables detecting information and relationships across documents, like "Document 1: Person A works for Company B; Document 2: Company B is located in City C -> Person A works for Company B, which is located in City C".
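As a rough illustration of this cross-document linking, the following sketch (a hypothetical simplification; the Fact structure and derive_indirect function are invented for this example and are not taken from the disclosure) stores facts extracted from different documents as subject-relation-object entries in a small knowledge base and chains two direct facts into the indirect conclusion that Person A works for a company located in City C.

from collections import namedtuple

# A fact extracted from a single document: (subject, relation, object, source document)
Fact = namedtuple("Fact", ["subject", "relation", "obj", "source"])

# Facts as they might be extracted from two separate documents
knowledge_base = [
    Fact("Person A", "works_for", "Company B", "Document 1"),
    Fact("Company B", "located_in", "City C", "Document 2"),
]

def derive_indirect(kb):
    """Chain 'works_for' and 'located_in' facts to draw indirect conclusions."""
    derived = []
    for f1 in kb:
        for f2 in kb:
            if f1.relation == "works_for" and f2.relation == "located_in" \
                    and f1.obj == f2.subject:
                derived.append(
                    f"{f1.subject} works for {f1.obj}, which is located in {f2.obj} "
                    f"(sources: {f1.source}, {f2.source})"
                )
    return derived

print(derive_indirect(knowledge_base))
# ['Person A works for Company B, which is located in City C (sources: Document 1, Document 2)']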

Rule formalization is key


The core idea behind this framework approach is that the rules that are used to extract identities from unstructured information (free format text) are formalized in a way that supports the following actions:

A) A computer program called an analysis engine can use existing text analysis technology to extract information about specific identities such as persons or companies.

B) Information can be derived from one sentence or across multiple paragraphs or documents and be linked together.

C) Indirect conclusions about relationships between extracted identities can be drawn.


D) A user can update or enhance the formalized rule definitions on the fly.
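To make actions A) through D) concrete, the following sketch is a minimal, hypothetical illustration (the rule names, fields, and regular expression patterns are assumptions for this example, not the disclosure's actual rule format). The formalized rule definitions are kept in an editable JSON document whose entries pair a standardized regular expression with the relation to store; a small analysis engine interprets the rules and applies them to free-format text. Because the rules live in a plain document, a user can add or change them on the fly without changing the engine.

import json
import re

# Human- and computer-readable rule document.  A subject matter expert can
# edit this text directly; each rule pairs a standardized regular expression
# with the relation it establishes and names the capture groups to store.
RULE_DOCUMENT = """
[
  {"name": "employment",
   "pattern": "(?P<person>[A-Z][a-z]+ [A-Z][a-z]+) works for (?P<company>[A-Z][A-Za-z&]*)",
   "relation": "works_for",
   "subject": "person",
   "object": "company"},
  {"name": "location",
   "pattern": "(?P<company>[A-Z][A-Za-z&]*) is located in (?P<city>[A-Z][a-z]+)",
   "relation": "located_in",
   "subject": "company",
   "object": "city"}
]
"""

def analyze(text, rules):
    """Interpret the rule definitions and apply them to free-format text."""
    extracted = []
    for rule in rules:
        for match in re.finditer(rule["pattern"], text):
            extracted.append((match.group(rule["subject"]),
                              rule["relation"],
                              match.group(rule["object"])))
    return extracted

rules = json.loads(RULE_DOCUMENT)
sentence = "John Doe works for IBM. IBM is located in Armonk."
print(analyze(sentence, rules))
# [('John Doe', 'works_for', 'IBM'), ('IBM', 'located_in', 'Armonk')]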

To enable this flexibility, the rule definitions are stored in a document that is readable and editable by both humans and computers. The formalized rule definitions use standardized regular expressions to search content for identities and to detect relationships between entities. In other words, these rules define which information to search for and how this information should be stored. The analysis engine interprets the rules and applies them to analyze human-readable sentences that are downloaded from digital sources. These rules...
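As described in the abstract, the extracted information is added to a structured knowledge base, and a simple report generator uses it to report on each detected identity, including indirectly related facts. The following sketch continues the hypothetical examples above; the data layout and function name are invented for illustration.

from collections import defaultdict

# Structured knowledge base: identity -> list of (relation, value) pairs.
# The entries below are the kind of output the rule-based extraction and
# the cross-document derivation sketched above would produce.
knowledge_base = defaultdict(list)
knowledge_base["John Doe"].append(("works_for", "IBM"))
knowledge_base["IBM"].append(("located_in", "Armonk"))

def generate_report(kb, identity):
    """Report direct facts about an identity plus facts reachable via relationships."""
    lines = [f"Name: {identity}"]
    for relation, value in kb.get(identity, []):
        lines.append(f"{relation}: {value}")
        # Follow the relationship one step to include indirect information,
        # e.g. the location of the company an employee works for.
        for rel2, val2 in kb.get(value, []):
            lines.append(f"{value} {rel2}: {val2}")
    return "\n".join(lines)

print(generate_report(knowledge_base, "John Doe"))
# Name: John Doe
# works_for: IBM
# IBM located_in: Armonk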