System and method to improve the performance of the candidate list generation process of an Entity Analytics system using in-memory, read-only cache
Publication Date: 2011-Nov-04
The IP.com Prior Art Database
Disclosed here is a system and method that improves the performance of an entity resolution process by speeding up one of its key sub-processes, candidate list generation, using an in-memory, read-only cache. The method maintains a read-only cache that holds information about a set of high-priority entities. The entities to be cached are chosen on the basis of configurable entity priority rules. Mechanisms are also provided to keep the cache up to date, so that the cached entity information stays accurate and reflects the changes made to the entity data by the resolution processes.
When a record is fed into the entity resolution engine, a list of probable matching entities is generated. This process is called candidate list generation. Only after that does a rigorous process of scoring/match-making take place against the identified list of candidates. Candidate list generation is therefore a very important phase of the entity resolution process and should be executed with the best possible accuracy and efficiency.
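The two phases described above can be pictured with a minimal sketch. It is an illustration only, not the disclosed engine: the function names, the exact-match index and the toy scoring rule are all assumptions, and the entity names are made-up sample data.

```python
from collections import defaultdict


def build_index(entities):
    """Map each (attribute, value) pair to the ids of entities that carry it."""
    index = defaultdict(set)
    for entity_id, attributes in entities.items():
        for name, value in attributes.items():
            index[(name, value)].add(entity_id)
    return index


def generate_candidates(record, index):
    """Phase 1: collect every entity sharing at least one attribute value."""
    candidates = set()
    for name, value in record.items():
        candidates |= index.get((name, value), set())
    return candidates


def score(record, attributes):
    """Phase 2 (toy rule, an assumption): fraction of exactly matching attributes."""
    matches = sum(1 for name, value in record.items() if attributes.get(name) == value)
    return matches / len(record)


# Hypothetical entities; only the address and phone values echo the sample record.
entities = {
    1: {"name": "JOHN DOE", "address": "121 MAPLE STREET", "phone": "91 552 54 72"},
    2: {"name": "JANE ROE", "address": "121 MAPLE STREET", "phone": "555 0100"},
    3: {"name": "ALEX POE", "address": "9 OAK AVENUE", "phone": "555 0199"},
}
index = build_index(entities)
record = {"name": "JOHN DOE", "address": "121 MAPLE STREET", "phone": "91 552 54 72"}

# Only the candidates are scored; entity 3 is never examined in phase 2.
candidates = generate_candidates(record, index)
ranked = sorted(candidates, key=lambda eid: score(record, entities[eid]), reverse=True)
```

The point of the sketch is that the expensive scoring step runs only over the candidate set, so the speed and accuracy of phase 1 bound the cost and quality of everything that follows.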
Introduction to entity analytics and related terms:
An Entity is defined as a data structure that uniquely represents a particular person. The entity is associated with a variety of attributes, such as name, address and phone number. Entity Analytics mainly comprises Entity Identification/Building, Entity Resolution and Entity Relationships. More recently, associating transactions with Entities has been added as a further facet.
Entity Analytics is an operational intelligence process, typically powered by an identity resolution engine or middleware stack, whereby organizations can connect disparate data sources with a view to understanding possible identity matches and non-obvious relationships across multiple data silos.
It involves analyzing all of the information relating to individuals and/or entities from multiple sources of data, and then applying likelihood and probability scoring to determine which identities are a match and what, if any, non-obvious relationships exist between those identities.
It thus helps organizations solve business problems related to recognizing the true identity of someone or something ("who is who") and determining the potential value or danger of relationships ("who knows who") among customers, employees, vendors, and other external forces. It also provides immediate and actionable information to help prevent threat, fraud, abuse, and collusion in all industries.
In most popular implementations, these entities are stored in a relational database, which is called the entity database. This database also holds information about the obvious and non-obvious relationships that may exist among the entities it stores.
Existing scheme for the generation of candidate lists during entity resolution
For every incoming record, candidate list generation is performed to determine which entities may be connected to the incoming identity.
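As a rough illustration of such a lookup against a relational entity database, the sketch below runs one query per data element of the incoming record, one after another, and merges the results. The table name `entity_attribute`, its columns and the attribute type labels are hypothetical; the disclosure does not give the actual schema.

```python
import sqlite3

# Hypothetical schema: one row per (entity, attribute type, attribute value).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entity_attribute (entity_id INTEGER, attr_type TEXT, attr_value TEXT)"
)
conn.executemany(
    "INSERT INTO entity_attribute VALUES (?, ?, ?)",
    [
        (1, "PHONE", "91 552 54 72"),
        (1, "ADDRESS", "121 MAPLE STREET"),
        (2, "ADDRESS", "121 MAPLE STREET"),
        (3, "ADDRESS", "9 OAK AVENUE"),
    ],
)


def candidates_by_sequential_queries(conn, record):
    """Run one query per data element, in order, and union the matching entity ids."""
    found = set()
    queries_run = 0
    for attr_type, attr_value in record:
        rows = conn.execute(
            "SELECT entity_id FROM entity_attribute "
            "WHERE attr_type = ? AND attr_value = ?",
            (attr_type, attr_value),
        ).fetchall()
        queries_run += 1
        found.update(entity_id for (entity_id,) in rows)
    return found, queries_run


incoming = [("PHONE", "91 552 54 72"), ("ADDRESS", "121 MAPLE STREET")]
found, queries_run = candidates_by_sequential_queries(conn, incoming)
```

Every data element costs one round trip to the database, so the number of queries grows with the size of the incoming record; this is the cost an in-memory cache of high-priority entities aims to reduce.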
The incoming identity data is XML-based; a sample is shown below, illustrating the data elements it comprises.
91 552 54 72
121 MAPLE STREET
In the existing candidate list generation mechanism, a separate query is run against the entity database for each data element of the incoming record, one after another, to find the candidates so that...