Browse Prior Art Database

Method for Matching Entities by Using Optimal Hashing Schemes

IP.com Disclosure Number: IPCOM000215781D
Publication Date: 2012-Mar-12
Document File: 6 page(s) / 306K

Publishing Venue

The IP.com Prior Art Database

Related People

Vibhor Rastogi: INVENTOR [+4]

Abstract

A method for matching entities by using optimal hashing schemes in order to block attributes of a database individually is disclosed.

This text was extracted from a Microsoft Word document.
At least one non-text object (such as an image or picture) has been suppressed.
This is the abbreviated version, containing approximately 33% of the total text.

Method for Matching Entities by Using Optimal Hashing Schemes

Abstract

A method for matching entities by using optimal hashing schemes in order to block attributes of a database individually is disclosed.

Description

Disclosed is a method for choosing an optimal blocking strategy for an entity matching system by utilizing optimal hashing schemes.  The method utilizes blocking techniques devised for a for a range of matching predicates ranging from simple value-based hashing for exact matches and n-gram based indexes for fuzzy joins to space-partitioning techniques for spatial joins.  The method accordingly attempts to determine an optimal blocking strategy in case of a complex matching function involving predicates.  In general, the determination is performed for an entity matching system having a plurality of rules over several attributes, including fuzzy matches and spatial matches. 

In accordance with the method, determination of the optimal blocking strategy includes defining an abstraction model and blocking strategies.  Thereafter, the method includes utilizing an entity matching function and segmenting the entity matching function into a set of labeled pairs. 

  The entity matcher is defined as a Boolean expression over predicates on attributes.  In a scenario, this is represented as a Disjunctive Normal Form (DNF) formula. 

Typically, an entity matching function is a Boolean predicate on pairs of entities in E, denoted by the function:  assumingE = {e1, e2 …} defines a set of entities.  

Further, a blocking function h denoted by  maps entities to sets of sets of entities.  Each set of entities is a bucket of the blocking function.  Thereafter, a set of blocking functions, referred to as a blocking scheme, is constructed such that each pair of entities that have a match lie in the same bucket of some blocking function.  Assuming,

H = {h1, h2...} is a set of blocking functions, H covers a match function match if:

The total cost of entity matching is defined using a given blocking scheme H as total number of pairs in all the buckets in all blocking functions in H.  In other words, cost of blocking function h is defined as, 

Further, cost of blocking scheme H is defined by:

This abstract formulation enables analyzing blocking schemes for any concrete problem setting.

In order to identify the DNF blocking problem, it is assumed that entities have a set of attributes A = {A1, …, Ak}.  Given an entity e and attribute A, e:A is used to denote the value of the attribute for the entity.  Let match(A) denote a Boolean predicate on pairs of entities that denote a match on the attribute A according to some user-defined function.  For example, a user defined function may be one of, but not limited to, an exact match, a fuzzy match based on string similarity, and a match based on distance.  The entity matching function is given by a predicate  which is a DNF formula over such predicates.

The initial assumption is that matc...