Method for matching entities from multiple tables by training a classification model

IP.com Disclosure Number: IPCOM000236884D
Publication Date: 2014-May-21
Document File: 2 page(s) / 38K

Publishing Venue

The IP.com Prior Art Database

Abstract

Different profiles of an entity are captured and digitized into different tables, and joining multiple databases or tables is a ubiquitous requirement in big data analysis. The join is nontrivial when the foreign key is missing or noisy. Existing solutions try to match records that lack identifiers across tables by hard-matching their attributes. In fact, however, many attributes are not necessarily identical for a pair of records from two tables that refer to the same entity. We propose a training-based approach to decide whether a pair of entities should be matched. The training process captures both strict and relaxed matching behavior from the labeled training data. The advantage is that the domain-specific matching rules are obtained automatically from the data instead of being written out explicitly.

This is the abbreviated version, containing approximately 63% of the total text.

Different profiles of an entity are captured and digitized into different tables, and joining multiple databases or tables is a ubiquitous requirement in big data analysis. The join is nontrivial when the foreign key is missing or noisy. Existing solutions try to match records that lack identifiers across tables by hard-matching their attributes. In fact, however, many attributes are not necessarily identical for a pair of records from two tables that refer to the same entity.

Our disclosure is based on the following two key observations about attributes, which are noisy across different databases and domains:

1) Imprecise recording of attributes is very common in practical applications.

2) Different attributes have different sensitivity to matching metrics. For example, company names can only be matched roughly, whereas for gender the terms male/female are very distinctive.
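
The second observation can be illustrated with a minimal Python sketch. It is not taken from the disclosure: the attribute names (company, gender), the comparator functions, and the use of difflib's SequenceMatcher as the soft string metric are all assumptions chosen for illustration. The idea shown is only that each attribute gets its own comparison function, so that a noisy text field yields a graded similarity while a distinctive field yields a strict 0/1 value.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List


def normalize(value: str) -> str:
    """Lowercase and trim a raw attribute value before comparison."""
    return value.lower().strip()


def string_similarity(a: str, b: str) -> float:
    """Soft similarity in [0, 1], suited to noisy text such as company names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def exact_match(a: str, b: str) -> float:
    """Strict comparison, suited to distinctive values such as male/female."""
    return 1.0 if normalize(a) == normalize(b) else 0.0


# Hypothetical schema: which comparator to apply to which attribute.
ATTRIBUTE_COMPARATORS: Dict[str, Callable[[str, str], float]] = {
    "company": string_similarity,
    "gender": exact_match,
}


def pair_features(rec_a: Dict[str, str], rec_b: Dict[str, str]) -> List[float]:
    """Turn a candidate pair of records into one similarity score per attribute."""
    return [
        compare(rec_a[attr], rec_b[attr])
        for attr, compare in ATTRIBUTE_COMPARATORS.items()
    ]


# Example: the company names differ slightly, the genders differ only in case.
features = pair_features(
    {"company": "IBM Corp.", "gender": "male"},
    {"company": "IBM Corp", "gender": "Male"},
)
```

A feature vector of this kind, one score per attribute, is what a trained model could then weigh, rather than requiring every attribute to be identical.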

We propose a training-based approach to decide whether a pair of entities should be matched. The training process captures both strict and relaxed matching behavior from the labeled training data. The advantage is that the domain-specific matching rules are obtained automatically from the data instead of being written out explicitly.
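
As a rough illustration of how a matching rule might be learned rather than written out, the sketch below trains a small logistic regression on labeled pairs. The disclosure does not name a specific classifier, so logistic regression is an assumption here, as are the two features (a name similarity score and a gender-equality flag) and the tiny hand-made training set, which is invented for illustration and is not data from the disclosure.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(
    X: Sequence[Sequence[float]],
    y: Sequence[int],
    lr: float = 1.0,
    epochs: int = 20000,
) -> Tuple[List[float], float]:
    """Fit weights and bias by full-batch gradient descent on the log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def match_probability(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Estimated probability that the pair described by x is the same entity."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Hypothetical labeled pairs: [name_similarity, gender_equal] -> 1 (match) / 0 (non-match).
X = [
    [0.95, 1.0], [0.70, 1.0], [0.80, 1.0], [1.00, 1.0],               # labeled matches
    [0.90, 0.0], [0.20, 1.0], [0.30, 0.0], [0.75, 0.0], [0.10, 1.0],  # labeled non-matches
]
y = [1, 1, 1, 1, 0, 0, 0, 0, 0]

w, b = train_logistic(X, y)

# Score two unseen pairs with the same name similarity range but different gender agreement.
p_same_gender = match_probability(w, b, [0.85, 1.0])
p_diff_gender = match_probability(w, b, [0.80, 0.0])
```

In this toy setting the labeled examples tolerate an imperfect name similarity when the gender agrees but never accept a gender mismatch, so the fitted model ends up relaxed on one attribute and strict on the other without either rule being coded by hand.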

The steps are as follows:

1) An expert is asked to label n positive training samples which equal to the records that can be...