Browse Prior Art Database

Accelerated discovery of table relationships

IP.com Disclosure Number: IPCOM000244380D
Publication Date: 2015-Dec-08
Document File: 7 page(s) / 54K

Publishing Venue

The IP.com Prior Art Database

Abstract

The article describes a method for fast detection of table relationship.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 22% of the total text.

Page 01 of 7

Accelerated discovery of table relationships

An important but expensive operation frequently done when profiling data is the discovery of relationships between different tables . This function is typically used to reverse engineer an existing schema for which there is no exact model documentation and no primary key/foreign key (PF/FK) relationships in the source data base. Making the problem even worse, vendors of packaged applications like SAP don 't use referential integrity constraints such as PK / FK in the database at all because they want the application to be database vendor independent . Instead, they try to enforce PK / FK and similar integrity constraints on application level . As a consequence, the logical and physical data models which can be retrieved from the database used by a packaged application do not show any PK / FK information.

Due to this, the user then tries to reverse engineer the data model from the data by using a data profiling tool which will search for pairs of columns that present the characteristics of a PK/FK relationship. Ideally the user points the profiling tool to a group of tables or data sets, invoke the function to discover the relationships and the function returns at the end of the analysis a list of probable PK /FK relationship found among the tables.

That function is expensive because finding if a pair of columns could build a PK /FK relationships involves looking at all values of the 2 columns and check if the following conditions are verified:
one of the 2 column contains mostly unique and non null values so that it could be a primary key candidate
the other column contains only values which are found in the first column (inclusion dependency)

Finding an inclusion dependency between two columns is a computing intensive operation because in the worst case it means comparing all values of one column with all values of the second column, which is about the same as computing a full outer join between 2 columns in a database. There are different scientific papers explaining how to make inclusion dependency more efficient (example: Efficiently Detecting Inclusion Dependencies[1]). But even with the state of the art computing inclusion dependencies is an expensive analysis.

In the prior art, finding all relationships among a group of tables would implies :
Analyze the values of all columns of all tables to collect statistics about the values

1.


2.


3. candidates from other tables for which the statistics collected in 1) indicate that there could be some value overlap with the primary key candidate (data types are compatible, min/max of both columns indicate a possible overlap , etc...)

1

contained in them


Look at the statistics collected in 1) and identify which columns are primary key

candidates


For each primary key candidates identified in 2), identify all potential paired column


Page 02 of 7


4.

For each pair of primary key / compatible paired column, run an inclusion dependency to check if all...