Dismiss
InnovationQ will be updated on Sunday, Oct. 22, from 10am ET - noon. You may experience brief service interruptions during that time.
Browse Prior Art Database

A Method to Optimize Join Performance Between Relational Database Tables and External Unstructured Data

IP.com Disclosure Number: IPCOM000236429D
Publication Date: 2014-Apr-25
Document File: 6 page(s) / 87K

Publishing Venue

The IP.com Prior Art Database

Abstract

Disclosed is a method to optimize join performance via the use of a database server that supports the relational model and an associated materialized query table (MQT).

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 40% of the total text.

Page 01 of 6

A Method to Optimize Join Performance Between Relational Database Tables and External Unstructured Data

When analytics must be performed on data externally stored from the central database engine , the external data, stored in flat files or in other formats, is typically accessed by a database engine via a User Defined Function (UDF - a common relational database engine mechanism for providing extensibility via functions that can be evaluated in SQL statements ). The user defined function interfaces with the externally stored unstructured data and is responsible for returning a relational table view of the data .

In a computer cluster, each computer is responsible for a hash-portioned subset of the data. The relational tables thus are hash partitioned on a column or set of columns. Optimal join strategies in distributed relational databases tend to exhibit very little data movement. Joins are executed locally for example. However, due to the purely logical nature of the relational table constructed over the external unstructured data, many common Structured Query Language (SQL) optimization techniques are not possible.

A method is needed to utilize materialization to improve join performance.

The solution is to optimize join performance via the use of a database server that supports the relational model and an associated materialized query table (MQT). The materialized query table, if constructed having the same partitioning key as the key that will be used to join the unstructured and structured data, can be used in place of the UDF. The advantage is the plan then exhibits locality. Of course, there is the duplication of data, and the time required to construct the MQT. However, the MQT is available and re-available, and the join performance will be much better.

An example illustrates the general idea. Assume that the user wants to join external unstructured data projected by the UDF CSVREADER with the table table2.

The basic method is to simply join the two tables. The following statement shows the approach NOT using the MQT, joining a relational table (table2) to a logical view of an external unstructured data set. In this case, the join key is the partitioning key for table2.

Statement:

select tx.col1, tx.col2, tx.col3

from table(CSVREADER('{"filename" : "/home/hotellnx98/stevera/zz.csv"}')) as TX(col1 varchar(20), col2 varchar(20), col3 int) , table2

where tx.col1 =table2.col1

1



Page 02 of 6

Section Code Page = 1208

Estimated Cost = 78.575272

Estimated Cardinality = 460.000000

Coordinator Subsection - Main Processing:

Distribute Subsection #1

| Broadcast to Node List

| | Nodes = 0, 1, 2, 3

Distribute Subsection #2

| Broadcast to Node List

| | Nodes = 0, 1, 2, 3

Access Table Queue ID () = q1 #Columns = 3

Residual Predicate(s)

| Return Data to Application

| | #Columns = 3

Return Data Completion

Subsection #1:

Access Table Queue ID () = q2 #Columns = 1

Residual Predicate(s)

| Process Build Table for Hash Join

Hash Join

| Estima...