Method, Apparatus, and System for Subexpression Cardinality Constraints based Automatic Targeted Query Generation by Pre-processing Correlation Information
Original Publication Date: 2009-Feb-03
Included in the Prior Art Database: 2009-Feb-03
This disclosure introduces a solution (consisting of a method, apparatus, and system) for automatic generating targeted queries. The solution borrows the approximate query processing techniques from data mining arena, and adapts them to fit targeted query generation problem. Essentially, the solution can substitute the parameter values of each local predicates by setting meaningful data values with correlation information, which are aggregated from target test database. However, in order to achieve better performance in very large database scenarios, the required data and correlation information are pre-processed (i.e., sampled, pre-computed, and stored). The relatively large running times and space usage during the pre-processing phase are generally acceptable as a trade-off for performance enhancement during query generation phase. Since at the later phase, we are only able to access a much smaller amount of original data. Additionally, this solution can characterize the overall workload information to favor a relatively high percentage of queries in the workload. In detail, the key points in this disclosure are listed: 1. This disclosure provides an automatic query generation solution that makes the resulting queries satisfy certain subexpression cardinality constraints. 2. This disclosure characterizes the query workload information and extracts the interesting column-groups to support workload scenarios by utilizing frequent itemsets mining principle. 3. This disclosure takes advantage of approximate query processing techniques to pre-process the required data information from the underlying target database. The basic idea is to construct a selected sampling set, pre-compute the data correlation by aggregation, and store the information in materialized views during the pre-processing phase. 4. This disclosure generates queries by looking at the data stored in the materialized views and/or sampling tables, and substitutes the parameters of the input query with real values.
. Background : What is the problem solved by your invention ? Describe known solutions to this problem (
if any). What are the drawbacks of such known solutions , or why is an additional solution required ? Cite any relevant technical documents or references .
1.1. Targeted Query Generation (TQG) Problem
In many test situations, queries for testing new functions of database engines (e.g., optimizer algorithm) and evaluating the performance of database applications (e.g., e-Commerce application) need to be designed very carefully. It is surely necessary to generate desired testing queries in an automated and controlled manner for DBMS-related testing. Give a query Q , a set of target cardinality constraints on intermediate subexpressions, and a test database
, and System for Subexpression Cardinality Constraints based Automatic Targeted Query Generation
and System for Subexpression Cardinality Constraints based Automatic Targeted Query Generation
-processing Correlation Information
processing Correlation Information
, we try to modify Q to generate a new query Qr that satisfies the target cardinality constraints on its intermediate subexpressions. We refer to this problem as the Targeted Query Generation (TQG ) problem . For too long, the task for generating query instances, not randomly, but based on certain constraints, is deemed as very complex and formidable work. This isbecause this problem
is inherently computationally hard. As proven theoretically in the literature , this problem is NP-hard in general. As a result, to date, there are quite few related works in industry, research, and academia.
1.2. Related Work
Microsoft Research firstly investigated the cardinality constrained query generation problem and they formally proved that this problem is generally NP-hard . But their work has an obvious limitation/restriction in that they assume independence between different predicates, which is rarely true in real-world. Our solution can generate targeted queries without any distribution assumption.
QAGen system  considers an inverse problem, to generate a test database given a query. However, if there is already a testing database, QAGen cannot generate the test queries to satisfy certain cardinality constraints.In addition, QAGen attempts to satisfy the test cases exactly, while the performance overheads may be unacceptable when the data volumes are quite large, e.g., gigabytes to terabytes of data. So in that case, the processing ofQAGen is very expensive and intolerably slow.
A novel method, Predicate Parameter Instantiation, is introduced in  for generating testing queries by constructing intermediate result sets iteratively. This method has an advantage: it does not make any independence assumptions between predicates/attribute...