
System and Method for using Very Fast Database Sampling to reduce overhead and time in using production data for performance evaluation purposes

IP.com Disclosure Number: IPCOM000238897D
Publication Date: 2014-Sep-24
Document File: 5 page(s) / 51K

Publishing Venue

The IP.com Prior Art Database

Abstract

With every new release of a product, the implementation of a change request, or a change in the configuration of a deployment, a series of similar tests needs to be carried out to make sure that the core functionality of the system remains intact [1]. Moreover, 60% of total software development costs are devoted to enhancing existing applications, adding or modifying functionality, rather than developing new applications [33]. It is therefore reasonable to expect that in many projects an operational database exists from which sample data can be extracted (e.g. when Web-enabling existing applications). However, these databases generally contain large amounts of data, which are costly to analyze. As databases grow over time, concerns such as scalability, storage space, network and power consumption must be considered. Sampling the available operational data is a potential solution to these challenges, and it provides a realistic testing environment. Database sampling has a long history in computer science, proving its usefulness in numerous scenarios where using the entire database is infeasible because of the complexity of handling large amounts of data. In these situations, a compromise has to be reached in order to analyze the dataset faster, and a subset of the data is generally preferred. However, current practices for sampling a relational database while preserving the integrity of the data in the sample are computationally costly. This disclosure presents a novel, faster approach to database sampling that ensures the resulting database respects the same integrity constraints between data as the original database. The proposed method is automated and receives the sampling rate as input from the user. The proposed system maintains the same test results using a subset of the original database without significantly reducing the performance of the tests.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 19% of the total text.



Disclosed is a system and method for using database sampling to reduce overhead and time when using production datasets for performance evaluation purposes.

Benefits of the proposed system and method

    
As databases grow over time, concerns such as scalability, storage space, network and power consumption must be considered. Some benefits of sampling large databases are: (i) significantly decreased storage space for the testing environment, (ii) reduced administration overhead in managing datasets for the testing environment, and (iii) increased computational efficiency when running tests against a smaller database. Moreover, sampling from the production environment ensures that the sample contains realistic test data, encompassing a variety of scenarios the users actually created, and serving as invaluable input for testing the core functionality of the system under development.

    VFDS (Very Fast Database Sampling) receives as input from the user the sampling rate to apply to the original database.

    The system produces the sample database in a single pass over the entire database and operates in two phases. In the first phase, the system selects a starting table according to that table's impact on the sampling process, then randomly samples tuples from the selected table according to the sampling rate. In the second phase, the system recursively samples the tuples associated (i.e. referencing and referenced) with the tuples already inserted into the sample database.
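The two phases described above can be sketched as follows. This is a simplified, illustrative sketch only, assuming an in-memory schema in which each table is a list of dict rows with an "id" column and foreign keys are given as (child_table, fk_column, parent_table) triples; the exact child-selection policy in phase two is an assumption, not the disclosed implementation.

```python
import random

def sample_database(tables, foreign_keys, rate, start_table):
    sample = {name: [] for name in tables}
    seen = set()  # (table, row id) pairs already placed in the sample

    def insert(table, row):
        if (table, row["id"]) in seen:
            return
        seen.add((table, row["id"]))
        sample[table].append(row)
        # Referenced tuples (parents) always follow, to preserve FK integrity.
        for child, fk_col, parent in foreign_keys:
            if child == table and row.get(fk_col) is not None:
                for prow in tables[parent]:
                    if prow["id"] == row[fk_col]:
                        insert(parent, prow)
        # Referencing tuples (children) are sampled at the same rate
        # (an assumption of this sketch; the abbreviated text does not
        # specify how many associated children are taken).
        for child, fk_col, parent in foreign_keys:
            if parent == table:
                candidates = [c for c in tables[child] if c[fk_col] == row["id"]]
                k = max(1, int(len(candidates) * rate)) if candidates else 0
                for crow in random.sample(candidates, k):
                    insert(child, crow)

    # Phase 1: random sample of the starting table at the user-supplied rate.
    rows = tables[start_table]
    for row in random.sample(rows, max(1, int(len(rows) * rate))):
        insert(start_table, row)  # Phase 2 runs via the recursion above
    return sample
```

Because every parent of a sampled tuple is pulled in unconditionally, the resulting sample never contains a dangling foreign-key reference.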

The system is implemented in the following way:
- The starting table critically impacts the resulting sample database, as the method only samples its directly and indirectly associated tuples. By selecting a table with a high number of associated tuples, the method increases the probability of meeting the space constraints specified by the user. Moreover, the number of tuples in the starting table itself contributes to its impact. Thus, the system selects as the starting table the table with the maximum number of related tuples (i.e. the number of tuples in the starting table together with the number of distinct tuples in the associated tables). For this reason, in contrast with previous approaches that employ a purely top-down approach (e.g. [12,17]), VFDS employs both a top-down and a bottom-up approach depending on the starting table.
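The starting-table heuristic above can be sketched as follows, using the same illustrative in-memory representation assumed earlier (tables as lists of dict rows with an "id" column, foreign keys as (child_table, fk_column, parent_table) triples); the scoring is a plausible reading of "own tuples plus distinct associated tuples", not the disclosed code.

```python
def choose_starting_table(tables, foreign_keys):
    """Pick the table whose own tuple count plus the number of distinct
    associated (referencing and referenced) tuples is largest."""
    best_table, best_score = None, -1
    for name, rows in tables.items():
        related = set()
        for child, fk_col, parent in foreign_keys:
            if child == name:
                # Distinct parent tuples this table references
                related.update((parent, r[fk_col]) for r in rows
                               if r.get(fk_col) is not None)
            if parent == name:
                # Distinct child tuples referencing this table
                ids = {r["id"] for r in rows}
                related.update((child, c["id"]) for c in tables[child]
                               if c[fk_col] in ids)
        score = len(rows) + len(related)
        if score > best_score:
            best_table, best_score = name, score
    return best_table
```

A table that is heavily referenced (or that references many rows) scores highest, so sampling from it reaches the largest share of the database through the recursive phase.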

- Tuple insertion into the sample database is done in two phases: (i) the starting table is filled, and (ii) the other tables are filled with tuples associated with the starting table's tuples. The latter is performed by inserting tuples into the starting table's children and parents. As the sampling function is recursive, executing it with the starting table's children and parents as parameters triggers insertion into the other tables of the database (i.e. starting tab...