
System and Method for Purging Data Records in Large Databases

IP.com Disclosure Number: IPCOM000125486D
Original Publication Date: 2005-Jun-03
Included in the Prior Art Database: 2005-Jun-03
Document File: 5 page(s) / 68K

Publishing Venue

IBM

Abstract

Many backend applications keep a database representing the records being processed by the application, such as a retail order database for an online store or an event database for an event management solution. The nature of such applications requires the following characteristics of the database:

1. Large capacity
2. High throughput for record creation
3. Search capabilities on the recorded data
4. Life-cycle control for the data, such as archiving and purging obsolete data

Aspects 2 and 3 are often affected by aspect 4. It is difficult to find off-peak hours in applications that serve the entire globe, which requires a solution that either archives/purges data as new records arrive or is fast enough to minimize the maintenance window during off-peak hours. The drawback of archiving or purging data during normal hours is the obvious degradation of insertion performance. The drawback of maintenance windows is the unavailability of the system, which may not be acceptable on a regular basis for 24x7 applications. A different solution involves segmenting the data into different sets of tables using a time-based criterion. For instance, one can have one set of tables for each day of the week. Whenever a new record arrives, the data is written to the set of tables for that particular day of the week. Archival and purge can be executed on sets of tables that are not actively being used on that day. Some drawbacks:

- Queries for data (such as "all orders from customer XYZ") must span all sets of tables.
- The number of table sets must be managed.
For instance, if the application switches over to a different set of tables every day, the system administrator must keep track of purging/archiving older tables; otherwise, queries across all sets of tables will become progressively slower. If the application uses a "rolling" mode, where old table sets are reused after a certain period of time (e.g., one set per day of the week), a misstep in the archiving/purging activities may cause newer data to be written to a table set that has not been archived yet. In such a situation, separating the data will require manual intervention (e.g., data from Friday 10/8 being written to the same tables as data from Friday 10/1).
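The segmentation scheme and its main drawback can be sketched in a few lines. This is a minimal illustration, not part of the disclosure; the names (`table_for`, `query_all`) and the in-memory lists standing in for table sets are assumptions for the example.

```python
# Sketch of time-based segmentation: one "table set" per day of the week.
# In-memory lists stand in for the per-day table sets.
import datetime

DAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]
tables = {day: [] for day in DAYS}

def table_for(when):
    # New records are routed to the table set for their day of the week.
    return tables[DAYS[when.weekday()]]

def insert(record, when):
    table_for(when).append(record)

def query_all(predicate):
    # Drawback: a search such as "all orders from customer XYZ" must
    # scan every table set, not just the one being written today.
    return [r for day in DAYS for r in tables[day] if predicate(r)]
```

Note how `query_all` touches all seven sets: the more sets the administrator fails to purge, the more data every query must scan.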

This is the abbreviated version, containing approximately 42% of the total text.



The idea is to use a data separation model where there is "current" data and "old" data. The separation between these sets is defined by a system administrator, not by the application that manages the data.

The application always writes new data to the "current" set of tables but performs queries on both "current" and "old" data.

Archiving and purge can only be executed on "old" data.

The core idea is to separate the data into two conceptual blocks: one for data creation ("current") and another for archival/purge operations ("old").
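The separation model above can be sketched as follows. This is an illustrative sketch only; the class and method names (`DataStore`, `insert`, `query`, `purge_old`) do not appear in the disclosure, and in-memory lists stand in for the two sets of database tables.

```python
# Minimal sketch of the "current"/"old" data separation model.
class DataStore:
    def __init__(self):
        self.current = []   # new records are only ever written here
        self.old = []       # archival/purge operate only on this set

    def insert(self, record):
        # Data creation always targets the "current" set, so insertion
        # performance is unaffected by maintenance activity.
        self.current.append(record)

    def query(self, predicate):
        # Searches span both "current" and "old" data.
        return [r for r in self.current + self.old if predicate(r)]

    def purge_old(self):
        # Purging touches only the "old" set; "current" is untouched.
        self.old.clear()
```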

The advantages are:

1. Performance for data creation is not affected.
2. Performance for queries is affected in a predictable manner (there are always only two sets of tables to be searched).
3. Performance for archiving is optimal, as it is not done concurrently with insertion.
4. Performance for purging is optimal, as the set of "old" data can be deleted with RDBMS utilities instead of SQL DELETE statements.
5. The administrator is in control of the policy for data archival and purging, not the application logic.

System structure

The application has logic that addresses two sets of tables (the word "set" is used to represent the fact that multiple tables may be required to represent the application records) on a Relational Database Management System (RDBMS), such as DB2 or Oracle. One set is called "A", the other "B". During runtime, the application can determine which one is "active" and which one has been "staged" for archiving and purge.

The architecture is depicted in Figure 1.

Figure 1 - System structure for the data management


The "application runtime" is responsible for handling requests for data creation and for data searches.

"A" tables and "B" tables can assume the status of "active" or "staged" during runtime. Both "A" and "B" tables share the same data structure: table names, column names, indexes, and so on.

"Staged" and "active" are mutually exclusive states; in other words, when "A" is holding active data, "B" is holding staged data, and vice versa.

"Active" tables are used to store the new data arriving in the system. The data in the "active" tables is not accessed by archival and purging utilities.

"Staged" tables contain data that can be archived and purged at some point in time. New data is never written to the "staged" tables.
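The active/staged rules above can be captured in a small sketch. The names (`TableSetManager`, `write`, `swap`) are illustrative assumptions, and in-memory lists again stand in for the "A" and "B" table sets; the disclosure itself leaves the implementation to the application runtime.

```python
# Sketch of the A/B table sets with mutually exclusive active/staged roles.
class TableSetManager:
    def __init__(self):
        self.sets = {"A": [], "B": []}  # two structurally identical sets
        self.active = "A"               # the set receiving new data

    @property
    def staged(self):
        # "Active" and "staged" are mutually exclusive: whichever set
        # is not active is staged for archive/purge.
        return "B" if self.active == "A" else "A"

    def write(self, record):
        # New data is never written to the staged set.
        self.sets[self.active].append(record)

    def query(self, predicate):
        # Queries always span both sets, so query cost is predictable.
        return [r for s in self.sets.values() for r in s if predicate(r)]

    def swap(self):
        # Flips the roles: the previously active data becomes staged
        # and is now eligible for archiving and purging.
        self.active = self.staged
```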

The "archiving" utility uses RDBMS native commands to copy the contents of the set of tables ("A" or "B") currently marked as "staged" to another location (tape, for instance).

The "purging" utility uses RDBMS native commands to quickly erase all contents of the "staged" tables. This command typically locks the tables for the brief duration of the operation.
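To illustrate why bulk erasure beats row-by-row deletion, the sketch below uses SQLite's drop-and-recreate as a stand-in for the native utilities the disclosure has in mind (e.g., TRUNCATE TABLE in Oracle, or a DB2 load/replace utility). The table name `orders_b` and the `purge_staged` helper are assumptions for the example, not from the disclosure.

```python
# Purging a "staged" table set in one bulk operation instead of
# per-row SQL DELETE statements. SQLite's DROP + CREATE stands in
# for an RDBMS truncation utility here.
import sqlite3

def purge_staged(conn, table):
    # Dropping and recreating the table discards all rows at once;
    # a SQL DELETE would log and process each row individually.
    conn.execute(f"DROP TABLE {table}")
    conn.execute(f"CREATE TABLE {table} (id INTEGER, payload TEXT)")
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders_b (id INTEGER, payload TEXT)")
conn.executemany("INSERT INTO orders_b VALUES (?, ?)",
                 [(i, "x") for i in range(1000)])
purge_staged(conn, "orders_b")
count = conn.execute("SELECT COUNT(*) FROM orders_b").fetchone()[0]
```

Because new data is never written to the staged set, this brief table lock never blocks the insertion path.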

The "staging" utility sw...