ESA DATA COMPRESSION
Original Publication Date: 2001-May-01
Included in the Prior Art Database: 2003-Jun-18
Previous proposals for DB2's compression support built a dictionary after analyzing every row in the table to be compressed. The cost of building a dictionary this way is prohibitive. Our solution is to build the dictionary from a representative sample of rows.

Random sampling is a known method for picking representative samples, but it raises two problems. First, a control point is needed to decide when to stop sampling: the point at which most of the achievable compression has been obtained without sampling all records in the file. Our solution is to stop when the dictionary becomes full. Second, random sampling requires random I/O, which is expensive or impossible when the data to be loaded resides on a medium that does not support random I/O (e.g. tape). Our solution is to pick samples during a linear scan of the data: samples are first picked with fine granularity, and the granularity is gradually made coarser. We refer to this approach as incremental sampling.

No method was previously known for controlling the granularity of sampling during a linear scan while building the dictionary. We solve this problem with a method based on 'control points' determined by the number of rows needed to fill the dictionary.

Known sampling methods store the samples in a reservoir and discard extra samples if more are picked than are needed. This requires reservoir support, a feature not provided by the current DB2 product. Our incremental sampling method avoids the reservoir by using each sample to build the dictionary as it is picked; a sample is never stored. This, however, introduces a bias toward the beginning of the data, where the sampling granularity is fine. A related problem is that if more samples are picked than are needed to fill the dictionary, a method is needed to prune the dictionary.
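The incremental sampling scheme described above might be sketched as follows. This is an illustrative sketch, not the disclosed implementation: the function name, the doubling schedule, and the placement of control points at multiples of the dictionary-filling row count are all assumptions.

```python
def incremental_sample(rows, dict_capacity, add_to_dictionary):
    """Feed a coarsening sample of `rows` to a dictionary builder.

    Sampling starts at the finest granularity (every row); the skip
    interval doubles at each control point, and control points are
    assumed here to fall every `dict_capacity` picked samples, i.e.
    at multiples of the number of rows needed to fill the dictionary.
    """
    interval = 1                   # pick every row at first (fine granularity)
    next_control = dict_capacity   # first control point
    picked = 0
    for i, row in enumerate(rows):
        if i % interval == 0:
            add_to_dictionary(row)  # sample is consumed immediately, never stored
            picked += 1
            if picked >= next_control:
                interval *= 2               # coarsen the granularity
                next_control += dict_capacity
    return picked
```

Because each picked row goes straight to the dictionary builder, no reservoir is needed, at the cost of the beginning-of-data bias noted above.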
We solve both problems with a method called FLEA, which deletes the first leaf node encountered on a circular chain of leaf nodes in the dictionary, which is stored as a tree of nodes. Because the FLEA algorithm introduces a bias toward the end of the data, it provides a degree of compensation for the sampling bias toward the beginning. Finally, in a previous DB2 design the sampling algorithm could not be made aware of the 'control point' (where the dictionary became full for the first time), because the dictionary builder and the tree-building methods resided in different address spaces and there was no interface between the two to pass the needed messages. We solve this problem with a method that estimates the 'control point' from the expected compression.
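A FLEA-style pruner might be sketched as below. Everything here is a hypothetical illustration under assumptions: the class names and insertion scheme are invented, and a Python deque stands in for the circular leaf chain, with stale entries (nodes that have since gained children) skipped when popped.

```python
from collections import deque

class Node:
    """One dictionary entry: a symbol linked to its parent and children."""
    __slots__ = ("symbol", "parent", "children")
    def __init__(self, symbol, parent):
        self.symbol = symbol
        self.parent = parent
        self.children = {}

class TreeDictionary:
    """Tree-structured dictionary pruned FLEA-style: when full, delete
    the first true leaf encountered on the leaf chain."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.root = Node(None, None)
        self.size = 0          # nodes, excluding the root
        self.leaves = deque()  # chain of (possibly stale) leaf candidates

    def insert(self, string):
        """Add the symbols of `string` as a path from the root."""
        node = self.root
        for sym in string:
            child = node.children.get(sym)
            if child is None:
                if self.size >= self.capacity:
                    self._prune(protect=node)
                child = Node(sym, node)
                node.children[sym] = child
                self.size += 1
                self.leaves.append(child)
            node = child

    def _prune(self, protect):
        """Delete the first live leaf on the chain (FLEA), never touching
        `protect`, the node about to be extended."""
        for _ in range(len(self.leaves)):
            cand = self.leaves.popleft()
            if cand.children or cand.parent is None:
                continue                  # stale entry: no longer a live leaf
            if cand is protect:
                self.leaves.append(cand)  # keep the protected leaf on the chain
                continue
            del cand.parent.children[cand.symbol]  # unlink from the tree
            self.size -= 1
            # The parent may have just become a leaf: putting it back on the
            # chain is what gives FLEA its bias toward later entries.
            if not cand.parent.children and cand.parent is not self.root:
                self.leaves.append(cand.parent)
            cand.parent = None
            return
```

The end-of-data bias arises because early entries, visited first on the chain, are pruned first, while the parents of pruned nodes rejoin the chain behind newer entries.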