Application of multiple linear regression to machine resource estimation in flow-based parallel data processing
Original Publication Date: 2005-Nov-11
Included in the Prior Art Database: 2005-Nov-11
This disclosure describes a debugging tool for estimating machine resources in flow-based parallel data processing. It is often difficult to achieve high performance and scalability for large-volume data, especially when complicated processing logic is involved. The aim of this invention is to predict the CPU utilization and disk space needed for processing any given amount of data, and to provide statistical information that can help users detect bottlenecks in a data flow, identify performance issues, and optimize processing logic. Every flow represents a two-dimensional relationship among its components. Algorithms for defining such a relationship are presented in this article. The mechanism for generating, collecting, storing, and propagating statistical data is described, along with a description of how multiple linear regression is applied to determine statistical correlations.
Introduction
Knowing machine resource utilization is important for improving performance in parallel data processing. Having this information available prior to launching a job on a production machine is even more important, because it can help determine resource allocation and achieve high performance and scalability when processing large-volume data. Users often encounter a situation where a job fails to run to completion due to lack of disk space. It is also typical that a job must start and complete within a time window, for example, eight hours overnight when the production machine is not otherwise in use. If the total CPU utilization can be predicted, the user can decide how many partitions are needed to make sure the job finishes on time. A job can consist of multiple stages, each of which performs a specific task, such as sorting records, merging multiple records into one record, removing duplicate records, etc. Predicting the CPU utilization of each stage in the job also helps the user detect bottlenecks, identify performance issues, and optimize the job design.
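As a back-of-the-envelope illustration of the last point, if the total CPU time of a job can be predicted and the work is assumed to divide evenly across partitions (an idealization that ignores repartitioning and startup overhead; the helper below is hypothetical and not part of the disclosure), the minimum partition count for a given time window could be sketched as:

```python
import math

def min_partitions(predicted_cpu_seconds, window_seconds):
    """Smallest partition count that fits the job in the time window,
    assuming CPU work divides evenly across partitions (an idealization:
    real jobs incur repartitioning and startup overhead)."""
    if predicted_cpu_seconds <= 0:
        return 1
    return max(1, math.ceil(predicted_cpu_seconds / window_seconds))

# e.g. 30 CPU-hours of predicted work into an 8-hour overnight window
print(min_partitions(30 * 3600, 8 * 3600))  # -> 4
```

In practice the user would round up further to leave headroom, since parallel speedup is rarely perfectly linear.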
A performance estimation tool is presented in this article. It can be used to estimate CPU utilization and disk space for every stage in a job on each partition, and to provide statistical information for users to consider when setting up a parallel execution environment.
A job can be represented by a tree graph consisting of stages connected by links. A typical tree graph starts from data sources (input data), ends at data sinks (output data), and may contain intermediate data sinks that release disk space once processing is done. A tree graph also defines a two-dimensional relationship among stages: each stage is related to every other stage both vertically and horizontally. A vertical relation shows the connection between a stage and its upstream and downstream neighbors. A horizontal relation divides stages into groups based on the order in which they can start to read and write data. Vertical relations allow input data to be propagated from a data source to a data sink. Horizontal relations can be used to determine the maximum intermediate disk space needed for a job.
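To make the horizontal grouping concrete, the sketch below computes such groups from a list of links by assigning each stage a level one greater than the maximum level of its upstream neighbors (a minimal illustration under that assumed levelling rule; the function name and the sample stage names are hypothetical, not taken from the disclosure):

```python
from collections import defaultdict

def horizontal_groups(links):
    """Group stages by the order in which they can start reading and
    writing data: data sources form group 0, and every other stage
    joins the group one past its latest upstream neighbor."""
    downstream = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for u, v in links:
        downstream[u].append(v)
        indegree[v] += 1
        nodes.update((u, v))
    level = {n: 0 for n in nodes if indegree[n] == 0}  # data sources
    frontier = list(level)
    while frontier:
        u = frontier.pop()
        for v in downstream[u]:
            level[v] = max(level.get(v, 0), level[u] + 1)
            indegree[v] -= 1
            if indegree[v] == 0:  # all upstream levels are now final
                frontier.append(v)
    groups = defaultdict(list)
    for n, lvl in level.items():
        groups[lvl].append(n)
    return dict(groups)

# A small flow: two sources, one sorted, then merged into a sink.
links = [("src_1", "sort"), ("sort", "merge"),
         ("src_2", "merge"), ("merge", "sink")]
print(horizontal_groups(links))
```

Summing the intermediate disk usage of the stages active in overlapping groups would then give an upper bound on the peak intermediate disk space for the job.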
An example job is illustrated in Fig. 1. In this job, data from two input sources (a database table and a flat file) is merged, transformed, and joined before being written to output. Because both the "merge" and the "join" stages require pre-sorted data, a sort stage is inserted on each input link. Each sort stage needs disk space to save intermediate results if the data volume is too large to fit into a pre-allocated memory block.
Fig. 1 Tree Graph Representation of an Example Job
Multiple linear regression is applied to build statistical relations between input data and resource usage for each stage. The generalized form for multiple linear regression can be written as: