
An extensible, scalable, optimized multithreaded data loading framework for software applications

IP.com Disclosure Number: IPCOM000240374D
Publication Date: 2015-Jan-28
Document File: 7 page(s) / 240K

Publishing Venue

The IP.com Prior Art Database

Abstract

Applications are often required to load large amounts of data, and these loading tasks must often happen while the application is executing its main business tasks. The loader process is frequently expected to perform transformation as well, usually via Extract-Transform-Load (ETL) tools, which are not part of the application, offer little reuse, and tend to be dependent on their environment, making them fragile. There is a middle ground where extraction and transformation are not very complex and the loader process is an integral part of an application, triggered by an event (external or internal): in short, lightweight, easy to integrate, and with good performance characteristics. Though it is usual practice to spawn parallel threads, approaches such as assigning a dedicated machine for loading have limitations: too many threads may be spawned, leading to contention and ultimately to performance degradation. This proposal describes a method of dynamically allocating available resources to achieve optimal large-scale data loading in a cloud environment. The application runs on a number of identical Nodes. An Event arrives at a Node, initiating a data loading process which reads from a Data Source and commits to the Application Storage. The Loader reads data in paging mode, sending each item, after minimal pre-processing, to a Router, which in turn decides whether to process it internally (via the Internal Queue) or to append it to the EMB Queue when the Internal Queue already holds many items. Items from the Internal Queue are read by Processing Storer (PS) instances, each running in its own execution thread, and, after processing, are sent to the Application Storage. The EMB Queue is used as a buffer zone holding items to be processed by all nodes. We present an algorithm for calculating the optimal number of Processing Storer threads per application node, thus minimizing the overall load time for a large dataset.
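The routing decision described above can be sketched in Java as follows. All class and field names here are illustrative rather than taken from the disclosure, and the simple size threshold stands in for whatever "many items" policy the Router actually applies:

```java
import java.util.Queue;

/**
 * Illustrative sketch of the Router component: items are kept on the
 * node's Internal Queue while it is below a threshold, and overflow is
 * appended to the shared EMB Queue so that other nodes can pick up the
 * work. The threshold value and Queue types are assumptions.
 */
public class Router<T> {
    private final Queue<T> internalQueue; // drained by local Processing Storer threads
    private final Queue<T> embQueue;      // buffer zone shared by all nodes
    private final int internalThreshold;  // assumed per-node tuning parameter

    public Router(Queue<T> internalQueue, Queue<T> embQueue, int internalThreshold) {
        this.internalQueue = internalQueue;
        this.embQueue = embQueue;
        this.internalThreshold = internalThreshold;
    }

    /** Route one pre-processed item: keep it local if capacity allows, else buffer it on the EMB Queue. */
    public void route(T item) {
        if (internalQueue.size() < internalThreshold) {
            internalQueue.add(item);
        } else {
            embQueue.add(item);
        }
    }
}
```

In a real deployment the queues would typically be concurrent (for example `java.util.concurrent.BlockingQueue` implementations), since multiple Processing Storer threads drain the Internal Queue while the Loader feeds the Router.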

This is the abbreviated version, containing approximately 20% of the total text.


Applications are very often required to load large amounts of data, for example when loading daily data extracts or initializing an application upon start-up. These activities usually take place either in the background, whilst the application continues to serve short transactions (either client- or JMS-driven), or alternatively during a batch-window period. The latter option is becoming increasingly unattractive in a global economy where applications need to remain fully active 24x7; even with the 'quiet' time available, the batch window is usually quite short. In the former case, the loading is usually required to complete within a certain period of time, as it is often part of a complex business case. There is an additional complication in that there could be multiple data sources, not always synchronized, so there could be long periods of inactivity interrupted by bursts of data loading.

So the task is to shorten the loading period as much as possible.

    Additionally, more often than not, the loader process is expected to perform data transformation. There can be complex cases, with multiple correlated data sources, records to cross-check, and so on. These tasks are usually performed by specialized tools aptly named Extract-Transform-Load (ETL) tools, which are highly tuned to do complex tasks in the shortest time. Such tools tend to be quite large and are effectively not an integral part of an application but an additional component. The typical implementation is a custom solution, highly specialised towards the application model. These solutions tend to be highly dependent on an expected environment and are therefore fragile in implementation; there is also a low level of reuse between applications. Use of these tools is not always possible, for various reasons (cost, licensing, size, integration, etc.).

    There is a middle ground where extraction and transformation are not very complex and the loader process as a whole is seen as an integral part of an application, triggered by an event (external or internal). In short, it should be lightweight, easy to integrate, and have good performance characteristics.

    It is usual practice to spawn parallel threads in those scenarios where the sequence in which data points are loaded is unimportant, i.e. there is no cross-dependence between data items. However, the applicability of parallel processing might be restricted by the availability of threads in the application's environment. In addition, the J2EE specification does not allow for thread creation in some circumstances. Thankfully, J2EE 7 supports JSR 236, allowing a container to provide a pool of managed threads for the application's use. However, this facility may not be available in all circumstances, or the pool could be too small. There is also a natural limit on how many threads can be spawned in parallel, reflecting the number of available processor cores. Even on a...
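The thread-limit considerations above can be illustrated with a small Java SE sketch. The class and method names are illustrative; in a J2EE 7 container a JSR 236 ManagedExecutorService provided by the container would be used instead of a directly created pool, and bounding the pool by the core count is an assumption for illustration, not the disclosure's tuning algorithm:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch: bound the number of parallel worker threads by the
// number of available processor cores, since spawning more than that only
// adds contention. In a managed environment, replace the fixed pool with a
// container-provided JSR 236 ManagedExecutorService.
public class BoundedLoaderPool {

    /** Process all items in parallel on a core-bounded pool; returns the count of items stored. */
    public static int drain(Iterable<String> items) {
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        AtomicInteger stored = new AtomicInteger();
        for (String item : items) {
            // Stand-in for the real process-and-store work of one item.
            pool.submit(stored::incrementAndGet);
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return stored.get();
    }
}
```

Because the item order is unimportant (no cross-dependence between items), the tasks can complete in any order without affecting correctness.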