Browse Prior Art Database

Stream Merge/Multi File Join

IP.com Disclosure Number: IPCOM000198657D
Publication Date: 2010-Aug-11
Document File: 3 page(s) / 31K

Publishing Venue

The IP.com Prior Art Database

Abstract

Disclosed is a stream merge algorithm that uses tuple generators to feed it. These tuple generators are responsible for bounding threads used, generating events on changes, and buffering tuples. This allows producers to the stream merge to only be responsible for feeding tuples and managing resources based on the events received. The implementation is robust, highly scalable, and fail-safe.

This text was extracted from a PDF file.
At least one non-text object (such as an image or picture) has been suppressed.
This is the abbreviated version, containing approximately 52% of the total text.

Page 1 of 3

Stream Merge/Multi File Join

When a product has a generic mediation platform that can accept input with temporal information, the input can be in any format and transferred by several means. This means data -- potentially thousands of files containing millions of records in each -- can be split across multiple files and be in any order. Allowing producers for each individual files to run concurrently does not scale; high thread switching occurs and too many files concurrently open. As a result, systems waste resources and are subject to the limits of the operating systems. An improved method must be efficient and fail-safe.

This presents a challenge to handle the various ways that this data can be combined to transform into a normalized temporally-ordered data stream.

Current known solutions do not exhibit:

• Efficient use of file/external resources

• Efficient use of memory

• Low thread switching overhead

• Efficient data pump with very low probability of starvation

• Fail-safety

In the current process, the first stage is gathering the data and grouping it together in blocks of periods. The second stage is parsing, sorting, and merging the data. The final stage is outputting the normalized data.

The second stage is where the bottleneck for performance occurs. It's easy to let resources (memory, file handles, network throughput) thrash, or worse, be exhausted with a naïve solution. The implementation described here eases the burden of stage two by grouping data together by time periods that are set by the user for granularity and latency.

Even with the first stage, however, a naïve approach can still require too many resources and deadlock the system. With this invention, a series of buffers loads data and sends events when the buffers are exhausted or full. This allows efficient management of the resources. Threads for each input are bound to a maximum number allowed. This keeps many data inputs from thrashing.

The solution breaks down each problem area into its own set of objects and configures them so they can communicate and effectively cooperate. A buffering method improves...