
Approximate Motif Discovery in Data Streams using Memory-Efficient TeraScale Quotienting Table.

IP.com Disclosure Number: IPCOM000235820D
Publication Date: 2014-Mar-26
Document File: 2 page(s) / 24K

Publishing Venue

The IP.com Prior Art Database

Abstract

The unparalleled growth and popularity of the Internet, coupled with the advent of diverse modern applications such as search engines, on-line transactions, and climate warning systems, has driven an unprecedented expansion in the volume of data stored world-wide. Efficient storage, management, and processing of such massive data has emerged as a central research theme. Detecting and removing redundancies and duplicates in real time from such multi-trillion record sets, to bolster resource and compute efficiency, constitutes a challenging area of study. The infeasibility of storing the entirety of a potentially unbounded data stream, combined with the need for precise elimination of duplicates, calls for intelligent approximate duplicate detection algorithms. The literature hosts numerous works based on the well-known probabilistic bitmap structure, the Bloom Filter, and its variants. In this work, we propose a new data structure, the \emph{Streaming Quotient Filter} (SQF), for efficient detection and removal of duplicates in data streams. SQF intelligently stores the \emph{signatures} of elements arriving on a data stream and, along with an eviction policy, provides \emph{near zero} false positive and false negative rates. We show that the near-optimal performance of SQF is achieved with a very low memory requirement, making it ideal for real-time memory-efficient de-duplication applications with extremely low false positive and false negative tolerances. We present a detailed theoretical analysis of the working of SQF, providing a guarantee on its performance. Empirically, we compare SQF to alternate methods and show that it is superior in memory and accuracy to existing solutions. We also discuss \emph{Dynamic SQF} for evolving streams and a parallel implementation of SQF.
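The quotienting-with-signatures idea sketched in the abstract can be illustrated as follows. This is only an illustrative sketch, not the disclosure's exact SQF: the hash function, bit widths, bucket capacity, and the FIFO eviction policy are all assumed parameters chosen for the example.

```python
import hashlib
from collections import deque

class QuotientSignatureTable:
    """Illustrative quotienting table: the high bits of an element's hash
    (the quotient) select a bucket, and only the low bits (the remainder)
    are stored as a compact signature. A bounded deque per bucket evicts
    the oldest signatures, trading a small false-negative rate for
    constant memory. All parameters here are illustrative assumptions."""

    def __init__(self, quotient_bits=16, remainder_bits=16, bucket_cap=4):
        self.q_bits = quotient_bits
        self.r_mask = (1 << remainder_bits) - 1
        # One bounded bucket per quotient value; maxlen gives FIFO eviction.
        self.buckets = [deque(maxlen=bucket_cap)
                        for _ in range(1 << quotient_bits)]

    def _split(self, item):
        # Derive a 64-bit hash, then split it into (quotient, remainder).
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        return h >> (64 - self.q_bits), h & self.r_mask

    def seen_before(self, item):
        """Return True if item looks like a duplicate; record it otherwise."""
        q, r = self._split(item)
        if r in self.buckets[q]:
            return True
        self.buckets[q].append(r)
        return False
```

Because only a short remainder is stored per element rather than the element itself, memory stays bounded regardless of stream length; eviction is what introduces the (small) possibility of false negatives.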

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 51% of the total text.


The tremendous spurt in the amount of data generated across varied application domains such as information retrieval, tele-communications, on-line transaction records, on-line climate prediction systems, and virus databases, to name a few, has turned data-intensive computing into a major area of study in both industry and the research community. Managing and processing such enormous data constitutes a Herculean task, further compounded by the presence of duplicate or redundant data, which wastes precious storage space and degrades compute efficiency. Removing duplicates from such data sources, to improve application performance at real-time processing rates of 1 Gbps or higher, poses an interesting problem in this context. This work deals with real-time detection and elimination of duplicates in streaming environments, to alleviate the computational prohibitiveness of processing such huge data sources. A record arriving on the data stream is deemed redundant, or a duplicate, if it has already occurred previously on the stream. Formally, this is referred to as the \emph{data de-duplication}, \emph{data redundancy removal} (DRR), or \emph{intelligent compression} problem, and the terms are used interchangeably in this work.
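The problem statement above ("a record is a duplicate if it has already occurred on the stream") corresponds to the following exact filter, shown here only to fix the definition; it is not the disclosure's method, and its unbounded `seen` set is precisely what makes exact de-duplication infeasible on infinite streams.

```python
def deduplicate(stream):
    """Exact data de-duplication (DRR): yield each record the first time
    it appears on the stream, suppressing all later occurrences. Memory
    grows with the number of distinct records, which is why exact
    filtering breaks down for unbounded streams."""
    seen = set()
    for record in stream:
        if record not in seen:
            seen.add(record)
            yield record
```

For example, `list(deduplicate(["a", "b", "a", "c", "b"]))` yields `["a", "b", "c"]`.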

Eliminating duplicates is an important operation in traditional query processing and Data Stream Management Systems, and several algorithms in this context, such as approximate frequency moments, element classification, and correlated aggregate queries, have been studied. The real-time nature of the de-duplication problem demands efficient in-memory algorithms, but the inability to store the whole (possibly infinite) stream makes exact duplicate detection infeasible in streaming scenarios. Hence, in most cases a fast approach that trades some accuracy, within a tolerable error rate, is acceptable. In this work, we propose a new and efficient algorithm for approximate duplicate detection in data streams.
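The classical baseline for this accuracy-for-memory trade-off is the Bloom Filter mentioned in the abstract. A minimal sketch follows; the bit-array size, hash count, and hashing scheme are illustrative assumptions, not parameters from the disclosure.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: each item sets k bit positions in an m-bit
    array. Membership queries can return false positives (a unique record
    wrongly flagged as a duplicate) but never false negatives, so no true
    duplicate is ever missed. Sizes here are illustrative."""

    def __init__(self, m_bits=1 << 20, k_hashes=4):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item):
        # Derive k bit positions from k salted hashes of the item.
        for i in range(self.k):
            d = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(d[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))
```

The tolerable error rate discussed above is governed by the ratio of bits to inserted items and the number of hashes: larger m or better-tuned k lowers the false positive probability at the cost of memory.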

Consider a large tele-communication network generating call data records (CDRs) that capture vital information such as the caller number, callee number, duration of the call, etc. Errors in the CDR generation mechanisms might lead to duplicate data being generated. For downstream compute efficiency, before the CDRs are stored in the central repository, de-duplication needs to be performed periodically over around 5 billion multi-dimensional recor...