Framework for Stream De-duplication using Biased Reservoir Sampling
Publication Date: 2012-Mar-31
The IP.com Prior Art Database
This work demonstrates a novel Reservoir Sampling based Bloom Filter,(RSBF) data structure, based on the combined concepts of reservoir sampling and Bloom filters for approximate detection of duplicates in evolving data streams. It shows that RSBF offers the currently lowest False Negative Rate (FNR) and convergence rates, and are better than those of Stable Bloom Filter (SBF) while using the same memory. Using empirical analysis on varied datasets, it exhibits upto 2x improvement in FNR with better convergence rates as compared to SBF, while exhibiting comparable False Positive Rate (FPR).
Page 01 of 6
Framework for Stream De -
Data intensive computing has evolved into a centraltheme in the research community and the industry. There has been a tremendous spurt in the amount of data being generated across diverse application domains such as IR, telecommunication (call data records), online transaction records, web pages, medical records, virus databases and climate warning systems to name a few. Processing such enormous data is computationally prohibitive, and is further compounded by the presence of duplicates and redundant data, wasting precious compute time. Removing redundancy in the data helps in improving resource utilization and compute efficiency especially in the context of stream data, which generally requires real-time processing at 1 GB/s or higher. This work considers the problem of real-time elimination of redundant records present in large streaming datasets. A record may be considered redundant, if it had arrived previously in the stream. Formally, this is referred to as the data deduplication or intelligent compression problem. Data redundancy removal (DRR) and deduplication are used interchangeably in this report.
Consider, for example, a large nation wide telecommunication network, where each call generates call data records (CDRs). Each CDR contains details about a particular call such as the calling number, the called number and so forth. Due to errors in CDR generation, multiple copies of a CDR may get generated. Before storing these CDRs in a central data center, one needs to perform deduplication over around 5 billion CDRs with real-time performance. Solutions involving database accesses as in traditional systems are prohibitively slow. Since algorithms involving typical Bloom filters such as  are extremely resource intensive with huge memory requirements (20GB or higher for 6B CDRs at FPR = 1e-5), applications have to resort to disk based Bloom filer data structures at the expense of reduced performance. Hence, there is a strong need for deduplication algorithms that work in-memory or with reason-able memory, have real-time performance and also have low FPR, FNR and better convergence rates. This poses a very challenging problem.
Search engines regularly crawl the Web to update their corpus of webpages. Given the list of URLs extracted from the content of a crawled page, a search engine must probe its archive to determine if the URL is already present in its collection, and if re-crawling of the URL can be avoided . This involves duplicate detection, which in practice may be imprecise but is indispensable given the number of webpages present in the Internet. The consequence of an imprecise duplicate detection is that some already-crawled pages will be crawled again (caused by FNR), or some new URLs which should be crawled are missed (caused by FPR). Here, a high FNR might lead to severe performance degradation, while a relatively high FPR results in new pages being ignored leading...