Dismiss
InnovationQ will be updated on Sunday, Oct. 22, from 10am ET - noon. You may experience brief service interruptions during that time.
Browse Prior Art Database

Query Processing with Composite Data Sketches

IP.com Disclosure Number: IPCOM000236120D
Publication Date: 2014-Apr-07
Document File: 3 page(s) / 113K

Publishing Venue

The IP.com Prior Art Database

Abstract

Disclosed are a method and system for query optimization on time-series data. For indexing time-series data, the method and system use multiple data representations to provide a trade-off between errors in the query results vs. the response time of the query (accuracy vs. response time).

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 52% of the total text.

Page 01 of 3

Query Processing with Composite Data Sketches

This disclosure relates to query optimization on time-series data. Specifically, given time series data spanning a large time-window and multiple entities (e.g., Domain Name Service (DNS) queries made by hosts in an enterprise over time in which hosts request addresses of internet domains), assume the user wants to compute histogram/time-series metrics on a dataset. For example, the user wants to determine the entropy over the set of domains that are queried by a given collection of hosts during a time window, and consider the domain-wise histogram of queries for each host to determine the histogram distance (e.g., Euclidean distance) between the queries issued by a set of hosts (S1) during the time window (T1) and a set of hosts (S2) during the time window (T2). Both inquiries are examples of multi-resolution time-series queries.

A method is needed to efficiently answer these queries .

The novel contribution is way of indexing time-series data (called FAQ) that uses multiple data representations to provide a trade-off between errors in the query results vs. the response time of the query (accuracy vs. response time).

The underlying techniques of the novel method and system are based on exploiting the data-sparsity and temporal consistency of large time-series datasets. The method uses a novel combination of dimensionality reduction, temporal compression and aggregation, and quantization to create the multiple representations with different characteristics. Previous research focused on utilizing opportunities across only one of the above dimensions.

FAQ consists of several representations of the base data . Each representation has a different characteristic in terms of the accuracy with which it can answer a query and the time it will take for the query evaluation. The objective of the FAQ is to answer a query in the least time while still satisfying the error bound with a high (pre-determined) confidence.

Different representations of the base data are created using three core underlying schemes: (a) Compression across dimensions (sketches), (b) Aggregation across time (DeltaGraph), (c) Throwing away low valued coefficients (Quantization). Optimized data representations can be achieved through various combinations of the three given representations.

Figure 1: Representations of the base data

1


Page 02 of 3

Data Representations


• Sketches: A sketch is a succinct representation of a data vector such that while it may not reflect the value of the original data item , but it still preserves certain properties. Examples of sketches include Bloom Filters, Count-Min, Min-wise hash.


• Delta-Graph: A Delta-Graph is a hierarchical structure that compactly stores the history of time-evolving objects, optimized for snapshot retrieval. It uses a "delta based" approach of storing differences...