A Systemic Design for Random Sampling

IP.com Disclosure Number: IPCOM000241381D
Publication Date: 2015-Apr-21
Document File: 5 page(s) / 93K

Publishing Venue

The IP.com Prior Art Database

Abstract

Disclosed is a query system designed to support random sampling. A new body of techniques allows the implementation of a system that supports sampling as its primary approach to data handling.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 34% of the total text.

This disclosure covers a particular random sampling paradigm and its use in a number of application areas. Text analytics solutions provide "needle in the haystack" capabilities, such as search, which return a small list of documents matching some query, and "broad aggregation" capabilities, such as network visualization, which return results derived from many documents. To respond to queries of the latter type, it may be necessary to consider a huge number of documents. Practitioners often use random sampling to choose a small set of representative documents that may be used in place of the large collection, with minimal impact on accuracy.

The novel contribution is a system designed to support random sampling.

Sampling applications typically require the creation of aggregate statistical results drawn from many documents in response to a user query. The framework for such problems is as follows. Each operation is based on a subset of the documents, as specified by a powerful query over a very rich vocabulary. Documents either match or do not match the query, although extensions exist in which documents may partially or probabilistically match. Each document in the subset has certain metadata associated with it that must be provided in order to compute the aggregate information (e.g., the presence or absence of mentions of a particular drug on a page near a particular condition). The aggregate information may be computed from a random sample of the subset with the corresponding metadata. This is simply a statement that the type of question being asked is amenable to a sample-based response; by contrast, questions such as "are there any documents in the subset that have a certain property" are hard to answer in the negative without looking at all potential documents. Finally, the data is large enough that a reasonable query response time rules out full examination of the entire subset; otherwise random sampling is not required.
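
As an illustration of a sample-based response, the following minimal Python sketch estimates the fraction of matching documents that carry some property, using a uniform random sample instead of the full subset. The function name, the `has_property` callback, and the normal-approximation error margin are assumptions made for this example; the disclosure does not prescribe a particular estimator.

```python
import math
import random


def estimate_proportion(matching_docs, has_property, sample_size, seed=0):
    """Estimate the share of matching documents that have a property.

    matching_docs: sequence of documents (or IDs) that match the query.
    has_property:  hypothetical callback returning True/False per document.
    Returns (estimate, margin), where margin is the half-width of an
    approximate 95% confidence interval (normal approximation).
    """
    if not matching_docs:
        raise ValueError("no matching documents to sample from")

    rng = random.Random(seed)
    # Uniform sample without replacement, capped at the subset size.
    sample = rng.sample(matching_docs, min(sample_size, len(matching_docs)))

    n = len(sample)
    hits = sum(1 for doc in sample if has_property(doc))
    p = hits / n
    margin = 1.96 * math.sqrt(p * (1 - p) / n)
    return p, margin
```

The margin shrinks roughly with the square root of the sample size and does not depend on the size of the full subset, which is why a small sample can stand in for a very large collection when the question is an aggregate one.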

This solution covers a system designed to efficiently support such queries, thus allowing the queries to be executed over web-scale corpora. A new body of techniques allows the implementation of a system that supports sampling as its primary approach to data handling. These techniques include:

1. When data objects are initially ingested, the system assigns each an identifier in a predefined namespace. The mechanism that assigns this identifier is assumed to be a random hash function.


2. No dependence upon a distributed architecture


3. No concern for data access in the store; this solution only uses the index


4. The indexer maintains all posting lists in ID order


5. The query language allows a large number of selection operators that respect and preserve this random ordering


6. Applications can ingest results in this random order and can terminate the associated requests when sufficient data has been received to develop a statistically significant answe...
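
The techniques listed above can be sketched together in a few lines of Python. This is only an illustration under stated assumptions, not the disclosed implementation: `doc_id` uses a truncated SHA-256 digest as a stand-in for the random hash function, the index is an in-memory dictionary, `intersect` stands for a single order-preserving selection operator (AND), and the stopping rule in `sample_until_confident` uses an Agresti-Coull style adjusted interval chosen for this example. All function names are hypothetical.

```python
import hashlib
import math


def doc_id(key):
    """Assign a pseudo-random 64-bit identifier by hashing the document key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def build_index(docs):
    """Map each term to a posting list of document IDs in ascending ID order.

    docs: dict of document key -> iterable of terms.
    """
    index = {}
    for key, terms in docs.items():
        for term in set(terms):
            index.setdefault(term, []).append(doc_id(key))
    for postings in index.values():
        postings.sort()
    return index


def intersect(a, b):
    """Lazily yield IDs found in both sorted inputs, preserving ID order."""
    a, b = iter(a), iter(b)
    x, y = next(a, None), next(b, None)
    while x is not None and y is not None:
        if x == y:
            yield x
            x, y = next(a, None), next(b, None)
        elif x < y:
            x = next(a, None)
        else:
            y = next(b, None)


def sample_until_confident(id_stream, has_property, max_margin=0.05):
    """Consume results in ID order and stop once the estimate is tight enough.

    Because IDs are hash-random, a prefix of the ID-ordered stream is treated
    as a random sample of the matches. Returns (estimate, margin, consumed).
    """
    n = hits = 0
    margin = float("inf")
    for d in id_stream:
        n += 1
        if has_property(d):
            hits += 1
        # Adjusted proportion avoids a zero-width interval when hits is 0 or n.
        p_adj = (hits + 2) / (n + 4)
        margin = 1.96 * math.sqrt(p_adj * (1 - p_adj) / (n + 4))
        if margin <= max_margin:
            break  # early termination: the rest of the stream is never read
    return (hits / n if n else 0.0), margin, n
```

Because `intersect` is a generator, stopping the loop in `sample_until_confident` also stops the merge of the posting lists, so the cost of a query in this sketch tracks the sample size needed rather than the number of matching documents.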