Browse Prior Art Database

Method and System for Rank-Driven Exploration of Large Structured Datasets

IP.com Disclosure Number: IPCOM000195340D
Publication Date: 2010-Apr-30
Document File: 5 page(s) / 182K

Publishing Venue

The IP.com Prior Art Database

Related People

Sihem Amer-Yahia: INVENTOR [+8]

Abstract

A method and system for rank-driven exploration of large structured datasets is disclosed. A Bottom-up Algorithm for Rank-Aware Clustering (BARAC) splits matches into different labeled groups to guide users in their exploration process. In addition, the BARAC algorithm and framework are applicable to data coming from different domains with a large number of heterogeneous attributes.

This text was extracted from a Microsoft Word document.
At least one non-text object (such as an image or picture) has been suppressed.
This is the abbreviated version, containing approximately 36% of the total text.

Method and System for Rank-Driven Exploration of Large Structured Datasets

Abstract

A method and system for rank-driven exploration of large structured datasets is disclosed.  A Bottom-up Algorithm for Rank-Aware Clustering (BARAC) splits matches into different labeled groups to guide users in their exploration process.  In addition, the BARAC algorithm and framework are applicable to data coming from different domains with a large number of heterogeneous attributes.

Description

Disclosed is a method and system for effective rank-driven exploration of large structured datasets.  A Bottom-up Algorithm for Rank-Aware Clustering (BARAC) splits matches into different labeled groups to guide users in their exploration process.  The BARAC algorithm is a bottom-up subspace-clustering algorithm.

The pseudo-code for BARAC algorithm is presented below.  The BARAC algorithm includes three subroutines namely, a BuildGrid, a Merge and a Join.

The BuildGrid procedure (Algorithm 2 shown below) starts by computing a score for each item i Є D, and then sorts the items in decreasing order of their score.  As the dataset is scanned, all distinct values for each attribute ai Є A are recorded as domain(ai).  Thereafter, each attribute ai with a corresponding domain (ai) is considered, and a grid data structure that is an array of one-dimensional histograms is computed.  Each histogram bucket will store a ranked list of items that fall within the interval it describes.  If ai is a categorical attribute with no natural ordering on its values (e.g. religion), a histogram bucket is created for each value vj Є domain (ai).  If ai is numerical (e.g., age) or ordinal categorical (e.g., body type), domain (ai) is broken down into at most maxBuckets intervals of consecutive values.  Bucket j for attribute i is denoted by grid[i][j].  After establishing interval boundaries for attribute ai (lines 3-12), best N items in S(D) from among those that fall within the range of the interval (lines 13-15) are assigned to each interval.

Further, the Merge procedure runs multiple passes of the procedure OnePassMerge (Algorithm 3 shown below) and builds an adaptive dominance based grid.  OnePassMerge takes the grid as input and expands the search space of the algorithm by considering, and possibly merging, runs of neighboring histogram buckets along the same dimension.  Once the first run of OnePassMerge is completed, OnePassMerge is invoked again on the output grid, and explores merging additional intervals.

For an attribute ai, it may be beneficial to consider the concatenation of two neighboring one-dimensional intervals grid[i][j] and grid[i][k] in the subsequent Join phase, in addition to considering both of them separately.  Additionally, this is because the top-N items of the concatenated interval are sufficiently different from the top-N items of the individual intervals, presenting additional clustering opportunities.  If, however, one of the intervals domi...