Browse Prior Art Database

Method and System for Utilizing an Efficient Framework to Retrieve Related Queries

IP.com Disclosure Number: IPCOM000237977D
Publication Date: 2014-Jul-24
Document File: 6 page(s) / 115K

Publishing Venue

The IP.com Prior Art Database

Related People

Amit Goyal: INVENTOR [+2]

Abstract

A method and system is disclosed for utilizing an efficient framework to retrieve related queries. The method and system utilizes a vanilla Locality Sensitive Hashing (LSH) to retrieve the related queries.

This text was extracted from a Microsoft Word document.
At least one non-text object (such as an image or picture) has been suppressed.
This is the abbreviated version, containing approximately 22% of the total text.

Method and System for Utilizing an Efficient Framework to Retrieve Related Queries

Abstract

A method and system is disclosed for utilizing an efficient framework to retrieve related queries.  The method and system utilizes a vanilla Locality Sensitive Hashing (LSH) to retrieve the related queries.

Description

Disclosed is a method and system for utilizing an efficient framework to retrieve related queries.

The method and system utilizes a Locality Sensitive Hashing (LSH) technique and proposes novel variants in a distributed computing environment for instance, but not limited to, Hadoop*.  The method includes the step of identifying several optimizations which improve performance and are suitable for deployment in very

large scale settings.  The proposed variants achieve robust performance with better recall even when same space is used which is determined by number of hash tables.

The method and system focuses on the problem of finding nearest neighbors for a given query from very large scale query logs available from a commercial search engine.  The method includes the step of finding nearest neighbors by performing a small number of comparisons that are sublinear in the dataset size.

When seeking exact matches for queries, effective solutions are based on storing values in a hash table and mapping them using hash functions.  The approximate generalization of this approach is the framework of LSH, where queries are more likely to collide under the hash function if they are more alike, and less likely to collide if they are less alike.  The method and system utilizes the framework within a distributed system and takes advantage of its distributed computing power.

Initially, the method and system utilizes a vanilla LSH algorithm based on a seminal research and proposes four novel variants of the vanilla LSH algorithm.  Here, two of the novel variants achieve significantly better recall than the vanilla LSH by using the same number of hash tables.  The main idea behind these variants is to intelligently probe multiple nearby buckets within a table that have a high probability of containing the nearest neighbors of a query.  The method and system utilizes the framework for efficiently finding nearest neighbors for a given query from a commercial large scale query logs in a sublinear time.  The method and system also focuses on finding related queries and removing near duplicate queries.

The method and system starts with user query logs C having query vectors collected from a commercial search engine over some domain such as, Uniform Resource Locators (URL).  The method and system then measures similarity between queries using cosine between the corresponding query vectors.  The formulated problem is given as a set of queries Q and a similarity threshold as Ƭ.  The method and system develops a batch process to return a small set T of candidate neighbors from C for each query q ϵ Q such that the following conditions are satisfied,...