Method And System For Optimizing Statistical Count Estimation For A Full Text Search Using Multiple Full Text Collections In A Content Management System

IP.com Disclosure Number: IPCOM000227991D
Publication Date: 2013-May-31
Document File: 5 page(s) / 30K

Publishing Venue

The IP.com Prior Art Database

Abstract

A method and system for optimizing statistical count estimation for a full text search using multiple full text collections in a content management system is disclosed. The method and system optimize the performance and reliability of the count estimate when the estimate must meet a minimum threshold.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 27% of the total text.

Disclosed is a method and system for optimizing statistical count estimation for a full text search using multiple full text collections in a content management system. The statistical count estimation is optimized based on a requested minimum threshold of full text search hits. Full text indexing can store indexed data in separate files, which requires searching across multiple collections. The method and system integrate formulas into algorithms for a cross-collection estimate that includes a determination of when to abort an estimation. Assuming uniformity of search hits within the collections, the estimation continues across collections while the count is on track to reach the requested threshold and is discontinued when it is not. The formulas specify the most efficient accuracy range to use in order to obtain a reliable count estimate across multiple collections when a required threshold must be exceeded.

In one scenario, the content management system must decide whether to search the full text repository first, which is also known as Content Based Retrieval (CBR), or the relational database first. The content management system also stores metadata for the full text indexed objects in a relational database. The decision is based on the size of the full text count estimate. If the estimate exceeds a specified threshold, the search can be executed more efficiently by executing the database query first.
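The decision rule in this scenario can be illustrated with a minimal sketch. The function name and the returned labels are hypothetical, and the choice of searching the full text repository first when the threshold is not exceeded is an assumption implied by, but not stated in, the text above.

```python
def choose_search_order(estimated_hits, threshold):
    """Pick which store to query first (illustrative names only).

    estimated_hits: the full text count estimate
    threshold: the specified threshold from the scenario
    """
    if estimated_hits > threshold:
        # Many full text hits: the text says the database query
        # should be executed first.
        return "database_first"
    # Assumption: otherwise the full text (CBR) search goes first.
    return "full_text_first"


print(choose_search_order(5000, 1000))
print(choose_search_order(200, 1000))
```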

Count Estimation Algorithm

The count estimation algorithm for a multi-threaded software application is as follows.


1. Insert the requested threshold into a variable so that each worker thread can access it. This variable is termed the running stop limit L.

2. Generate a list of all collections accessible by the search and order them by descending size.


3. Assign worker threads to the largest collections first. Each worker thread then performs a count estimate using the running stop limit L and an accuracy range value that is provided by the calculation steps that follow.

The full text software will stop estimating in a collection when it reaches the stop limit. The accuracy range is used by the full text software to limit the search range of random numbers, which have been assigned to each document. For example, the accuracy range can vary from 100 to 10000, where 100 gives the slowest but most accurate search and 10000 gives the fastest but least accurate search.


4. As each worker thread returns, its count estimate is subtracted from the running stop limit L. If the result is negative, no more workers are started and the threshold is known to have been exceeded. Other active worker threads will finish later and will have no effect on the result. If the result is greater than zero, a new worker thread is started...
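Steps 1 through 4 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the disclosed implementation: estimate_hits is a hypothetical stand-in for the full text software's count estimate, the collection dictionaries and the max_workers value are invented for the example, and accuracy_range is a fixed placeholder because the formulas that calculate it are not part of this abbreviated text. The sketch also omits the check that aborts an estimation that is not on track, and it treats a result of exactly zero as "not yet exceeded", a case the text does not specify.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def estimate_hits(collection, stop_limit, accuracy_range):
    """Hypothetical stand-in for the full text engine's count estimate.

    Assumption: counting stops as soon as the estimate passes stop_limit.
    accuracy_range is accepted but unused in this stub.
    """
    return min(collection["hits"], stop_limit + 1)


def threshold_exceeded(collections, threshold, max_workers=2, accuracy_range=100):
    remaining = threshold  # Step 1: the running stop limit L.
    # Step 2: order the collections by descending size.
    queue = sorted(collections, key=lambda c: c["size"], reverse=True)
    pending = set()

    with ThreadPoolExecutor(max_workers=max_workers) as pool:

        def start_workers():
            # Step 3: hand the largest remaining collections to free workers,
            # passing the current value of L as the stop limit.
            while queue and len(pending) < max_workers:
                collection = queue.pop(0)
                pending.add(
                    pool.submit(estimate_hits, collection, remaining, accuracy_range)
                )

        start_workers()
        while pending:
            done, not_done = wait(pending, return_when=FIRST_COMPLETED)
            pending.clear()
            pending.update(not_done)
            # Step 4: subtract each returned estimate from L. Only this
            # thread updates `remaining`, so no lock is needed here.
            for future in done:
                remaining -= future.result()
            if remaining < 0:
                # Threshold exceeded: start no more workers. Workers that
                # are still active finish, but their results are ignored.
                return True
            start_workers()
    return False


collections = [
    {"name": "A", "size": 1000, "hits": 40},
    {"name": "B", "size": 5000, "hits": 70},
    {"name": "C", "size": 200, "hits": 5},
]
print(threshold_exceeded(collections, threshold=100))
print(threshold_exceeded(collections, threshold=200))
```

Because each worker receives the value of L at the time it starts, a worker that hits its stop limit always drives the running total negative, which is what lets the main loop stop early without waiting for the remaining collections.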