Browse Prior Art Database

Method of Estimating the Size of a Set

IP.com Disclosure Number: IPCOM000047971D
Original Publication Date: 1983-Dec-01
Included in the Prior Art Database: 2005-Feb-08
Document File: 2 page(s) / 14K

Publishing Venue

IBM

Related People

Wegman, MN: AUTHOR

Abstract

The problem solved is: Given any collection of names, estimate the number of unique names in the collection, making only one pass over the data and using a very small amount of memory. The problem is given a fixed amount of memory and a small amount of computing to answer any sequence of requests of either the form "Union an element x into a set", or of the form "How many elements are there in the set". Unioning the same element twice does not change the number of elements. We propose a technique which will give a close estimate to the number of elements in the set. The naive technique is to keep a list of all emements in the set and, when a union request comes along, check to see if the element is already in the set. If not, add it to the list. The length of the list is the number of elements in it.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 51% of the total text.

Page 1 of 2

Method of Estimating the Size of a Set

The problem solved is: Given any collection of names, estimate the number of unique names in the collection, making only one pass over the data and using a very small amount of memory. The problem is given a fixed amount of memory and a small amount of computing to answer any sequence of requests of either the form "Union an element x into a set", or of the form "How many elements are there in the set". Unioning the same element twice does not change the number of elements. We propose a technique which will give a close estimate to the number of elements in the set. The naive technique is to keep a list of all emements in the set and, when a union request comes along, check to see if the element is already in the set. If not, add it to the list. The length of the list is the number of elements in it. Elements are chosen from some universe of elements. Suppose we divided the universe in half. To keep track of elements in a set, we would keep only a list of those elements in the set which were in the first half of the universe. The list would be approximately half as long as the previous technique, and by doubling the number of elements in the list, we would have a good estimate of the number of elements in the set. If the list again became too long, we could restrict our attention to only one quarter of the universe, cut the size of the list in half (approximately) and now multiply by four to get an estimate of the actual size of the set. This process may be continued. Every time the list gets too long, halve the size of the universe, and double the multiplier. The way in which the universe is partitioned is important. For any such division, there are sets which will not be estimated correctly. A set of elements completely contained in the segment of the universe being recorded will be overestimated, and the set of elements not in that segment will be under-estimated. Rather than using one fixed partition of the universe, we may look at a class of partitions and determine how good an estimate is provided by members of the class. A class of partitions defines a set of possibly overlapping segments, namely, those segments that the universe is broken into b...