Browse Prior Art Database

An optimal key-value store using hash table

IP.com Disclosure Number: IPCOM000249050D
Publication Date: 2017-Jan-31
Document File: 7 page(s) / 156K

Publishing Venue

The IP.com Prior Art Database

Abstract

The proposed method in this research disclosure is based on Separate chaining with linked lists strategy but extends it to provide constant lookup times and avoids skewed bucket distributions. The solution is to ensure that the least loaded buckets are used for new insertions. This criterion ensures that that per bucket load in approximately even. This is a non-trivial criterion as this usually breaks the stateless characteristics of the hashing – we now need to save the key to bucket mapping. The proposed method uses probabilistic data structures for set membership queries to overcome this limitation.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 41% of the total text.

1

An optimal key-value store using hash table

A hash table allows to efficiently access associative data . Each entry is a pair of a key and a value , and can be quickly retrieved or assigned just by knowing its key. For that, the key is hashed using a hash function , to transform that key from its original representation into an integer. This integer is then used as an index to identify the bucket in the bucket array from which the entry’s value can be accessed. Many keys can hash to the same values, meaning that these keys will be in collision in the bucket array. To resolve collisions, various techniques can be used, such as separate chaining with linked-lists, or open addressing with linear or quadratic probing.

Collision resolution

Hash collisions are practically unavoidable when hashing a random subset of a large set of possible keys. Some of the common collision resolution strategies are described below:

1. Separate chaining with linked lists

2. Open addressing with Linear/Quadratic probing probing

3. Cuckoo hashing

4. Hopscotch hashing

Drawbacks:

For separate-chaining, the worst-case scenario is when all entries are inserted into the same bucket, in which case the hash table is ineffective and the cost is that of searching the bucket data structure. If the latter is a linear list, the lookup procedure may have to scan all its entries, so the worst-case cost is proportional to the number n of entries in the table. The drawback of open addressing schemes is that the number of stored entries cannot exceed the number of slots in the bucket array. Other schemes like cuckoo hashing etc have limitations in terms of supported load factor.

Problem:

The current collision resolution strategies do not provide an optimal balanced hash bucket list distribution and suffers on worst-case lookup costs and table rebuilding. The solution suggested here avoids skewed bucket distributions (like fig below) for separate chaining with linked lists

2

The proposed method is based on Separate chaining with linked lists strategy but extends it to provide constant lookup times and avoids skewed bucket distributions.

The solution is to ensure that the least loaded buckets are used for new insertions. This criterion ensures that that per bucket load in approximately even. This is a non-trivial criterion as this usually breaks the stateless characteristics of the hashing – we now need to save the key to bucket mapping. The proposed method uses probabilistic data structures for set membership queries to overcome this limitation.

Architecture:

The architecture of the method is as shown in the diagram below. It uses a set of Bloom Filters [Ref 1] to store the buckets associated with each input. A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not, thus a Bloom filter has a 100% recall rate. In other words, a query returns either "po...