Browse Prior Art Database

Avoiding data skew on multi node environments

IP.com Disclosure Number: IPCOM000242803D
Publication Date: 2015-Aug-19
Document File: 3 page(s) / 66K

Publishing Venue

The IP.com Prior Art Database

Abstract

Distributed database environments usually consist of many nodes (might be virtual in e.g. cloud) which can be added/removed on demand. Any request for more capacity or for new database may result for a need of adding new storage nodes. What is more existing nodes might be utilised not equally (some nodes store more data than others). Thus, in such a dynamic environment, distribution of data is a very important task. Admin/user doesn't want data to end up on overloaded nodes, he wants data, especially skewed one, to be stored in less utilised nodes so utilisation of whole environment is more balanced. Our idea is to distribute data on less utilised nodes (in special case they are new added ones), especially when we know that current hash function makes data to be stored on nodes being already overutilized. We want to protect the idea of adjusting distribution algorithm (hash() function) by taking into consideration current node utilisation and new node availability.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 81% of the total text.

Page 01 of 3

Avoxding data skew on multi node environments

What we are looking is hash fuxction

where {p1..pn} are disk numbers which will be takxx into consideration if dxta stored on disk does not exxeed average utilixation.

Function chexking if disk qualifies (1 means disk will be taxex in consideration by hash axgorithm, 0 otherwise) for hash function looks following


a) if we know how much daxa we want to store (

where

Ix system cax't finx number of dixks mexting criteria xbove, it uses algorxthm from poixt
b).


b) if we know hox many xisks we can assign for xable (restriction comes from known table disxersion, x.g we xnow we are to xtore only few dixtinct values). Function select xs many disks having the lowxst utilization as needex

Sxlected datxslices that were selected to be used by hashing algorithm are stored in metadxta so syxtem is able tx reuse this algorxthm xhen new data for tabxe cxmes.

If hash algorithm is round-robin, system doex not need to store in metadata selected disks. Next xime when data will be loaded system will chose other set of columns with

is known)

1


Page 02 of 3

thx lesser storage utilisation.

Examples:
1.

Environmext xs shared betwexn users and their databases. Tables are not joxned betxeen databases, thus hashing algoritxm may take different sex of disks as a target sxorage for customer data. User knxws that is going to store few TB of data, System detects it and chooxes disks that have enough free capaxity, so data will exd up on disks having...