Gridsync: a Data Management and Publishing Appliance for a Cluster

IP.com Disclosure Number: IPCOM000240782D
Publication Date: 2015-Feb-28
Document File: 6 page(s) / 126K

Publishing Venue

The IP.com Prior Art Database

Abstract

Disclosed is a utility, called Gridsync, to efficiently distribute data files/directories to a set of hosts in a cluster.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 36% of the total text.

Distributing data across a large cluster takes considerable effort, as does keeping the data up to date, managing it, and finding hosts that hold valid data for Service-Oriented Architecture (SOA) applications. This disclosure addresses two main problems. First, a cluster can easily scale to thousands of hosts, so clients need a tool to efficiently manage and monitor the data on a subset of the hosts in a large cluster. Second, tasks in a cluster have data dependencies: the data must exist on a host before the scheduler can dispatch tasks to that host.

The novel contribution is a tool that helps clients conveniently replicate, manage, and use data in a cluster. Gridsync is a utility to efficiently distribute data files/directories to a set of hosts in a cluster. Gridsync reduces the effort and complexity of data distribution and management in a cluster. With Gridsync, clients can easily replicate data to a subset of the hosts in a cluster and manage it afterward. Gridsync also monitors the data status and ensures the data is up to date on each host.

Working with a task scheduler, Gridsync reduces task waiting time by pre-staging data to hosts even before task submission. For certain types of tasks, the task scheduler can predetermine, before the tasks are submitted, the subset of hosts on which the tasks can run, based on the properties of the tasks. Gridsync can then pre-stage the data to those hosts, so that the tasks, once submitted, run there without waiting for data distribution. Alternatively, Gridsync can transfer data on the task scheduler's demand when it schedules the tasks. The task scheduler can generate a host list for tasks that have data dependencies. The list contains the hosts to which the scheduler may dispatch the tasks. After Gridsync pushes the data to those hosts, the system informs the scheduler, and the tasks can then run on the hosts that have the data locally.
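The interaction between the scheduler's host list and data pre-staging can be illustrated with a minimal Python sketch. It is not Gridsync's implementation: the names Task, Cluster, pre_stage, and dispatchable_hosts are hypothetical, and the transfer itself is replaced by in-memory bookkeeping.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Task:
    """A task with a data dependency (hypothetical model)."""
    name: str
    dataset: str                # dataset the task needs locally
    candidate_hosts: List[str]  # host list generated by the scheduler


@dataclass
class Cluster:
    """Tracks which datasets are present on which hosts (assumed bookkeeping)."""
    local_data: Dict[str, Set[str]] = field(default_factory=dict)

    def has_data(self, host: str, dataset: str) -> bool:
        return dataset in self.local_data.get(host, set())

    def push(self, host: str, dataset: str) -> None:
        # Stand-in for a real transfer; only records the result.
        self.local_data.setdefault(host, set()).add(dataset)


def pre_stage(cluster: Cluster, dataset: str, hosts: List[str]) -> List[str]:
    """Push the dataset to every listed host that lacks it.

    Returns the hosts that actually received data, in list order.
    """
    pushed = []
    for host in hosts:
        if not cluster.has_data(host, dataset):
            cluster.push(host, dataset)
            pushed.append(host)
    return pushed


def dispatchable_hosts(cluster: Cluster, task: Task) -> List[str]:
    """Hosts from the scheduler's list that already hold the task's dataset."""
    return [h for h in task.candidate_hosts
            if cluster.has_data(h, task.dataset)]
```

In this sketch the same pre_stage call serves both modes described above: invoked before submission it pre-stages data, and invoked at scheduling time it acts as an on-demand transfer. Hosts that already hold the dataset are skipped, so repeating the call is harmless.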

The Gridsync utility includes the following components:

Repository Server (RS): master daemon for dataset management and synchronization. The RS maintains a central "golden" copy of each dataset.

Repository Server Agent (RSA): agent daemon for dataset management and synchronization. RSAs run on all compute hosts in the cluster, and are responsible for maintaining the local dataset copy and distributing datasets to peers.

Datasets: collections of files/directories managed by Gridsync, representing the data a client intends to distribute to hosts. The size of a dataset can vary from hundreds of megabytes to a few terabytes.
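The relationship between these components can be sketched in Python as follows. This is an illustration under stated assumptions, not the disclosed design: the disclosure does not say how the RS and RSAs detect out-of-date copies, so the sketch assumes a per-dataset version counter, and all class, method, and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RepositoryServer:
    """Master side: tracks the golden copy of each dataset.

    Assumption: the golden copy is represented only by a version number.
    """
    golden_versions: Dict[str, int] = field(default_factory=dict)

    def publish(self, dataset: str) -> int:
        """Record a new golden version of the dataset and return it."""
        version = self.golden_versions.get(dataset, 0) + 1
        self.golden_versions[dataset] = version
        return version


@dataclass
class RepositoryServerAgent:
    """Agent side: one per compute host, tracks the local copy."""
    host: str
    local_versions: Dict[str, int] = field(default_factory=dict)

    def sync(self, rs: RepositoryServer, dataset: str) -> None:
        # Stand-in for a real transfer; only records the new local version.
        self.local_versions[dataset] = rs.golden_versions[dataset]

    def is_current(self, rs: RepositoryServer, dataset: str) -> bool:
        golden = rs.golden_versions.get(dataset)
        return golden is not None and self.local_versions.get(dataset) == golden


def stale_hosts(rs: RepositoryServer,
                agents: List[RepositoryServerAgent],
                dataset: str) -> List[str]:
    """Hosts whose local copy is missing or older than the golden copy."""
    return [a.host for a in agents if not a.is_current(rs, dataset)]
```

A monitoring loop could call stale_hosts periodically and trigger sync on the hosts it returns. Peer-to-peer distribution between RSAs is left out of this sketch.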

Pipelined Dataset Transfer

A client can create a dataset in Gridsync and upload the files to this dataset. Clients can also specify the subset of hosts in the cluster to which to distribute the data. At the beginning, all data for distribution is uploaded to the RS. After the upload finishes, the RS...