
Method and System for Smarter Resource Management for Distributed Deep Learning

IP.com Disclosure Number: IPCOM000248258D
Publication Date: 2016-Nov-11
Document File: 4 page(s) / 132K

Publishing Venue

The IP.com Prior Art Database

Abstract

A method and system is disclosed for an automated and smarter task assignment and role-aware placement of tasks to achieve high locality and performance for training of deep learning models using the best possible combination of resource offers from a distributed cluster.

Resource management for distributed deep learning requires new locality guarantees. Further, the distributed deep learning workflow does not follow the map-reduce style assumed by existing frameworks.

FIG. 1 illustrates an existing framework for resource management in distributed deep learning.

Figure 1

As illustrated in FIG. 1, resource management in the existing framework provides fairness and data locality, while distributed deep learning additionally requires parameter server (PS)-worker locality. Further, the existing framework requires manual task placement by the programmer and provides only static resource assignment to tasks.

Disclosed is a method and system for an automated and smarter task assignment and role-aware placement of tasks to achieve high locality and performance for training of deep learning models using the best possible combination of resource offers from a distributed cluster.

The method and system performs an initial assignment of tasks across the cluster to achieve fairness and locality in a two-phase process.

The parameter server tasks are placed across the cluster to achieve a fair distribution in phase one, and worker tasks are placed to achieve co-location with the corresponding parameter server within the same node or rack in phase two.
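
A minimal Python sketch of this two-phase placement is given below. It assumes Mesos-style resource offers of the form {"node", "rack"}, each large enough to host exactly one task, and uses a least-loaded-node heuristic for phase one; the function name place_tasks and the data layout are illustrative assumptions rather than the actual DIKE interface.

from collections import defaultdict

def place_tasks(ps_tasks, worker_tasks, offers):
    """Two-phase, role-aware placement over per-task resource offers."""
    free = list(offers)              # offers not yet consumed
    rack_of = {o["node"]: o["rack"] for o in offers}
    placement = {}
    ps_per_node = defaultdict(int)

    # Phase one: spread parameter server (PS) tasks fairly, always taking
    # an offer on the node that currently hosts the fewest PS tasks.
    for ps in ps_tasks:
        offer = min(free, key=lambda o: ps_per_node[o["node"]])
        placement[ps["id"]] = offer["node"]
        ps_per_node[offer["node"]] += 1
        free.remove(offer)

    # Phase two: co-locate each worker with its PS, preferring an offer on
    # the same node, then the same rack, then any remaining offer.
    for w in worker_tasks:
        ps_node = placement[w["ps_id"]]
        offer = (next((o for o in free if o["node"] == ps_node), None)
                 or next((o for o in free if o["rack"] == rack_of[ps_node]), None)
                 or free[0])
        placement[w["id"]] = offer["node"]
        free.remove(offer)
    return placement

# Example: one PS and two workers on a three-node, two-rack cluster.
offers = [{"node": "n1", "rack": "r1"}, {"node": "n1", "rack": "r1"},
          {"node": "n2", "rack": "r1"}, {"node": "n3", "rack": "r2"}]
print(place_tasks([{"id": "ps0"}],
                  [{"id": "w0", "ps_id": "ps0"}, {"id": "w1", "ps_id": "ps0"}],
                  offers))
# ps0 and w0 land on n1; w1 falls back to n2 in the same rack.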

Further, the method and system explores all possible combinations of resource offers and computes a cost model for different levels of parallelism such as, but not limited to, inter- and intra-thread parallelism within a model replica and the number of training model replicas.
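
The disclosure does not specify the cost model itself, so the sketch below only illustrates the enumeration step: it scores every combination of intra-thread parallelism, inter-thread parallelism, and replica count that fits within an offered CPU budget, using an assumed cost formula in which compute time shrinks with threads and replicas while synchronization overhead grows with the replica count.

from itertools import product

def enumerate_costs(offered_cpus, work=1.0, sync_overhead=0.05):
    """Score every (intra-thread, inter-thread, replicas) combination that
    fits within the CPUs of a resource offer. The cost formula is an
    assumed stand-in, not the one used by DIKE."""
    costs = {}
    for intra, inter, replicas in product(range(1, offered_cpus + 1), repeat=3):
        if intra * inter * replicas > offered_cpus:
            continue                 # combination does not fit this offer
        compute = work / (intra * inter * replicas)
        costs[(intra, inter, replicas)] = compute + sync_overhead * replicas
    return costs

combo, cost = min(enumerate_costs(8).items(), key=lambda kv: kv[1])
print("cheapest (intra, inter, replicas):", combo, "estimated cost:", round(cost, 3))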

Also, the method and system requests resource adjustment based on a comparison of the execution costs of different resource offer combinations.
Further, the method and system reassigns resource offers to tasks based on the resource readjustment request to achieve optimal execution cost for training the deep learning model.
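
Continuing the sketch under the same assumptions, a hypothetical readjustment step might compare the estimated costs, pick the cheapest combination, and request a revised allocation when the current one does not match it; choose_and_readjust and the returned request dictionary are illustrative only, since the actual interface to the cluster resource manager is not given in the disclosure.

def choose_and_readjust(cost_by_combo, current_cpus):
    """Pick the cheapest (intra, inter, replicas) combination and emit a
    readjustment request when it needs a different CPU allocation than
    the one currently held."""
    best_combo, best_cost = min(cost_by_combo.items(), key=lambda kv: kv[1])
    intra, inter, replicas = best_combo
    needed_cpus = intra * inter * replicas
    if needed_cpus != current_cpus:
        return {"action": "readjust", "cpus": needed_cpus,
                "combo": best_combo, "expected_cost": best_cost}
    return {"action": "keep", "combo": best_combo, "expected_cost": best_cost}

# Example: the job currently holds 4 CPUs, but the cheaper combination needs 8.
costs = {(4, 1, 1): 0.30, (4, 2, 1): 0.22}
print(choose_and_readjust(costs, current_cpus=4))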

FIG. 2 illustrates a framework (DIKE) for resource management for distributed deep learning in accordance with an embodiment of the method and system.

Figure 2

As illustrated in FIG. 2, DIKE includes the following components: Task Definition component, Role-aware Task Assignment and Placement component an...