Fault-tolerant distributed hierarchical termination detection with fault reporting
Publication Date: 2014-Jan-17
The IP.com Prior Art Database
Disclosed is a system, designed for distributed applications, that efficiently enables an arbitrary tree of dependencies (including finish dependencies) between tasks across multiple processes. The system allows the detection of failure of any processes within that chain, while continuing to preserve the dependencies between tasks of the remaining processes and inform the parent task (of a task at a failed process) of said failure.
Page 01 of 5
-tolerant distributed hierarchical termination detection with fault reporting
tolerant distributed hierarchical termination detection with fault reporting
Modern distributed applications consist of multiple processes scattered across multiple host operating systems, interacting with each other in a coordinated way to produce results. Process A can spawn one or more tasks or activities at a remote process B and wait for responses. The tasks at B may internally spawn and issue requests on processes C and D, possibly waiting for those to "finish" before responding to B . A
"finish" dependency requires that all spawned tasks (including, recursively, tasks spawned by these spawned tasks) terminate before the parent task can progress . The ordering of process dependencies, associated locations within the distributed system, and the side effects of each process on state may be arbitrary depending on the data , system load, or other factors from run to run.
The novel contribution is a system that efficiently enables an arbitrary tree of dependencies (including finish dependencies) between tasks across multiple processes in order to detect the failure of any processes within that chain , as well as continue to 1) preserve the dependencies between tasks of the remaining processes , and 2) inform the parent task (of a task at a failed process) of said failure.
Previous work in this area can handle failure detection and reporting of failure for configurations of tasks scattered across multiple processes with a single "finish" at the root. The described approach is a unique method for handling an arbitrarily deep and complex chain of dependencies and preserving ordering in an efficient and distributed manner.
Description of Algorithm
The waiting-for-completion relationship between a task and its child(ren), is stored in the memory of the parent task (i.e. the task that is waiting for those child tasks to complete before continuing). This means that the dependency graph between all parents and children is distributed in the memory of those same tasks . In the event of a failure of some process, there may be surviving parent tasks, as well as surviving child tasks. Both the children and parent of the failed task may have useful work in progress , and it is necessary to continue execution by preserving what remains of the computation. To accomplish this, the parent of the failed task inherits the dependencies of the failed child. It "adopts" its grandchildren, bringing the dependency state of the failed task into its own dependency state . This allows the parent task to
wait for its indirect dependencies, performing termination detection on the grandchildren, preserving overall execution order. In addition, the parent task is notified of the failure of the child, enabling it to perform actions to recover from the failed task , based on the needs of the overall program.
The novel implementation stores the dependency state information for t...