
Global Snapshots for Distributed Debugging

IP.com Disclosure Number: IPCOM000245647D
Publication Date: 2016-Mar-24

Publishing Venue

The IP.com Prior Art Database

Abstract

This invention implements a system that triggers and collects a complete picture of debug information from all related applications running in a distributed system.

The system allows all system artifacts, including nodes, communication channels, events, logs, and instructions, to be dumped uniformly according to their relations. Developers debug a complete program by analyzing the dump information with fine-grained instrumentation that is capable of exposing instruction-level information.

The system supports both auto-detected and user-defined process topologies. It adopts a master-slave framework to ensure that a complete dump of debug information is captured across the defined process topology, which can happen at the moment an issue arises.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 26% of the total text.


Key concepts:

Distributed System

It is a software system in which components located on networked computers communicate and coordinate their actions by passing messages. The components interact with each other in order to achieve a common goal.

Snapshot

A collection of a process's CPU, memory, stack, I/O, file descriptor (FD), and connection usage states, etc., which can be collected across hosts in a distributed system.
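
As a rough illustration of this definition, a snapshot can be modeled as a record keyed by host and process ID. The sketch below is an assumption, not the disclosure's actual format: the `Snapshot` class and the `take_local_snapshot` function are hypothetical names, and only the fields that the Python standard library exposes portably are filled in. Memory, I/O, and connection details are left as placeholders.

```python
import os
import socket
import threading
import time
import traceback
from dataclasses import dataclass, field, asdict


@dataclass
class Snapshot:
    """Hypothetical record of one process's state at one point in time."""
    host: str
    pid: int
    taken_at: float       # wall-clock capture time, seconds since the epoch
    cpu_seconds: float    # CPU time consumed by this process so far
    thread_count: int
    stack: list = field(default_factory=list)        # formatted stack frames
    open_fds: list = field(default_factory=list)     # empty where unavailable
    connections: list = field(default_factory=list)  # placeholder, not collected


def take_local_snapshot() -> Snapshot:
    """Capture the calling process's state using only the standard library."""
    try:
        # Linux-specific listing of open file descriptors.
        fds = sorted(int(name) for name in os.listdir("/proc/self/fd"))
    except OSError:
        fds = []  # other platforms: leave the field empty
    return Snapshot(
        host=socket.gethostname(),
        pid=os.getpid(),
        taken_at=time.time(),
        cpu_seconds=time.process_time(),
        thread_count=threading.active_count(),
        stack=traceback.format_stack(),
        open_fds=fds,
    )


snap = take_local_snapshot()
record = asdict(snap)  # plain dict, ready to serialize and send to a collector
```

Tagging every record with `host` and `pid` is what would let a collector merge snapshots taken on different hosts into one global view.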

Process Topology

A process topology may be defined as a group of processes that have a predefined regular interconnection topology such as a farm, ring, 2D mesh, or tree.
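
To make the four example topologies concrete, each one can be written as an adjacency map from a process index to the indices of its neighbours. This is a minimal sketch under stated assumptions: the function names are hypothetical, processes are numbered from 0, the farm uses process 0 as the farmer, and the tree is assumed to be binary (the text does not specify the branching factor).

```python
def ring(n: int) -> dict:
    """Each process i is connected to its two neighbours on a ring."""
    return {i: sorted({(i - 1) % n, (i + 1) % n} - {i}) for i in range(n)}


def farm(workers: int) -> dict:
    """Process 0 is the farmer; processes 1..workers connect only to it."""
    topo = {0: list(range(1, workers + 1))}
    topo.update({w: [0] for w in range(1, workers + 1)})
    return topo


def mesh_2d(rows: int, cols: int) -> dict:
    """Process r*cols + c connects to its up/down/left/right neighbours."""
    topo = {}
    for r in range(rows):
        for c in range(cols):
            nbrs = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            topo[r * cols + c] = sorted(
                rr * cols + cc
                for rr, cc in nbrs
                if 0 <= rr < rows and 0 <= cc < cols
            )
    return topo


def binary_tree(n: int) -> dict:
    """Process i connects to parent (i-1)//2 and children 2i+1, 2i+2."""
    topo = {i: [] for i in range(n)}
    for i in range(1, n):
        parent = (i - 1) // 2
        topo[parent].append(i)
        topo[i].append(parent)
    return {i: sorted(v) for i, v in topo.items()}


print(ring(4))
print(farm(3))
```

A map of this shape tells a debugging tool which processes are related, so a dump triggered on one process can be extended to its neighbours.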

Problem:

Debugging programs on a non-distributed system is a fairly well-understood task. A good interactive debugger supports breakpoints and single-step execution for a line-by-line analysis of the effect of procedures and instructions on program state, all in the context of the original source code. At any moment in time, the user can halt execution and examine any aspect of the program's state, tracing the relationship between source code and error symptoms at whatever level of detail desired.

To scale to today's complex distributed software systems, debugging and replaying techniques mostly focus on single facets of software, e.g., local concurrency, distributed messaging, or data representation. This forces developers to tediously combine different technologies such as instruction-level dynamic tracing, event log analysis, or global state reconstruction to gradually explain non-trivial defects.

This paper proposes a system that provides iterative, interactive, and homogeneous debugging for distributed systems. Unlike other approaches, this system allows all system artifacts, including nodes, communication channels, events, logs, and instructions, to be dumped uniformly according to their relations. Developers debug a complete program by analyzing the dump information with fine-grained instrumentation that is capable of exposing instruction-level information.

Existing solutions:


Replay

Record a log of the process execution, and replay it later with both forward and reverse execution commands.
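
The record-then-replay idea can be sketched in a few lines. Note the simplification: this example logs a copy of the program state after each step and replays by moving a cursor over that log, whereas real replay debuggers typically record nondeterministic inputs and re-execute the program. The `ExecutionLog` and `Replayer` classes are hypothetical names used only for illustration.

```python
import copy


class ExecutionLog:
    """Records a copy of the program state after every executed step."""

    def __init__(self, initial_state: dict):
        self.entries = [("start", copy.deepcopy(initial_state))]

    def record(self, label: str, state: dict) -> None:
        self.entries.append((label, copy.deepcopy(state)))


class Replayer:
    """Moves a cursor over a recorded log; no code is re-executed."""

    def __init__(self, log: ExecutionLog):
        self.entries = log.entries
        self.pos = 0

    def current(self):
        return self.entries[self.pos]

    def step_forward(self):
        self.pos = min(self.pos + 1, len(self.entries) - 1)
        return self.current()

    def step_back(self):
        self.pos = max(self.pos - 1, 0)
        return self.current()


# Record a tiny run.
state = {"balance": 100}
log = ExecutionLog(state)
state["balance"] -= 30
log.record("withdraw 30", state)
state["balance"] += 5
log.record("deposit 5", state)

# Replay it later, forward and then in reverse.
replay = Replayer(log)
first = replay.step_forward()   # ("withdraw 30", {"balance": 70})
second = replay.step_forward()  # ("deposit 5", {"balance": 75})
back = replay.step_back()       # ("withdraw 30", {"balance": 70})
```

Copying the state at every step keeps the sketch short, but it also shows why full-state logging scales poorly for long-running processes.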



Instrument code

By adding some simple, easily removable, and (usually) well-isolated instrumentation calls to the source code, developers can quickly expose the program flow and identify where the code is going. They then try to reproduce the same issue and record the running status of the process via the instrumentation code for investigation.
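
One common way to keep instrumentation isolated and easy to remove is a decorator that logs function entry, exit, and errors. The following is a minimal sketch, not the approach of any particular tool: the `traced` decorator, the `trace` logger name, and the `transfer` function are all hypothetical.

```python
import functools
import logging

trace_log = logging.getLogger("trace")  # hypothetical logger name


def traced(func):
    """Log entry, exit, and exceptions of the wrapped function."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        trace_log.debug("enter %s args=%r kwargs=%r", func.__name__, args, kwargs)
        try:
            result = func(*args, **kwargs)
        except Exception:
            trace_log.exception("error in %s", func.__name__)
            raise
        trace_log.debug("exit %s -> %r", func.__name__, result)
        return result
    return wrapper


@traced
def transfer(balance: int, amount: int) -> int:
    """Example function under investigation."""
    if amount > balance:
        raise ValueError("insufficient funds")
    return balance - amount


logging.basicConfig(level=logging.DEBUG)  # send trace records to stderr
transfer(100, 30)
```

Removing the instrumentation means deleting the `@traced` line; the function body is untouched. The drawback is that the issue must be reproduced after the calls are added, since nothing is recorded for runs that happened before.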

None of the above provides policies for multi-dimensional resource de-fragmentation and fragment management.

Distributed systems are evolving rapidly and now span large numbers of hosts and application relationships. Because of their scale, these systems are difficult to develop, test, and debug. When a bug occurs on such a system, it is often difficult to track down, because the bug exhibits itself only in a certain scenario (especially the moment when t...