Repeatable Failure Data Capture – Framework to aid in Root Cause Analysis
Publication Date: 2014-Jul-25
The IP.com Prior Art Database
Abstract: This article presents a system for dynamically identifying the software components involved in processing a request that leads to a failure. To collect a detailed trace/log of execution, the log level for the identified components is changed dynamically (to a finer level) for the duration of a rerun.
Disclosed is a system for collecting fine-grained debug logging information to aid in root cause analysis of failed requests, primarily in a distributed system but applicable to others. Distributed systems involve interaction between multiple components, wherein a request flows from one component to another before being serviced. Crossing component boundaries adds to the complexity of debugging to understand the root cause of a failure, especially when some of these components could be third-party or hosted services. Logging of application behavior is not only a good software engineering practice but also a valuable tool for debugging. Logging implementations allow logs to be generated at various levels of granularity, e.g., DEBUG, INFO, ERROR and so on. The gradation from SPARSE to FINEST allows developers and deployment engineers to choose logging levels to suit their needs. Developers do unit testing in their own environment, while testers perform different types of tests, some of which mimic the production environment where the application will be deployed. However, the production environment is still different.
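As a minimal illustration of how a log level filters what a component emits, the following sketch uses Python's standard logging module. The component name "orders" and the helper emit_at_level are hypothetical and are not part of the disclosed system. Python's level names (DEBUG, INFO, ERROR) differ from the SPARSE-to-FINEST style of other frameworks, but the filtering idea is the same.

```python
import io
import logging


def emit_at_level(level: int) -> list[str]:
    """Log one DEBUG, one INFO and one ERROR record; return those that pass `level`."""
    buffer = io.StringIO()
    handler = logging.StreamHandler(buffer)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

    logger = logging.getLogger("orders")  # hypothetical component name
    logger.propagate = False              # keep output out of the root logger
    logger.addHandler(handler)
    logger.setLevel(level)
    try:
        logger.debug("parsed request payload")
        logger.info("request accepted")
        logger.error("downstream call failed")
    finally:
        logger.removeHandler(handler)
    return buffer.getvalue().splitlines()
```

Calling emit_at_level(logging.ERROR) keeps only the error record, which is roughly what a production setting looks like, while emit_at_level(logging.DEBUG) keeps all three records.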
Production systems are sensitive to performance, and throughput (often measured in requests per second) is an important metric. Logging levels are therefore turned down to a minimal level to preserve performance. Failures in production systems are hard to trace, primarily because the issue is difficult to recreate in a developer sandbox environment that does not mimic the production environment and lacks production-level data, which hinders identification of the root cause of problems. Often, support tickets stay open for a long time with requests for additional logs, instructions on enabling detailed logging of one or more components, and/or deployment of a hotfix to capture more logs (in specific modules). All this adds to the time taken to resolve a ticket, keeping some critical defects open in production.
This disclosure deals with supporting dynamic logging in such high-performance production environments, which will enable support and developer...
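The idea of raising the log level only for the involved components, and only for the duration of a rerun, can be sketched as follows. This is an illustrative sketch using Python's standard logging module, not the disclosed implementation. The names finer_logging and rerun_with_trace are hypothetical, and the list of component names is assumed to come from a separate identification step that this sketch does not cover.

```python
import logging
from contextlib import contextmanager


@contextmanager
def finer_logging(component_names, level=logging.DEBUG):
    """Temporarily set `level` on the named loggers, restoring the old levels on exit."""
    loggers = [logging.getLogger(name) for name in component_names]
    saved_levels = [lg.level for lg in loggers]
    for lg in loggers:
        lg.setLevel(level)
    try:
        yield
    finally:
        # Restore the original levels even if the rerun raises.
        for lg, old_level in zip(loggers, saved_levels):
            lg.setLevel(old_level)


def rerun_with_trace(request_handler, request, component_names):
    """Re-run a failed request with finer logging on the identified components only."""
    with finer_logging(component_names):
        return request_handler(request)
```

Because the original levels are restored in a finally clause, the rest of the system keeps its minimal production logging, and the finer level applies only while the rerun is in progress.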