Mechanism for a Component driven, life-Cycle based, on-demand trace.
Original Publication Date: 2005-May-17
Included in the Prior Art Database: 2005-May-17
The common mechanisms available today for problem determination in software are trace and First Failure Data Capture (FFDC). FFDC collects a snapshot of a portion of the program's state only after a problem has occurred, which is often not enough for full problem determination. Trace, on the other hand, has to be enabled and the problem reproduced in order to collect data pertaining to it. Because trace typically degrades performance, it can be a serious burden on a customer's production system, and it may even preclude reproducing certain classes of problems, namely timing and threading issues.
For problem determination, what is needed in addition to an FFDC-like snapshot of the program state is call-flow data showing how the program reached that state. It often makes more sense to collect this data at the object or component level rather than program-wide, because call-flow data is only relevant during the current life cycle of the object or component. If the collection mechanism is efficient enough, it can be "on" all the time and can record call-flow data from the first occurrence of a problem. This minimizes the instances in which customers have to recreate problems and generate traditional trace data.
The core idea is that the life cycle over which call-flow data is gathered is defined by the programmer. The life cycle for which data might be collected can be bounded by many things, such as the length of a transaction or the life of a database connection. In the transaction case, for example, the relevant data collection would occur from transaction start to transaction end.
Further, the programmer can also define the conditions under which the collected data is saved or discarded. Typically the data will be saved to disk when an error condition occurs. Conversely, should a life cycle complete successfully, the collected data can be discarded. In the transaction example above, if the transaction completes with an expected commit or rollback, the data might be discarded. However, should an error condition occur, the collected data (referred to here as flight recorder data) can be saved to disk for problem determination purposes.
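The life-cycle and save-or-discard behaviour described above can be sketched as follows. This is a minimal illustration only, not the mechanism's actual interface: the names CallFlowBuffer, life_cycle and run_transaction, the numeric point ids, and the JSON dump format are all assumptions made for the example. A context manager marks the start and end of the life cycle, and an exception stands in for the error condition.

```python
import json
import os
from contextlib import contextmanager


class CallFlowBuffer:
    """Holds the call-flow points gathered during one life cycle (hypothetical)."""

    def __init__(self):
        self.points = []

    def mark(self, point_id):
        # Record that execution reached a marked branch point.
        self.points.append(point_id)


@contextmanager
def life_cycle(name, dump_dir):
    """Collect call-flow data for one programmer-defined life cycle.

    The data is written to disk only if the body raises an exception;
    on normal completion it is simply dropped.
    """
    buf = CallFlowBuffer()
    try:
        yield buf
    except Exception:
        # Error condition: save the collected points, then re-raise.
        path = os.path.join(dump_dir, name + ".flight.json")
        with open(path, "w") as fh:
            json.dump(buf.points, fh)
        raise
    # Normal completion: nothing is saved and buf is discarded.


def run_transaction(name, dump_dir, fail):
    """Stand-in for a transaction bounded by start and end."""
    with life_cycle(name, dump_dir) as rec:
        rec.mark(1)  # hypothetical id: transaction started
        if fail:
            rec.mark(2)  # hypothetical id: failing branch taken
            raise RuntimeError("simulated failure")
        rec.mark(3)  # hypothetical id: commit reached
```

Here a transaction that commits leaves nothing behind, while one that fails leaves a file holding the points reached before the error. A real implementation would likely let the programmer supply the save condition rather than hard-coding "any exception".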
The flight recorder consists of three parts. The first is an application program interface (API) that a programmer uses to mark call-flow branch points. The second is a runtime mechanism that gathers the data from these marked branch points and then stores or discards it. The third is a post-processing mechanism that converts the collected data into human-readable form.
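One way to picture how the three parts fit together is the sketch below. It is illustrative only and rests on several assumptions that are not stated here: each branch point is identified by a small integer, the runtime keeps the points in a fixed-size ring buffer, and post-processing is a simple table lookup. The names FlightRecorder, mark, snapshot, discard, decode and the MESSAGES table are hypothetical.

```python
from array import array

# Hypothetical branch-point ids and texts, invented for this example.
MESSAGES = {
    100: "connection opened",
    101: "query cache hit",
    102: "query cache miss",
    199: "connection closed",
}


class FlightRecorder:
    """Fixed-size ring buffer of integer call-flow points (an assumed design)."""

    def __init__(self, capacity=1024):
        self._buf = array("i", [0] * capacity)  # compact signed C ints
        self._capacity = capacity
        self._count = 0  # total points recorded since the last discard

    # Part 1: the API the programmer calls at a branch point.
    def mark(self, point_id):
        self._buf[self._count % self._capacity] = point_id
        self._count += 1

    # Part 2: the runtime side, which keeps or drops what was gathered.
    def snapshot(self):
        """Return the retained points, oldest first."""
        n = min(self._count, self._capacity)
        start = self._count - n
        return [self._buf[i % self._capacity] for i in range(start, self._count)]

    def discard(self):
        """Drop everything gathered so far."""
        self._count = 0


# Part 3: post-processing into human-readable text.
def decode(points, messages=MESSAGES):
    return [messages.get(p, "unknown point %d" % p) for p in points]
```

With a capacity of 3, marking 100, 101, 102 and 199 in turn retains only the last three points, so the oldest entry is overwritten rather than growing memory use. Whether the actual mechanism bounds its storage this way is not stated; the ring buffer is simply a common choice for always-on recording.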
The API is designed to allow for efficient recording of call-flow trace data. Each message or data point to be stored consists of an integer. Integers are handled efficiently in most computer languages and hardware and consume a small amount of memory, typically only 32 bits of data. During the post-processing phase, these integer messages will be mapped to a message, si...