Software Counters for First-Failure Data Capture

IP.com Disclosure Number: IPCOM000116498D
Original Publication Date: 1995-Sep-01
Included in the Prior Art Database: 2005-Mar-30
Document File: 4 page(s) / 143K

Publishing Venue

IBM

Related People

Ratcliff, BH: AUTHOR

Abstract

Disclosed is a method for using counters embedded in software programs to aid in debugging. Placing software counters at various points in the normal software flow and in all error conditions makes it possible to determine the past and present status of the system. The software counters are continuously updated while the system is running. When an error occurs or an abnormal path in the code is taken, the associated counter is incremented. This keeps a permanent log of all system activity, and the information is never overlaid or lost.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 41% of the total text.

      Disclosed is a method for using counters embedded in software
programs to aid in debugging.  Placing software counters at various
points in the normal software flow and in all error conditions makes
it possible to determine the past and present status of the system.
The software counters are continuously updated while the system is
running.  When an error occurs or an abnormal path in the code is
taken, the associated counter is incremented.  This keeps a permanent
log of all system activity, and the information is never overlaid or
lost.
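
      The idea can be sketched in C as a fixed table of counters with
one entry per instrumented point.  This is a minimal illustration,
not the code of any product: the counter names, the handle_request
routine, and the 4096-byte length limit are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical counter identifiers: one per instrumented point. */
enum counter_id {
    CTR_REQ_RECEIVED,    /* normal path: request entered the routine */
    CTR_REQ_COMPLETED,   /* normal path: request finished */
    CTR_ERR_BAD_LENGTH,  /* error path: length check failed */
    CTR_ERR_NO_BUFFER,   /* abnormal path: no buffer available */
    CTR_COUNT
};

/* Fixed-size table: storage does not grow with run time. */
static uint32_t counters[CTR_COUNT];

/* Record one pass through an instrumented point. */
void counter_hit(enum counter_id id)
{
    assert((unsigned)id < CTR_COUNT);
    counters[id]++;
}

/* Hypothetical routine instrumented on normal and error paths. */
int handle_request(unsigned length, int buffer_available)
{
    counter_hit(CTR_REQ_RECEIVED);
    if (length == 0 || length > 4096) {   /* assumed limit */
        counter_hit(CTR_ERR_BAD_LENGTH);
        return -1;
    }
    if (!buffer_available) {
        counter_hit(CTR_ERR_NO_BUFFER);
        return -1;
    }
    counter_hit(CTR_REQ_COMPLETED);
    return 0;
}

/* Print every counter, for example when a problem is reported. */
void dump_counters(void)
{
    for (int i = 0; i < CTR_COUNT; i++)
        printf("counter[%d] = %lu\n", i, (unsigned long)counters[i]);
}
```

      Comparing the normal-path counters with the error-path counters
shows how many requests arrived, how many completed, and which error
paths were taken, without any trace having been active.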

      The software counters are always active, and because each update
is a simple increment, they cause no noticeable performance
degradation to the system.  When an error occurs for the first time,
a substantial amount of data is available to determine the cause of
the error.  This greatly increases the odds of solving an error on
its first occurrence.  At worst, the system programmer has useful
information for deciding the next debugging step, without having to
run various traces for an indeterminate amount of time while waiting
for the problem to recur.

      The debug and first-failure data capture methods for most
products today rely on a trace facility.  A trace facility requires
a large memory area or file to store the trace data, and it also
affects the performance of the system.  On Local Area Network (LAN)
adapters, this becomes a problem because storage is limited and the
performance impact of running the trace is significant.

      A major problem on many systems has been the gathering of
useful information on the first occurrence of a failure.  This is
known as first failure data capture.  Most trace facilities are not
run during "normal" production periods because of the performance
impact.  Therefore, on the first occurrence of an error, no
information is available.

      A second problem is the wrapping of the trace table after a
failure occurs.  Problems are often intermittent, and the initial
failure is not easily detected at the system level.  By the time a
user realizes the problem has occurred and tries to save the trace
table, the trace table has wrapped and all relevant information about
the error is lost.  This is especially true on small systems such as
LAN servers and adapter microcode.
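
      The wrapping problem can be illustrated with a small sketch.  It
assumes a hypothetical four-slot circular trace table and a single
error counter, neither of which is taken from any real product.

```c
#include <assert.h>
#include <stdint.h>

#define TRACE_SLOTS 4   /* deliberately tiny, to show wrapping */

/* Hypothetical event types written to the trace table. */
enum event { EV_NONE = 0, EV_FRAME_OK, EV_FRAME_ERROR };

static enum event trace_table[TRACE_SLOTS];  /* circular trace table */
static unsigned trace_next;                  /* next slot to overwrite */
static uint32_t frame_error_count;           /* software counter */

/* Trace the event and, for errors, also bump the counter. */
void record_event(enum event ev)
{
    trace_table[trace_next] = ev;
    trace_next = (trace_next + 1) % TRACE_SLOTS;  /* wrap around */
    if (ev == EV_FRAME_ERROR)
        frame_error_count++;
}

/* Return 1 if the event is still present in the trace table. */
int trace_contains(enum event ev)
{
    for (unsigned i = 0; i < TRACE_SLOTS; i++)
        if (trace_table[i] == ev)
            return 1;
    return 0;
}
```

      If one EV_FRAME_ERROR is recorded and TRACE_SLOTS further
EV_FRAME_OK events follow, the error entry is overwritten and
trace_contains(EV_FRAME_ERROR) returns 0, while frame_error_count
still records that the error happened.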

      Software counter example: In the 3172, software counters were
used for debugging when tracing degraded performance so much that it
kept the error from occurring.  The specific function in the 3172
where these software counters were used was the transmission and
reception of frames on the Fiber Distributed Data Interface (FDDI)
media.

      In the transmit flow, four steps were involved.  The first step
was the notification by the HOST operating system that a frame was ready
to be sent on the fiber.  The second step was the moving of data from
HOST memory to the FDDI shared memory.  The third step was the moving
of the da...