Identification and Analysis of Sporadic System Errors

IP.com Disclosure Number: IPCOM000239341D
Publication Date: 2014-Oct-31
Document File: 3 page(s) / 56K

Publishing Venue

The IP.com Prior Art Database

Abstract

This article describes a technique for automating the identification and analysis of sporadic errors.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 53% of the total text.


By definition, a sporadic error has no pattern or order and may only repeat after many hours or days. To capture full diagnostics for such an error, tracing would have to be permanently enabled, which is impractical in a production system due to the performance impact of tracing and the resulting size of the logs.

    The solution introduces a Sporadic System Detection Algorithm (SSDA) to monitor a system and then predict the next occurrence of the sporadic error so that trace can be engaged for a short time. The key is to minimise the impact on the production system whilst maximising the amount of diagnostic data available to help resolve the recurring issue.

    Initially, the SSDA monitors the system output for errors. If a specific error is detected more than once, with a relatively large time gap between occurrences (the target is sporadic errors, not high-frequency issues that can be diagnosed easily with existing techniques), then the SSDA begins its analysis. The SSDA automatically enables progressively more metrics collection until a pattern is found and the occurrence of the error can be predicted. Once a prediction can be made, full trace can be engaged for a short period (a couple of minutes) to capture all the information whilst minimising the impact on the production system.
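
    The escalation path described above can be sketched as a small state machine. The stage names and the `next_stage` driver below are hypothetical, chosen only to illustrate how the SSDA might move from passive monitoring to full trace; the article does not prescribe an implementation.

```python
from enum import Enum, auto

class Stage(Enum):
    MONITOR = auto()     # Stage 1: watch error logs for a repeated error code
    TIME_CHECK = auto()  # Stage 2: look for a fixed-frequency pattern
    CORRELATE = auto()   # Stage 3: gather low-cost metrics, seek a predictor
    TRACE = auto()       # final step: full trace for a couple of minutes

def next_stage(stage, repeated, periodic, correlated):
    """Advance the SSDA escalation (hypothetical driver, a sketch only).

    repeated   -- the same error has been seen more than once
    periodic   -- the occurrences fall at a predictable frequency
    correlated -- a low-cost metric correlates with the error's lead-up
    """
    if stage is Stage.MONITOR and repeated:
        return Stage.TIME_CHECK
    if stage is Stage.TIME_CHECK:
        return Stage.TRACE if periodic else Stage.CORRELATE
    if stage is Stage.CORRELATE and correlated:
        return Stage.TRACE
    return stage  # otherwise stay put and keep observing
```

    Note that full trace is only ever reached via a prediction, which is what keeps the production impact bounded.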

    There are several stages to the SSDA. Stage 1) Continuously monitor error logs for error codes. No further analysis is undertaken unless a specific issue or error message repeats more than once.
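
    Stage 1 amounts to counting error codes as log lines arrive. A minimal sketch, assuming a hypothetical log format in which error codes look like `XYZ0042E` (the regex and threshold are illustrative, not part of the disclosure):

```python
import re
from collections import defaultdict

# Assumed error-code shape, e.g. "2014-10-31T01:05:12 ERROR XYZ0042E widget failed"
ERROR_CODE = re.compile(r"\b([A-Z]{3}\d{4}E)\b")

def repeated_errors(log_lines, threshold=2):
    """Return the set of error codes seen at least `threshold` times (Stage 1)."""
    counts = defaultdict(int)
    for line in log_lines:
        match = ERROR_CODE.search(line)
        if match:
            counts[match.group(1)] += 1
    return {code for code, n in counts.items() if n >= threshold}
```

    Any code returned by `repeated_errors` would then be handed to Stage 2 along with its occurrence timestamps.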

    Stage 2) Check the time of each occurrence. If it recurs at a predictable frequency (e.g. 1:05am every day, perhaps due to a virus scan), then turn on full trace the next day from 1:04 -> 1:06am.
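
    One simple way to implement the Stage 2 check is to test whether the gaps between occurrences are roughly constant and, if so, project the next occurrence. The tolerance value below is an assumption for illustration:

```python
from datetime import datetime, timedelta

def predict_next(timestamps, tolerance=timedelta(minutes=1)):
    """If occurrences are evenly spaced (within `tolerance`), predict the next
    occurrence time (Stage 2); otherwise return None and fall through to Stage 3."""
    if len(timestamps) < 3:
        return None  # too few samples to call it a pattern
    ts = sorted(timestamps)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    mean_gap = sum(gaps, timedelta()) / len(gaps)
    if all(abs(gap - mean_gap) <= tolerance for gap in gaps):
        return ts[-1] + mean_gap
    return None
```

    For the 1:05am example, the predicted time would drive scheduling of the short full-trace window either side of it.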

    Stage 3) If the timing is not predictable, then begin monitoring low-cost metrics: network, CPU, garbage collection, processes running on the system, etc. For performance reasons, only the last 3 minutes of data are kept in memory and saved to disk only when the sporadic error reoccurs. Cross-correlate the variables with the occurrence of the error to see if a correlation occurs, e.g. CPU stops briefly or starts rising rapidly in the lead-up to the problem, or a specific external process starts running. This may be enough to predict the next occurrence and dia...
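
    The 3-minute in-memory window in Stage 3 is naturally a bounded ring buffer: samples are appended continuously, anything older than the window is discarded, and the buffer is only snapshotted to disk when the error recurs. A minimal sketch, assuming samples arrive as (timestamp, metrics-dict) pairs; the class name and API are hypothetical:

```python
from collections import deque

class MetricBuffer:
    """Keep only the last `window` seconds of low-cost metric samples in
    memory (Stage 3); persist them only when the sporadic error recurs."""

    def __init__(self, window=180.0):  # 3 minutes, per the article
        self.window = window
        self.samples = deque()  # (timestamp_seconds, metrics_dict) pairs

    def record(self, ts, metrics):
        """Append a sample and evict anything older than the window."""
        self.samples.append((ts, metrics))
        while self.samples and ts - self.samples[0][0] > self.window:
            self.samples.popleft()

    def snapshot(self):
        """Called when the error recurs: return the lead-up data so it can be
        written to disk and cross-correlated with the error occurrence."""
        return list(self.samples)
```

    The saved snapshots from several occurrences are then the inputs to the cross-correlation step, looking for a metric whose behaviour consistently precedes the error.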