A System and Method for Dynamic Failure Categorization for Cloud Application Incidents

IP.com Disclosure Number: IPCOM000237356D
Publication Date: 2014-Jun-16
Document File: 2 page(s) / 75K

Publishing Venue

The IP.com Prior Art Database

Abstract

The complex failure modes of large software applications often make it very hard for system operators to diagnose and rectify problems effectively. Application incident diagnosis is challenging because the causes are complicated, arising from the shared environment, network, hardware, software, and dynamic changes in enterprise IT. Unlike traditional IT applications, cloud applications provide much better monitoring support. To address the problems of application incident diagnosis and resolution, we propose a novel approach to analyzing the root cause of an incident in the cloud environment. The proposed approach first captures the essential data for a failure snapshot, then conducts static failure categorization to estimate the candidate categories of the failure, and finally uses dynamic methods to identify the category of the failure.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 52% of the total text.

The core idea of the proposed approach comprises three major components:
(1) Capture essential data for failure snapshot
Cloud applications provide much better monitoring support than traditional IT applications. We leverage this advantage to capture the following data to facilitate further analysis: 1) Failure-causing event history of the application (Event data for short); 2) Failure-causing application data (Context data for short); and 3) Configuration of the application (Configuration data for short). Probes are deployed in the cloud environment to continuously monitor the running application and cache recent data. Once an incident is reported, the recently cached data is stored as a failure snapshot.
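
The caching behavior described above can be illustrated with a minimal sketch. The `Probe` class, its method names, and the sample event names are hypothetical and are not taken from the disclosure; the sketch only assumes that a probe keeps a bounded window of recent events plus the current context and configuration data, and copies them when an incident is reported.

```python
from collections import deque
from typing import Any, Dict


class Probe:
    """Hypothetical probe that caches recent monitoring data in memory."""

    def __init__(self, max_events: int = 100) -> None:
        # Bounded buffer: once full, the oldest event is dropped automatically.
        self.events: deque = deque(maxlen=max_events)
        self.context: Dict[str, Any] = {}
        self.configuration: Dict[str, Any] = {}

    def record_event(self, event: str) -> None:
        self.events.append(event)

    def snapshot(self) -> Dict[str, Any]:
        """Copy the cached data so later monitoring cannot alter the snapshot."""
        return {
            "events": list(self.events),
            "context": dict(self.context),
            "configuration": dict(self.configuration),
        }


probe = Probe(max_events=3)
probe.configuration["pool_size"] = 8
probe.context["request_id"] = "r-17"
for name in ["login", "query", "timeout", "retry"]:
    probe.record_event(name)

failure_snapshot = probe.snapshot()  # taken when the incident is reported
```

With a window of three events, the oldest event (`login`) has already been evicted by the time the snapshot is taken, so only the most recent history is stored.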


(2) Static failure categorization
The static failure categorization compares the features of the new incident with the features of existing incidents. This is done by extracting and matching three kinds of patterns: event sequence patterns, context data patterns, and configuration patterns. This step checks whether the incident is a recurring one. If so, the resolution knowledge from the previously resolved incident can be used to resolve the new incident. If the incident is not a recurring one, dynamic evaluation is used to find the root cause and resolution method.
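
One possible way to sketch this comparison is shown below. The disclosure does not specify the matching algorithm, so everything here is an assumption: the snapshot layout (a dict with `events`, `context`, and `configuration` keys), the use of `difflib.SequenceMatcher` for event sequences, the key-overlap score for context and configuration data, the equal weighting, and the 0.8 threshold.

```python
from difflib import SequenceMatcher
from typing import Any, Dict, List, Optional


def dict_similarity(a: Dict[str, Any], b: Dict[str, Any]) -> float:
    """Fraction of keys (from either dict) that carry the same value in both."""
    keys = set(a) | set(b)
    if not keys:
        return 1.0
    same = sum(1 for k in keys if k in a and k in b and a[k] == b[k])
    return same / len(keys)


def snapshot_similarity(new: Dict[str, Any], old: Dict[str, Any]) -> float:
    """Average the three pattern scores (equal weights are an assumption)."""
    event_score = SequenceMatcher(None, new["events"], old["events"]).ratio()
    context_score = dict_similarity(new["context"], old["context"])
    config_score = dict_similarity(new["configuration"], old["configuration"])
    return (event_score + context_score + config_score) / 3


def find_recurring(
    new: Dict[str, Any],
    resolved: List[Dict[str, Any]],
    threshold: float = 0.8,  # assumed cut-off, not from the disclosure
) -> Optional[Dict[str, Any]]:
    """Return the best-matching resolved incident, or None if none is close."""
    best, best_score = None, 0.0
    for incident in resolved:
        score = snapshot_similarity(new, incident["snapshot"])
        if score >= threshold and score > best_score:
            best, best_score = incident, score
    return best


known = {
    "resolution": "increase connection pool size",
    "snapshot": {
        "events": ["query", "timeout", "retry"],
        "context": {"service": "orders"},
        "configuration": {"pool_size": 8},
    },
}
same_failure = dict(known["snapshot"])
other_failure = {
    "events": ["upload", "disk_full"],
    "context": {"service": "media"},
    "configuration": {"pool_size": 8},
}

match = find_recurring(same_failure, [known])
no_match = find_recurring(other_failure, [known])
```

Here `match` is the known incident, so its stored resolution can be reused, while `no_match` is `None`, which would hand the incident over to dynamic evaluation.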


(3) Dynamically identify the failure category
This step aims to identify the failure category by simulating the environment and rerunning the test data to check whether the same failure occurs. We first identify the candidate list of failure snapshots and then prepare a VM from each failure snapshot. After the VM is configured and started, we inject the event sequence and observe the system behavior, and finally cross-check the monitored system behavior against the target system behavior. Based on the similarity of m...
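
The replay-and-cross-check loop can be sketched as follows. Because the source text is truncated at this point, the similarity measure and threshold are assumptions, not the authors' method. VM provisioning is abstracted behind a hypothetical `replay` callable that takes a failure snapshot and an event sequence and returns the observed behavior as a list of strings; `fake_replay` below is a stand-in for illustration only.

```python
from difflib import SequenceMatcher
from typing import Any, Callable, Dict, List, Optional

Replay = Callable[[Dict[str, Any], List[str]], List[str]]


def identify_category(
    candidates: List[Dict[str, Any]],
    event_sequence: List[str],
    target_behavior: List[str],
    replay: Replay,
    threshold: float = 0.8,  # assumed cut-off, not from the disclosure
) -> Optional[str]:
    """Replay the events against each candidate and keep the closest match."""
    best_category, best_score = None, 0.0
    for candidate in candidates:
        # In a real system this would start a VM prepared from the snapshot.
        observed = replay(candidate["snapshot"], event_sequence)
        score = SequenceMatcher(None, observed, target_behavior).ratio()
        if score >= threshold and score > best_score:
            best_category, best_score = candidate["category"], score
    return best_category


def fake_replay(snapshot: Dict[str, Any], events: List[str]) -> List[str]:
    """Stand-in for a VM run: a pool of size 1 times out, anything else works."""
    if snapshot["configuration"].get("pool_size") == 1:
        return ["start", "timeout"]
    return ["start", "ok"]


candidates = [
    {"category": "resource exhaustion",
     "snapshot": {"configuration": {"pool_size": 1}}},
    {"category": "network",
     "snapshot": {"configuration": {"pool_size": 8}}},
]

category = identify_category(
    candidates,
    event_sequence=["query", "query"],
    target_behavior=["start", "timeout"],
    replay=fake_replay,
)
```

In this toy run only the first candidate reproduces the target behavior, so `category` is `"resource exhaustion"`. If no candidate reaches the threshold, the function returns `None`.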