System and method for monitoring IT operation process

IP.com Disclosure Number: IPCOM000215217D
Publication Date: 2012-Feb-22
Document File: 3 page(s) / 62K

Publishing Venue

The IP.com Prior Art Database

Abstract

This disclosure describes using XML or another format to specify the expected actions, resource usage, and results of an IT operation process. A reference execution sample is defined through testing or exercises. In the production environment, a monitoring engine monitors the operation process, the resulting status, and the performance metrics of the system, and compares them against the template and the execution sample. It detects differences between the actual execution and the reference values, generates alerts, and delivers them to the administrator.
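The comparison described above can be illustrated with a minimal Python sketch. It is not the disclosed implementation: the XML element and attribute names, the metric names (duration_s, io_mb_s), the reference values, and the 50% tolerance are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical template; element and attribute names are illustrative only.
TEMPLATE_XML = """
<operation name="mirror-resync">
  <step id="1" action="detect-new-disk" expected_result="ok"/>
  <step id="2" action="synchronize" expected_result="ok"/>
</operation>
"""

# Hypothetical reference execution sample, as might be recorded in a test run.
REFERENCE = {
    "1": {"duration_s": 5.0, "io_mb_s": 1.0},
    "2": {"duration_s": 600.0, "io_mb_s": 40.0},
}


def load_template(xml_text):
    """Return a list of (step id, action, expected result) from the template."""
    root = ET.fromstring(xml_text.strip())
    return [
        (step.get("id"), step.get("action"), step.get("expected_result"))
        for step in root.findall("step")
    ]


def check_execution(template, reference, observed, tolerance=0.5):
    """Compare an observed run with the template and the reference sample.

    An alert is produced when a step is missing, when its result differs
    from the expected result, or when a metric deviates from the reference
    value by more than `tolerance` (as a fraction of the reference value).
    """
    alerts = []
    for step_id, action, expected_result in template:
        obs = observed.get(step_id)
        if obs is None:
            alerts.append(f"step {step_id} ({action}): not executed")
            continue
        if obs["result"] != expected_result:
            alerts.append(
                f"step {step_id} ({action}): result {obs['result']!r}, "
                f"expected {expected_result!r}"
            )
        for metric, ref_value in reference.get(step_id, {}).items():
            value = obs["metrics"].get(metric)
            if value is None:
                alerts.append(f"step {step_id} ({action}): {metric} not reported")
            elif abs(value - ref_value) > tolerance * ref_value:
                alerts.append(
                    f"step {step_id} ({action}): {metric}={value} "
                    f"deviates from reference {ref_value}"
                )
    return alerts


# Hypothetical production observation: step 2 uses far more I/O than the sample.
observed_run = {
    "1": {"result": "ok", "metrics": {"duration_s": 5.5, "io_mb_s": 1.0}},
    "2": {"result": "ok", "metrics": {"duration_s": 620.0, "io_mb_s": 120.0}},
}

alerts = check_execution(load_template(TEMPLATE_XML), REFERENCE, observed_run)
for alert in alerts:
    print(alert)
```

In this sketch, only the I/O rate of the synchronization step exceeds the assumed tolerance, so a single alert would be handed to the administrator. A real monitoring engine would collect the observed values from the running system rather than from a hard-coded dictionary.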

This is the abbreviated version, containing approximately 52% of the total text.

IT systems often encounter issues that require recovery. For example, when a machine breaks down, a standby machine needs to start and take over the workload. If the system is a database, the operations need to be redone on the backup machine and all connections need to be switched from the first machine to the second machine. Some daily maintenance operations, such as data backup and virus checking, have similar characteristics.

An operation process usually includes multiple steps that span a period of time, and it can run into issues that make the operation fail. For example, a mirrored disk may suddenly break; requests then need to go to the other disk, and when a new disk is inserted, a synchronization between the mirrored disks is started. The synchronization may consume I/O and block access to the original disk, which makes the application unusable to customers.

As systems become more critical and complex and use more community components, more complex operation processes take place in them and can cause major problems, such as the Amazon service outage. The Amazon incident was mainly caused by the I/O operations used for data replication, which jammed other operations while duplicate data was being re-created after a disk broke.

A proper functional recovery monitoring method could reduce the major problems caused by failures in the recovery process. In the recovery process, most situations and actions are new to the system, the administrator, and the users. P...