System and method for monitoring IT operation process
Disclosure Number: IPCOM000215217D
Publication Date: 2012-Feb-22
Document File: 3 page(s) / 61K

This disclosure describes using XML or another format to describe the expected actions, resource usage, and results of an IT operation process. A reference execution sample is defined through testing or exercises. In the production environment, a monitoring engine monitors the operation process, the resulting status, and the performance metrics in the system, and compares them against the template and the execution sample. It detects differences between the actual execution and the reference values, generates alerts, and delivers them to the admin. The disclosure helps the admin make sure the recovery process goes on as planned, and alerts the admin to take extra actions when the operation process does not go as planned. It helps to avoid further disasters during error recovery, and it helps admins and users stay aware of the situation and be better prepared for new situations.
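The comparison described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the disclosure's implementation: the XML schema (the `operation`, `step`, and `metric` elements and their attributes), the relative `tolerance` rule, the sample values, and the function names `load_template` and `compare` are all assumptions introduced here.

```python
import xml.etree.ElementTree as ET

# Hypothetical template: element names, attribute names, and values are
# illustrative assumptions, not a schema defined by the disclosure.
TEMPLATE_XML = """
<operation name="disk-resync">
  <step id="1" action="start-sync" expected_result="running">
    <metric name="io_mb_per_s" reference="80" tolerance="0.25"/>
  </step>
  <step id="2" action="verify-mirror" expected_result="ok">
    <metric name="duration_s" reference="600" tolerance="0.5"/>
  </step>
</operation>
"""


def load_template(xml_text):
    """Parse the template into a list of step dictionaries."""
    root = ET.fromstring(xml_text)
    steps = []
    for step in root.findall("step"):
        # Each metric maps to (reference value, allowed relative deviation).
        metrics = {
            m.get("name"): (float(m.get("reference")), float(m.get("tolerance")))
            for m in step.findall("metric")
        }
        steps.append({
            "id": step.get("id"),
            "action": step.get("action"),
            "expected_result": step.get("expected_result"),
            "metrics": metrics,
        })
    return steps


def compare(steps, observed):
    """Return one alert string per deviation from the template."""
    alerts = []
    for step in steps:
        label = f"step {step['id']} ({step['action']})"
        seen = observed.get(step["id"])
        if seen is None:
            alerts.append(f"{label}: not observed")
            continue
        if seen["result"] != step["expected_result"]:
            alerts.append(
                f"{label}: result {seen['result']!r}, "
                f"expected {step['expected_result']!r}"
            )
        for name, (ref, tol) in step["metrics"].items():
            value = seen["metrics"].get(name)
            if value is None:
                alerts.append(f"{label}: metric {name} missing")
            elif abs(value - ref) > tol * ref:
                alerts.append(f"{label}: {name}={value} outside {ref} +/- {tol:.0%}")
    return alerts


# Observations a monitoring engine might collect in production (made-up values).
observed = {
    "1": {"result": "running", "metrics": {"io_mb_per_s": 140.0}},
    "2": {"result": "ok", "metrics": {"duration_s": 650.0}},
}

steps = load_template(TEMPLATE_XML)
alerts = compare(steps, observed)
for alert in alerts:
    print(alert)
```

With these sample values, only the I/O rate of step 1 exceeds its allowed deviation from the reference, so a single alert would be raised for the admin. A real engine would gather the observed values from the running system rather than from a hard-coded dictionary.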

IT systems often encounter issues that require recovery. For example, when a machine breaks, a standby machine needs to start and take over the workload. If the system is a database, the operations need to be redone on the backup machine and all connections redirected from the first machine to the second. Some daily maintenance operations, such as data backup and virus checking, have similar characteristics.

The operation process usually includes multiple steps that span a period of time, and it can run into issues that make the operation fail. For example, a mirrored disk may suddenly break; requests then need to go to the other disk, and when a new disk is inserted, a synchronization between the mirrored disks is started. The synchronization operation may consume I/O and block access to the original disk, which makes the application unusable to customers.

As systems become more critical and complex and use more community components, more complex operation processes take place in them and can cause major problems, such as the Amazon service outage. The Amazon incident was mainly caused by I/O operations for data replication jamming other operations while duplicate data was being re-created after a disk broke.

A proper functional recovery monitoring method could reduce the big issues caused by problems in the recovery process. In the recovery process, most situations and actions are new to the system, the admin, and the users. P...