Input/Output Event Analysis

IP.com Disclosure Number: IPCOM000118812D
Original Publication Date: 1997-Jul-01
Included in the Prior Art Database: 2005-Apr-01

Publishing Venue

IBM

Related People

Cox, MC: AUTHOR [+7]

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 15% of the total text.

      When an Input/Output (I/O) event occurs in a large computer
installation, such as a serious failure of an I/O resource (channel,
cable, switch, control unit, or device), there are usually so many
effects of the event that it is difficult to determine the root
cause.  The effects typically appear in a burst soon after the actual
event occurs, so that there is a flood of messages, secondary
failures of critical systems or applications, then more messages from
those secondary failures.  The actual indicators that most closely
identify the root cause of the event might or might not appear first
or early in the burst due to the asynchronous nature of modern
parallelized, multiprocessing, multiprogramming, networked systems.

      After an event, the computer support staff is usually busy
trying to determine its root cause and the appropriate repair actions
or circumventions.  Because of the "bursty" nature of the event
indicators, the staff often spends a large amount of time sifting
through messages, logs, and current status.  Worse, the staff will
often mistakenly treat secondary symptoms of the event, such as
application failures, rather than the main cause, because the
secondary indicators are more numerous or more severe, the staff is
more familiar with the applications, and the needs of the business
require the staff to focus on the applications.

      The operations staff needs to identify the cause of the I/O
event quickly, to assess its effect, and either to take remedial
action or to circumvent the problem.

      The solution is to create an agent on each processor in the
network that automatically collects as many of the event indicators
as possible within a short period immediately after the event,
analyzes them as a group, and identifies the I/O resource that is the
cause of the event and the actions the operations staff should take.
The results are presented on a dynamically updated graphical
depiction of the affected portion of the I/O configuration, where the
operations staff can more quickly comprehend the relationships of the
I/O resources and the effect of the event, and from which they can
issue commands to make changes and determine current status.
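
      The collect-then-analyze idea can be illustrated with a small
Python sketch.  The abbreviated text does not say how the group
analysis is performed, so the heuristic used here (choose the lowest
resource shared by every implicated I/O path) is an assumption made
for illustration only.  The resource names, the PARENT table, and all
function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Indicator:
    timestamp: float  # seconds, on any common clock
    resource: str     # I/O resource the indicator names
    text: str


# Hypothetical configuration: each resource maps to the resource it
# depends on (device -> control unit -> switch -> channel).
PARENT = {
    "DEV0A1": "CU10",
    "DEV0A2": "CU10",
    "DEV0B1": "CU20",
    "CU10": "SW1",
    "CU20": "SW1",
    "SW1": "CHP3",
    "CHP3": None,
}


def path_to_top(resource):
    """Return the resource followed by everything it depends on."""
    path = []
    while resource is not None:
        path.append(resource)
        resource = PARENT.get(resource)
    return path


def collect_burst(indicators, window_seconds):
    """Keep only indicators within window_seconds of the earliest one."""
    if not indicators:
        return []
    start = min(i.timestamp for i in indicators)
    return [i for i in indicators if i.timestamp - start <= window_seconds]


def likely_root_cause(burst):
    """Assumed heuristic: lowest resource common to all implicated paths."""
    implicated = {i.resource for i in burst}
    if not implicated:
        return None
    paths = [path_to_top(r) for r in implicated]
    common = set(paths[0]).intersection(*paths[1:])
    # Walk one path from the device upward; the first shared resource
    # is the one closest to the affected devices.
    for resource in paths[0]:
        if resource in common:
            return resource
    return None


indicators = [
    Indicator(0.0, "DEV0A1", "device not responding"),
    Indicator(1.5, "DEV0A2", "device not responding"),
    Indicator(400.0, "DEV0B1", "unrelated later message"),
]
burst = collect_burst(indicators, window_seconds=30)
print(likely_root_cause(burst))
```

      With a 30-second window the late indicator is excluded, and
the two remaining devices share control unit CU10, so that is the
resource the sketch selects.  A real agent would also have to weigh
indicator severity and current status, which this sketch ignores.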

I/O Event Indicators

      The major indicators of the event are messages from the I/O
components of the operating system and major subsystems.  Other
indicators are Network Alerts (e.g., Generic Alerts), current I/O
resource status, historical status, and information gathered or
gleaned from associated files or databases containing I/O
configuration information.
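
      One way to hold these different indicator types in a common
form, so that they can later be analyzed as a group, is sketched
below in Python.  The record layout, field names, and sample values
are assumptions for illustration; the disclosure does not define a
record format.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class IndicatorKind(Enum):
    MESSAGE = "message"                      # OS I/O component or subsystem
    NETWORK_ALERT = "network alert"          # e.g., a Generic Alert
    CURRENT_STATUS = "current status"
    HISTORICAL_STATUS = "historical status"
    CONFIGURATION = "configuration"          # from files or databases


@dataclass(frozen=True)
class EventIndicator:
    kind: IndicatorKind
    resource: str  # hypothetical I/O resource identifier
    detail: str


def group_by_resource(indicators):
    """Group indicators of every kind under the resource they name."""
    grouped = defaultdict(list)
    for indicator in indicators:
        grouped[indicator.resource].append(indicator)
    return dict(grouped)


sample = [
    EventIndicator(IndicatorKind.MESSAGE, "CU10", "path not operational"),
    EventIndicator(IndicatorKind.NETWORK_ALERT, "CU10", "link failure"),
    EventIndicator(IndicatorKind.CURRENT_STATUS, "DEV0A1", "offline"),
]
by_resource = group_by_resource(sample)
for resource, items in by_resource.items():
    print(resource, [item.kind.value for item in items])
```

      Grouping by resource rather than by source lets a message and
a network alert about the same control unit be considered together.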

Collection of I/O Event Indicators

      In a computer installation running MVS, for example, the
collection of the I/O event indicators is performed using automation
software such as NetView, software Application Programming Interfaces
(APIs) such as that for ESCON Manager, and the hardware interfaces
available t...