Prevention of Unnecessary Disruptive Sysplex Recovery Actions

IP.com Disclosure Number: IPCOM000104969D
Original Publication Date: 1993-Jun-01
Included in the Prior Art Database: 2005-Mar-19
Document File: 2 page(s) / 82K

Publishing Venue

IBM

Related People

Enichen, MC: AUTHOR [+3]

Abstract

Disclosed is a method for preventing unnecessary disruptive recovery actions in the computer system complex (sysplex).

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 53% of the total text.

      When multiple computer systems are connected in a sysplex,
each system is responsible for regularly signaling the other systems,
or recording in a central file, that it is operating correctly.
When one system notices the absence of this signal ("heartbeat"), it
may choose to perform a recovery action on behalf of the system whose
signal is missing.
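
      The heartbeat check described above can be sketched as
follows.  This is only an illustration, not the disclosed
implementation: the StatusFile class, its method names, and the
timing values are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class StatusFile:
    """Hypothetical stand-in for the central file of system status."""

    last_seen: dict[str, float] = field(default_factory=dict)

    def record_heartbeat(self, system: str, now: float) -> None:
        # Each system calls this regularly to show it is operating correctly.
        self.last_seen[system] = now

    def missing_heartbeats(self, observer: str, now: float, tolerance: float) -> list[str]:
        # Systems other than the observer whose last heartbeat is older
        # than the tolerance (an assumed value, in seconds).
        return sorted(
            name
            for name, seen in self.last_seen.items()
            if name != observer and now - seen > tolerance
        )


status = StatusFile()
status.record_heartbeat("SYS1", now=0.0)
status.record_heartbeat("SYS2", now=0.0)
status.record_heartbeat("SYS2", now=30.0)  # SYS2 keeps signaling; SYS1 goes quiet

overdue = status.missing_heartbeats("SYS2", now=60.0, tolerance=45.0)
print(overdue)  # SYS2 may now choose to act on behalf of these systems
```

      In this sketch the observer sees only a timestamp per system,
so it cannot tell why a heartbeat stopped.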

      When an operating system goes into prolonged recovery, it may
be unable to issue the sysplex heartbeat for longer than the sysplex
tolerance allows.  The other systems then begin sysplex recovery,
which results in a restart and/or re-IPL of the system that is trying
to recover or, at the very least, in that system being partitioned
out of the sysplex.
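
      The timing problem can be made concrete with a small worked
example.  The function name and the numbers below are assumptions
chosen for illustration; the disclosure does not give actual
tolerance values.

```python
def recovery_outcome(last_heartbeat: float, recovery_end: float, tolerance: float) -> str:
    """Compare the end of local recovery with the peers' heartbeat deadline.

    All arguments are times in seconds on a shared clock (assumed values).
    """
    deadline = last_heartbeat + tolerance  # peers act once this time has passed
    if recovery_end <= deadline:
        return "heartbeat resumes in time"
    return "sysplex recovery starts before local recovery finishes"


# Last heartbeat at t=100 s, tolerance 85 s: the peers' deadline is t=185 s.
short = recovery_outcome(last_heartbeat=100.0, recovery_end=160.0, tolerance=85.0)
long = recovery_outcome(last_heartbeat=100.0, recovery_end=220.0, tolerance=85.0)
print(short)
print(long)
```

      In the second case the system would have resumed normal
operation 35 seconds after the deadline, yet it is still subjected to
the disruptive action.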

      Waiting for an operating system to recover may be faster and
less disruptive than re-IPLing it.  Unfortunately, other sysplex
members do not know the status of the recovering system because the
recovering system cannot update the Couple Data Set (CDS, the central
data set where system status is maintained).  Thus the other sysplex
members have to assume the worst.  The recovering system knows its
state and the expected recovery duration, but cannot communicate this
information to other sysplex members.

      Consider a sysplex configuration in which some (not necessarily
all) sysplex systems have their service processors (SVPs) connected
via a networking mechanism such as a LAN.  (The systems with SVPs so
connected are hereafter referred to as a "cluster.")  Suppose that a
prolonged recovery situation occurs in the operating system running
on one such system (e.g., SYS1).  The recovering operating system
does the following:

1.  Issues an order to its service processor (SVP1) to reject any
    orders for disruptive actions (such as IPL or System Reset) which
    originate from...