Recovery of a 3990 Storage Path Hang Timeout Check-1 Without Fencing

IP.com Disclosure Number: IPCOM000121459D
Original Publication Date: 1991-Sep-01
Included in the Prior Art Database: 2005-Apr-03
Document File: 3 page(s) / 121K

Publishing Venue

IBM

Related People

Jesionowski, LG: AUTHOR [+2]

Abstract

The slow recovery of 3990 Storage Path Hang Timeout check-1s requires the Storage Path to be fenced from all channels. The fence is necessary as an indication to other Storage Paths that a Storage Path is currently unavailable. The fence can result in the system varying Storage Paths offline and/or boxing devices.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 52% of the total text.

Recovery of a 3990 Storage Path Hang Timeout Check-1 Without Fencing

      The slow recovery of 3990 Storage Path Hang Timeout
check-1s requires the Storage Path to be fenced from all channels.
The fence is necessary as an indication to other Storage Paths that a
Storage Path is currently unavailable. The fence can result in the
system varying Storage Paths offline and/or boxing devices.

      The 3990 Storage Path (SP) check-1 recovery is performed by the
Support Facility (SF). Recovery can take anywhere from 15 to over 60
seconds due to recovery requirements (machine reset path, SP Basic
Assurance Tests, SP/SF communication diagnostics, and sometimes an
IML) and poor SF performance (most of the SF code resides in
diskette-resident overlays due to a limited amount of SF memory).
While an SP is in check-1 recovery, other SPs may attempt to access
data in the Storage Control Array (SCA) for which the recovering SP
owns the lock.

      If the non-recovering SPs were to wait on a recovering SP to
free an SCA lock, the non-recovering SPs would, in turn, take a Path
Hang Timeout check-1 in as little as 0.5 seconds. Thus, an indication
was necessary for the non-recovering SPs to recognize that an SP was
in recovery.
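
      The timing argument above can be checked with a few lines of
arithmetic. The sketch below is illustrative only: the figures (0.5
seconds, 15 to over 60 seconds) come from the text, while the
function and constant names are assumptions made for this example.

```python
# Figures quoted in the text; all names here are hypothetical.
HANG_TIMEOUT_S = 0.5        # a waiting SP can hang in as little as this
RECOVERY_MIN_S = 15.0       # shortest SF-driven check-1 recovery
RECOVERY_LONG_S = 60.0      # recovery can exceed this


def waiter_times_out(recovery_s: float,
                     hang_timeout_s: float = HANG_TIMEOUT_S) -> bool:
    """True if an SP waiting on the SCA lock would hang before recovery ends."""
    return recovery_s >= hang_timeout_s


def timeout_periods(recovery_s: float,
                    hang_timeout_s: float = HANG_TIMEOUT_S) -> float:
    """How many hang-timeout intervals fit inside one recovery."""
    return recovery_s / hang_timeout_s


if __name__ == "__main__":
    for recovery in (RECOVERY_MIN_S, RECOVERY_LONG_S):
        print(recovery, waiter_times_out(recovery), timeout_periods(recovery))
```

      Even the shortest quoted recovery spans 30 hang-timeout
intervals, so a waiting SP cannot simply sit on the lock.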

      The SP fenced indication was chosen due to hardware
limitations.  Thus, an SP is fenced during all check-1 recoveries and
all SPs look for the fence bits as an indication that another SP is
"unavailable".
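
      As a rough model of the fence-bit check described above, the
following sketch has each SP consult a shared set of fence bits
before waiting on an SCA lock. The register layout, the class and
function names, and the returned labels are all assumptions; the
text does not specify how the bits are stored or what an SP does
after finding the lock owner unavailable.

```python
from typing import Optional


class FenceBits:
    """Hypothetical shared fence bits, one bit per Storage Path."""

    def __init__(self) -> None:
        self.bits = 0

    def fence(self, sp: int) -> None:
        self.bits |= 1 << sp

    def unfence(self, sp: int) -> None:
        self.bits &= ~(1 << sp)

    def is_fenced(self, sp: int) -> bool:
        return bool(self.bits & (1 << sp))


def sca_access_decision(requester: int,
                        lock_owner: Optional[int],
                        fences: FenceBits) -> str:
    """Decide what a non-recovering SP does when it needs SCA data."""
    if lock_owner is None or lock_owner == requester:
        return "proceed"            # lock is free or already ours
    if fences.is_fenced(lock_owner):
        return "owner_unavailable"  # owner is in recovery: do not wait
    return "wait"                   # owner is healthy: a short wait is safe


if __name__ == "__main__":
    fences = FenceBits()
    fences.fence(1)                 # SP 1 enters check-1 recovery
    print(sca_access_decision(0, 1, fences))
```

      The point of the check is only that the waiting SP avoids
blocking on a lock whose owner is in recovery.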

      To summarize, it is necessary to fence an SP that is recovering
from a check-1 due to the relatively long amount of time that the
recovery takes.

      This fencing can cause problems. The first problem is that the
channel now sees a Condition Code 3 (CC3) during SP recovery rather
than Control Unit Busy. Due to time-out differences, this can result
in a system varying an SP offline that later recovers successfully.

      Fencing is also a problem in the case of channel failures. The
SP microcode is designed to hang on certain channel failures. The SP
is recovered like any other Path Hang Timeout check-1. However,
system processor hardware failures can cause channels to fail on
multiple Storage Paths, and, thus, it is possible for all Storage
Paths to present CC3 to all channels. Multiple paths presenting CC3
can result in devices being boxed from the system.
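
      The exposure described above can be expressed as a simple
condition: a device is at risk once every Storage Path presents CC3
at the same time. The sketch below models only that condition. The
status names and the function are assumptions, and the host's actual
decision to box a device is not described in this text.

```python
from enum import Enum
from typing import Iterable


class PathStatus(Enum):
    """Hypothetical view of what a channel sees on one Storage Path."""
    AVAILABLE = "available"
    CU_BUSY = "control unit busy"
    CC3 = "condition code 3"


def all_paths_cc3(statuses: Iterable[PathStatus]) -> bool:
    """True when at least one path exists and every path presents CC3."""
    seen = list(statuses)
    return bool(seen) and all(s is PathStatus.CC3 for s in seen)


if __name__ == "__main__":
    # Channel failures have hung every SP, and each hung SP is fenced.
    print(all_paths_cc3([PathStatus.CC3] * 4))
    # One path still answers Control Unit Busy instead of CC3.
    print(all_paths_cc3([PathStatus.CC3, PathStatus.CU_BUSY]))
```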

      If the check-1 recovery can be accomplished in less than 0.5
seconds, the fencing of the SP would not be necessary. This can be
accomplished using a 'primitive' protocol in order to recover an SP
which takes a check-1 due to Path Hang Timeout. A primitive protocol
is allowed in this case because a Path Hang Timeout does not normally
indicate a control unit hardware problem and, thus, does not have the
same recovery requirements as true control unit hardware check-1s.
The following steps are taken in this protocol:
    (Note: Register bit numbering convention is industry standard.)
 1.  An SP takes a Path Hang Timeout ch...