
Time-Out/Voting Algorithm for Resource Contention in N-Way Multiprocessing Systems

IP.com Disclosure Number: IPCOM000046482D
Original Publication Date: 1983-Jul-01
Included in the Prior Art Database: 2005-Feb-07
Document File: 4 page(s) / 21K

Publishing Venue

IBM

Related People

Daly, JC: AUTHOR [+2]

Abstract

A significant MP-related problem that a System Control Program (SCP) has to handle in its two-way multiprocessing (MP) support is failure to obtain a cross-processor (global) resource. Protocols exist to prevent programs on different CPUs in an MP system from simultaneously using or updating critical system-data fields. Such a protocol assigns a resource (or lock) to each such data field, and any program wanting to use that data should obtain the related resource first. If a program on one CPU (e.g., CPUA) cannot obtain a resource, it will usually wait for the resource to become available.

This is the abbreviated version, containing approximately 44% of the total text.


A significant MP-related problem that a System Control Program (SCP) has to handle in its two-way multiprocessing (MP) support is failure to obtain a cross-processor (global) resource. Protocols exist to prevent programs on different CPUs in an MP system from simultaneously using or updating critical system-data fields. Such a protocol assigns a resource (or lock) to each such data field, and any program wanting to use that data should obtain the related resource first. If a program on one CPU (e.g., CPUA) cannot obtain a resource, it will usually wait for the resource to become available. The problem occurs when either the CPU that holds the resource (in this case, CPUB) has a software or hardware problem that prevents it from releasing the resource, or the resource itself has been damaged (overlaid) so that it appears (to CPUA) to be held (by some other CPU) when it is not.
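The wait-and-retry behavior described above can be sketched as follows. This is a minimal illustration, not the SCP's actual mechanism: the function name `acquire_with_timeout`, the parameters `retry_interval` and `time_limit`, and the use of a Python `threading.Lock` to stand in for a global resource are all assumptions made for the example.

```python
import threading
import time


def acquire_with_timeout(lock, retry_interval=0.01, time_limit=0.1):
    """Retry a non-blocking acquire until time_limit (seconds) elapses.

    Returns True if the resource was obtained, False if the time limit
    expired first. All names and default values here are hypothetical.
    """
    deadline = time.monotonic() + time_limit
    while True:
        # Try to obtain the resource without blocking.
        if lock.acquire(blocking=False):
            return True
        # Give up once the time limit has been reached.
        if time.monotonic() >= deadline:
            return False
        # Wait a little before the next retry.
        time.sleep(retry_interval)
```

A lock that is never released models both failure cases in the text: from the waiter's point of view, a holder that cannot release the resource and a damaged resource that merely appears held look the same, since every retry fails until the time limit expires.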

In present two-way systems, the initiating program (on CPUA) waits and retries its request for some period of time and, if it is still unsuccessful, calls a notification routine to issue a message to the system operator. This message identifies the event that is being awaited and asks the operator to specify whether the SCP should continue to wait for (and retry) the event or should initiate processing on CPUA to take CPUB offline. (CPUs are taken offline via Alternate CPU Recovery (ACR).) Removal of CPUB would presumably resolve the problem that was originally delaying the program on CPUA. Unilateral removal of the "other" processor is acceptable because when one CPU (e.g., CPUA) cannot get a resource or response within its time limit, there is only one other CPU (i.e., CPUB) that can be the source of the problem. When the operator reply specifies that a CPU should be taken offline, CPUB is the only possible target.
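The two-way decision described above is simple enough to state in a few lines. The sketch below is illustrative only: the function name `handle_timeout_two_way` and the reply strings "WAIT" and "OFFLINE" are hypothetical and do not reflect the actual operator message format.

```python
def handle_timeout_two_way(waiting_cpu, cpus, operator_reply):
    """Return the CPU to take offline, or None to keep waiting.

    Assumes exactly two CPUs; the reply values are hypothetical.
    """
    if len(cpus) != 2 or waiting_cpu not in cpus:
        raise ValueError("two-way logic needs exactly two CPUs")
    if operator_reply == "WAIT":
        # Operator chose to continue waiting for (and retrying) the event.
        return None
    if operator_reply == "OFFLINE":
        # With two CPUs, the only possible target is the other one.
        (other,) = [cpu for cpu in cpus if cpu != waiting_cpu]
        return other
    raise ValueError("unknown operator reply: " + repr(operator_reply))
```

The target is unambiguous only because there are exactly two CPUs; with more than two, the function would have no sound basis for choosing one.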

In N-way MP systems, the problem of waiting for resources is more complex. Consider a 4-way MP system with CPUs A, B, C and D. Even if CPUA must wait an excessive length of time for a resource held on CPUB, it is not certain that CPUB is the real source of the problem. Either of the following could have occurred:

1. CPUB cannot release the resource needed by CPUA because it (CPUB) is waiting for another resource held by CPUC. Since CPUA called the notification routine for its dependency on CPUB first, the dependency of B on C may not be made known to the operator (because the notification routine cannot be executing on more than one CPU at a time). In this case, taking B offline in response to the operator's reply would be incorrect because the primary cause of the delay is C.

2. CPUA has waited the specified time and has been unable to obtain a resource held by CPUB. Even though B holds the resource at a particular point in A's wait, it does not necessarily mean that B is the problem. Between the retries made by A for the resource, it could...
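The first case above (A waits on B, which in turn waits on C) can be illustrated by following a chain of wait-for dependencies. This is a sketch of the problem only, not the time-out/voting algorithm of this disclosure. It assumes a hypothetical table `waits_on` that records which CPU each waiting CPU depends on; the text does not say such a table exists, and indeed notes that the dependency of B on C may never reach the operator.

```python
def root_of_wait_chain(waits_on, start):
    """Follow wait-for dependencies from start to the CPU at the end.

    waits_on maps a waiting CPU to the CPU holding the resource it needs
    (a hypothetical structure). Returns the last CPU in the chain, or
    None if the chain loops back on itself.
    """
    seen = {start}
    current = start
    while current in waits_on:
        current = waits_on[current]
        if current in seen:
            # Circular wait: no single CPU is the root of the delay.
            return None
        seen.add(current)
    return current


# CPUA waits on CPUB, which waits on CPUC.
waits_on = {"CPUA": "CPUB", "CPUB": "CPUC"}
culprit = root_of_wait_chain(waits_on, "CPUA")
```

Here `culprit` is "CPUC" rather than "CPUB", which is why taking B offline on the strength of A's time-out alone would be the wrong action in a 4-way system.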