Method to Wakeup Users Out Of Wait States in Order to Break Hangs

IP.com Disclosure Number: IPCOM000033024D
Original Publication Date: 2004-Nov-22
Included in the Prior Art Database: 2004-Nov-22
Document File: 6 page(s) / 76K

Publishing Venue

IBM

Abstract

Method of breaking tasks out of a hang situation for a program

This is the abbreviated version, containing approximately 18% of the total text.

Main Idea

Most operating systems provide a means to cancel or force user tasks while they are actively running. This is one way to break a hang (due to an internal error in a system component, for example). Because the user can cancel or interrupt running system functions, a cancel or force may be issued when there really is no hang. In these cases it is important to make sure a critical system function completes and leaves the system in a consistent state, especially if the system function is responsible for updating permanent data structures. Additionally, if a system component has a hang condition, it would be useful to provide a simple command for that system component that breaks any user out of a hang, or to have the program handle it automatically.

Described here is an algorithm and method to detect hanging tasks and break them out of a wait condition (such as a lock wait, an IO wait, or some other event wait), together with a command to break user tasks out of a hang condition for a system component. The algorithms can be used for any system application; they do not require updates to the system scheduler or dispatcher (which has greater control of running applications). The main concept is that every lock wait, sleep, and wait for an event has not only the normal poster (the thread that releases the lock, does a wakeup call, or posts an event control block (ECB)) but also a breaker thread that can run at any time during processing. Thus algorithms need to be in place to handle an additional posting thread that can appear at any time during a task's wait. Since an external command is provided, means must also be in place to handle the case where a break occurs during normal processing, when a task may not really be hung. This capability is also extremely useful for testing program internal error paths in an automated fashion.

There are three parts to the algorithm to facilitate the hang breaking:
1. All lock waits, sleeps, waits on events (ECBs), and IO waits are handled by calling a common TASKWAIT routine that obtains a WAIT_ELEMENT that describes the wait. This structure has fields used to facilitate hang-breaking plus fields to indicate the type of wait (IO wait, lock wait, and so on). These WAIT_ELEMENTs may be kept in lists by the lock facility, so it must be ensured that the breaker thread references only valid WAIT_ELEMENTs and that each WAIT_ELEMENT is returned to free storage when it is no longer needed, even if a hang break comes in (otherwise a storage leak would occur).
2. A task is represented by a data structure (called a recovery block, or CM_RECOV for short) that is kept in a global list, which allows the hang break command to scan all active tasks, detect hanging tasks, and break them out of their hangs. There is also a free list of available CM_RECOV structures (for future tasks that get created or call the system p...
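
The two parts visible in this abbreviated text can be illustrated with a minimal Python sketch. It is not the disclosed implementation: the names HangBreaker, WaitElement, RecoveryBlock, task_wait, post and break_hangs are hypothetical stand-ins for TASKWAIT, WAIT_ELEMENT and CM_RECOV, a Python condition variable stands in for lock, ECB and IO waits, and the "first poster wins" rule is an assumption about how a normal poster and a breaker might be reconciled. The excerpt does not give the criteria for detecting a hung task, so this sketch simply treats every waiting task as a candidate.

```python
import threading
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class WaitType(Enum):
    LOCK = "lock"
    IO = "io"
    EVENT = "event"


class WakeReason(Enum):
    POSTED = "posted"  # woken by the normal poster
    BROKEN = "broken"  # woken by the breaker


@dataclass
class WaitElement:
    """Describes one wait (hypothetical stand-in for WAIT_ELEMENT)."""
    wait_type: WaitType
    reason: Optional[WakeReason] = None
    cond: threading.Condition = field(default_factory=threading.Condition)

    def wake(self, reason: WakeReason) -> bool:
        """Wake the waiter. Assumption: the first poster wins, later ones are ignored."""
        with self.cond:
            if self.reason is not None:
                return False
            self.reason = reason
            self.cond.notify_all()
            return True


@dataclass
class RecoveryBlock:
    """Represents one task (hypothetical stand-in for CM_RECOV)."""
    name: str
    current_wait: Optional[WaitElement] = None


class HangBreaker:
    def __init__(self) -> None:
        self._lock = threading.Lock()            # protects the global task list
        self._active: List[RecoveryBlock] = []   # global list of active tasks

    def register(self, name: str) -> RecoveryBlock:
        task = RecoveryBlock(name)
        with self._lock:
            self._active.append(task)
        return task

    def task_wait(self, task: RecoveryBlock, wait_type: WaitType,
                  timeout: Optional[float] = None) -> Optional[WakeReason]:
        """Common wait routine: every kind of wait goes through here."""
        elem = WaitElement(wait_type)
        with self._lock:
            task.current_wait = elem
        try:
            with elem.cond:
                elem.cond.wait_for(lambda: elem.reason is not None, timeout)
                return elem.reason  # None means the wait timed out
        finally:
            # Detach the element on every exit path, including a hang break,
            # so the breaker never sees a stale element and nothing leaks.
            with self._lock:
                task.current_wait = None

    def post(self, task: RecoveryBlock) -> bool:
        """Normal poster: wake one specific task."""
        with self._lock:
            elem = task.current_wait
        return elem is not None and elem.wake(WakeReason.POSTED)

    def break_hangs(self) -> int:
        """Breaker: scan all active tasks and wake each one that is waiting."""
        with self._lock:
            waiting = [t.current_wait for t in self._active
                       if t.current_wait is not None]
        return sum(1 for elem in waiting if elem.wake(WakeReason.BROKEN))
```

A caller of task_wait inspects the returned reason: POSTED means the wait ended normally, while BROKEN means the caller should take its error or recovery path. Returning the reason instead of raising keeps the decision with the waiting task, which matters when a break arrives while the task is not actually hung.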