Method For Handling Potentially Critical Errors

IP.com Disclosure Number: IPCOM000023310D
Original Publication Date: 2004-Mar-29
Included in the Prior Art Database: 2004-Mar-29
Document File: 3 page(s) / 59K

Publishing Venue

IBM

Abstract

A method is described for handling power faults in a system with redundant power components. The method handles both failure of a power component and protection from data loss during a primary AC power source drop-out.

This text was extracted from a PDF file.
At least one non-text object (such as an image or picture) has been suppressed.
This is the abbreviated version, containing approximately 52% of the total text.

In a computer system that can detect Early Power Off Warnings (EPOWs, that is, warnings of impending AC power failure), a potential problem arises when redundant power supplies are used.

With redundant power supplies, the system is designed to continue operating if a single power supply fails. The operating system (OS) regards such a failure as a non-critical event: it is notified of the failed component and schedules a service call to replace the defective part. Normal operation continues, but the system records that it is running in degraded reliability mode. Once the defective power component is replaced, the system returns to full functionality.
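The non-critical path described above can be pictured as a small piece of bookkeeping. The following is a minimal sketch only; the class name `PowerRedundancyState`, its methods, and the service-call strings are hypothetical and do not come from the disclosure.

```python
from dataclasses import dataclass, field


@dataclass
class PowerRedundancyState:
    """Hypothetical record of which redundant supplies are currently failed."""

    failed_supplies: set = field(default_factory=set)
    service_calls: list = field(default_factory=list)

    @property
    def degraded(self) -> bool:
        # The system is in degraded reliability mode while any supply is failed.
        return bool(self.failed_supplies)

    def report_supply_failure(self, supply_id: str) -> None:
        # Non-critical event: note the failure and schedule one service call.
        if supply_id not in self.failed_supplies:
            self.failed_supplies.add(supply_id)
            self.service_calls.append(f"replace power supply {supply_id}")

    def report_supply_replaced(self, supply_id: str) -> None:
        # Replacing the defective part clears its failed status.
        self.failed_supplies.discard(supply_id)
```

Normal operation is never interrupted in this sketch; only the `degraded` flag and the pending service call change.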

Under normal operating conditions, a failure of the AC power source is detected by the power supplies and reported to the OS. The report is raised whenever a power supply senses that AC power is out of specification and that at most a few milliseconds of power remain before total loss. The OS regards this as a critical event and takes immediate action to protect customer data: it halts all further operations with the storage sub-system so that no data writes are in progress when power is finally lost. The system then repeatedly polls the power supplies, waiting for power either to return to specification or to fail completely.
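The critical path (halt storage traffic, then poll) might look roughly like the sketch below. All names here (`wait_out_epow`, `PowerStatus`, the callback parameters) are hypothetical. Resuming storage I/O once power is back in specification is an assumption, since the text only says the system waits for that outcome, and `max_polls` exists purely so the example terminates; on real hardware a complete power loss would end the loop.

```python
import time
from enum import Enum


class PowerStatus(Enum):
    IN_SPEC = "in_spec"
    OUT_OF_SPEC = "out_of_spec"


def wait_out_epow(halt_storage_io, resume_storage_io, read_power_status,
                  poll_interval_s=0.001, max_polls=None):
    """Halt storage writes, then poll until AC power is back in specification.

    Returns True if power recovered, False if the poll budget ran out.
    """
    # Stop storage operations first so no write is in flight at power loss.
    halt_storage_io()
    polls = 0
    while max_polls is None or polls < max_polls:
        if read_power_status() is PowerStatus.IN_SPEC:
            # Assumption: normal operation resumes once power has recovered.
            resume_storage_io()
            return True
        polls += 1
        time.sleep(poll_interval_s)
    return False
```

The callbacks keep the sketch independent of any particular platform interface; a caller would supply whatever mechanism its system uses to quiesce storage and read supply status.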

In an AC outage, the two power supplies may detect and report the AC loss at slightly different times because of internal sensing differences and manufacturing tolerances. The lag between the two reports may be exaggerated by slow-loss brown-out conditions. When a single power supply reports an EPOW, one of two things may be happening: either AC power really is about to be lost, or the power supply has developed an internal problem. The required responses are quite different. In the first case, AC power is failing and the OS must be alerted at once so that it can act to protect data. In the second case, the OS may be told about the failure at leisure so that a service action can be scheduled, and the system enters degraded reliability mode. The problem is compounded by the fact that, under some conditions, either event may be intermittent, which makes determining the true cause much more difficult.
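The ambiguity can be made concrete by asking what a single snapshot of EPOW flags is able to rule out. In this small sketch the names `Cause` and `possible_causes` are hypothetical, and treating "every supply reports an EPOW" as an AC loss is a simplifying assumption that ignores simultaneous internal faults.

```python
from enum import Enum


class Cause(Enum):
    AC_LOSS = "ac_loss"
    SUPPLY_FAULT = "supply_fault"


def possible_causes(epow_flags):
    """Return the causes consistent with one snapshot of per-supply EPOW flags."""
    flags = list(epow_flags)
    if not any(flags):
        # No supply is reporting an EPOW: nothing to explain.
        return set()
    if all(flags):
        # Assumption: every supply reporting at once points to the AC source.
        return {Cause.AC_LOSS}
    # Only some supplies report: the other supply may simply be lagging
    # (AC loss), or the reporting supply may have an internal problem.
    return {Cause.AC_LOSS, Cause.SUPPLY_FAULT}
```

For two supplies, `possible_causes([True, False])` returns both causes, which is exactly the case a single snapshot cannot resolve.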

The solution is to treat every power sub-system problem as critical until it can be determined to be less severe. For EPOWs, this means treating a failing redundant power supply the same as an AC loss until the true nature of the problem is discovered. The failure may then be reported and a repair scheduled as usual.
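One possible reading of this "critical until proven otherwise" policy is sketched below. It is not a reconstruction of the disclosure's own algorithm. The confirmation window (`confirm_polls`), the rule that an AC loss is confirmed once every supply has reported an EPOW, and all function and parameter names are assumptions made for illustration.

```python
from enum import Enum


class Verdict(Enum):
    AC_LOSS = "ac_loss"
    SUPPLY_FAULT = "supply_fault"


def handle_epow(first_reporter, supply_ids, read_epow_flags, halt_storage_io,
                resume_storage_io, schedule_service, confirm_polls=50):
    """Treat any EPOW as critical, then downgrade it if only some supplies report."""
    # Step 1: assume the worst and protect data immediately.
    halt_storage_io()

    # Step 2: remember every supply that has reported, so an intermittent
    # flag that clears between polls is not forgotten.
    reported = {first_reporter}
    all_ids = set(supply_ids)
    for _ in range(confirm_polls):
        flags = read_epow_flags()  # assumed shape: {supply_id: bool}
        reported |= {sid for sid, flag in flags.items() if flag}
        if reported >= all_ids:
            # Every supply has seen the event: storage stays halted.
            return Verdict.AC_LOSS

    # Step 3: the other supplies stayed healthy for the whole window, so the
    # event is downgraded to a component failure and handled at leisure.
    resume_storage_io()
    for sid in sorted(reported):
        schedule_service(sid)
    return Verdict.SUPPLY_FAULT
```

The cost of this design is a short pause in storage traffic for every power supply fault; the benefit is that a genuine AC loss is never handled late. A real implementation would also pace the polls and size the window to the expected reporting lag between supplies.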

The following flow chart shows the process for treating both error conditions in a system with redundant power components.

The algorithm depicted in the flow chart may be im...