Method for detecting gradual fan failure in high availability platforms and computers

IP.com Disclosure Number: IPCOM000019983D
Publication Date: 2003-Oct-15
Document File: 3 page(s) / 106K

Publishing Venue

The IP.com Prior Art Database

Abstract

Disclosed is a method for detecting gradual fan failure in high availability platforms and computers. Benefits include increased high availability and improved system reliability.

This is the abbreviated version, containing approximately 54% of the total text.

Background

         In managed-chassis and other high-availability products, fan speed is continuously monitored. When the measured speed crosses a preprogrammed threshold, an alarm event is generated.
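The threshold check described above can be sketched roughly as follows. This is a minimal illustration only, not the disclosed method: the names `FanThresholds` and `check_fan_speed`, the use of RPM as the unit, and the threshold values are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FanThresholds:
    """Hypothetical preprogrammed limits for one fan, in RPM."""
    lower_rpm: float
    upper_rpm: float


def check_fan_speed(measured_rpm: float, thresholds: FanThresholds) -> Optional[str]:
    """Return an alarm message if the measured speed crosses a limit, else None."""
    if measured_rpm < thresholds.lower_rpm:
        return "ALARM: fan speed below lower threshold"
    if measured_rpm > thresholds.upper_rpm:
        return "ALARM: fan speed above upper threshold"
    return None


# Illustrative values only (assumed, not taken from the disclosure).
limits = FanThresholds(lower_rpm=4000.0, upper_rpm=9000.0)
print(check_fan_speed(6000.0, limits))  # within limits, no alarm
print(check_fan_speed(3500.0, limits))  # below the lower limit, alarm
```

In a real platform this check would run periodically on either the intelligent fan tray or the shelf/chassis management controller; the sketch leaves out polling and event delivery.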

         Fans conventionally have a very low mean time between failures (MTBF). Early detection of fan failure enables replacement of the faulty fan or fan tray before it can impact service. Fan speed monitoring and alarm event generation can be performed either by an intelligent fan tray itself or by a shelf/chassis management controller.

         In high-availability modular card platforms, fan speed is normally set below 100%, for example at 80%, under normal operating conditions. If a fan, a fan tray, or the facility air conditioning fails, the working fans are run at 100% speed to provide additional airflow. Fans are typically not run at 100% in normal operation, in order to limit acoustic noise and improve the MTBF. Fans are conventionally run at 100% speed only in case of an over-temperature condition resulting from a failure.

         Most fan failures are gradual and result in a gradual speed reduction. Assume the fans are running at less than full speed, say 80%. Gradual deterioration of a fan’s bearings and lubricant is not detected until it causes the fan speed to drop below 80%. As long as the fan speed is above the preset lower limit, no alarm event is generated. The gradual deterioration of the fan’s capability to run at full speed when needed remains undetected. An undetected failure can create a single point of failure, which impacts the availability of the entire platform.
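The blind spot described above can be illustrated with a small model. This is a sketch under stated assumptions, not part of the disclosure: it assumes a worn fan can be modeled by a cap on its achievable speed, and the names `actual_speed_pct` and `lower_limit_alarm`, the 75% lower limit, and the 85% wear cap are all hypothetical values chosen for the example.

```python
def actual_speed_pct(commanded_pct: float, max_achievable_pct: float) -> float:
    """Speed the fan actually reaches: the commanded speed, capped by wear."""
    return min(commanded_pct, max_achievable_pct)


def lower_limit_alarm(measured_pct: float, lower_limit_pct: float) -> bool:
    """Conventional check: alarm only if the speed drops below the lower limit."""
    return measured_pct < lower_limit_pct


NORMAL_SETPOINT_PCT = 80.0  # normal operating speed, as in the text
LOWER_LIMIT_PCT = 75.0      # hypothetical preset lower limit
WORN_FAN_MAX_PCT = 85.0     # hypothetical cap caused by bearing wear

# Normal operation: the worn fan still reaches the 80% setpoint, so no alarm.
normal = actual_speed_pct(NORMAL_SETPOINT_PCT, WORN_FAN_MAX_PCT)
print(normal, lower_limit_alarm(normal, LOWER_LIMIT_PCT))

# Emergency: 100% is commanded, but the worn fan cannot exceed its cap.
emergency = actual_speed_pct(100.0, WORN_FAN_MAX_PCT)
print(emergency, lower_limit_alarm(emergency, LOWER_LIMIT_PCT))
```

In both cases the lower-limit check stays silent, even though in the emergency case the fan delivers less than the full speed the platform is relying on.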

         For example, a fan’s bearings may have deteriorated while its speed remains above the lower limit. If the facility air conditioning or another fan fails in this state, the remaining fans are supposed to speed up to 100% to compensate for the resulting lower airflow. However, because of the undetected deterioration of the bearings, one or more fans are unable to reach full speed. This condition results in overheating of the chassis and its contents. By the time an operator can be dispatched to replace the failed fan/f...