
Method for detecting cooling degradation in a server system
Disclosure Number: IPCOM000244717D
Publication Date: 2016-Jan-06
Document File: 5 page(s) / 532K

Publishing Venue

The Prior Art Database


The efficiency of the cooling subsystem degrades due to dust, wear and tear, and blockage. Therefore, after a couple of years of operation, running the fans at a given speed may no longer provide the same level of cooling as when the system was shipped. The proposed method seeks to characterize and detect cooling degradation in a system. This enables the adoption of corrective techniques, depending on the extent of degradation detected, and thereby improves the overall reliability of the system.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 47% of the total text.


Method for detecting cooling degradation in a server system

1. Introduction:

Air cooling is still the most widely used method for thermal management in a server system. Factors such as dust, wear and tear, and blockage are known to impair the performance of a cooling system. As a result, a fan running at a particular speed (RPM) provides less cooling than when the system was shipped. This causes reliability concerns, performance degradation, and energy loss. The cost of cooling in a data center has been increasing over the years; existing work also indicates that it is now approximately equal to the total computing cost in enterprise applications. Therefore, it is essential to keep the cooling system operational without much degradation.

Typically, a server system has more than one cooling component, for example multiple fans and multiple modular refrigeration units (MRUs). In such scenarios, the system should be able to detect the degradation of each cooling component, as well as the extent of degradation that has occurred at a particular point in time. An appropriate method is required to detect and quantify cooling degradation in the field to enable timely corrective action, depending on the extent of degradation. Prior work [1, 2, 3] focused on monitoring the system's cooling profile based on a thermal resistance model or predefined control parameters. The proposed method performs system cooling degradation characterization in a controlled environment (for example, a hardware lab). The calibrated information and workload binaries are then packaged along with the system vital product data and firmware. An offline characterization scheme is then triggered in the field, which enables cooling degradation detection in the field.
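
As a rough illustration of quantifying the extent of degradation per cooling component, the sketch below compares a field reading against a calibrated lab baseline. The metric used here (fractional increase in the inlet-to-outlet temperature rise), the 15% threshold, and all names such as Reading and degraded_components are assumptions made for illustration; the disclosure text available here does not specify them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    """One temperature measurement taken under the same workload and fan speed."""
    inlet_c: float   # inlet temperature, degrees Celsius
    outlet_c: float  # outlet or on-chip temperature, degrees Celsius

    @property
    def rise_c(self) -> float:
        """Temperature rise across the cooled component."""
        return self.outlet_c - self.inlet_c


def degradation_extent(baseline: Reading, field: Reading) -> float:
    """Fractional increase of the temperature rise over the lab baseline.

    This metric is an assumption for illustration, not the disclosure's formula.
    """
    if baseline.rise_c <= 0:
        raise ValueError("baseline temperature rise must be positive")
    return (field.rise_c - baseline.rise_c) / baseline.rise_c


def degraded_components(baselines, field_readings, threshold=0.15):
    """Return {component: extent} for components whose extent exceeds the threshold."""
    result = {}
    for name, field in field_readings.items():
        extent = degradation_extent(baselines[name], field)
        if extent > threshold:
            result[name] = extent
    return result
```

For example, with a lab baseline of 25 C in and 45 C out for two hypothetical fans, a field reading of 25 C in and 50 C out on one of them would be reported with an extent of 0.25, which could then drive the choice of corrective action.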

2. Overview and Description:

Figure 1 shows a typical cooling subsystem in a server system. Air flows from the fan module from left to right, passing across the memory, processor, and I/O sub-systems. A fan module consists of multiple fans, and a server system may contain many such fan modules. Fans are usually controlled by a thermal management system, which sets the fan speed based on system power and temperature. Over time, fan performance may degrade due to dust accumulation and other wear and tear. Figure 2 illustrates a degraded fan, which in turn causes temperature hot spots in the system.



Fig 1

Fig 2

The proposed technique can be performed in the following phases and steps:



A. Perform system cooling degradation characterization in a controlled environment:

1- The system is initially placed in a predefined configuration/condition (for example: cooling controls, rail voltages, frequencies).

2- A constant-utilization workload is run, targeting a particular sub-system in the system.

3- The outlet temperature or on-chip temperature (To) is measured for a given inlet temperature (Ti), for the system as a whole and for the different sub-systems.
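
The characterization steps listed so far could be recorded along the following lines. This is a minimal sketch under stated assumptions: the function name build_calibration_table, the tuple layout of the samples, and the choice to average repeated To readings per (sub-system, Ti) pair are all hypothetical and are not taken from the disclosure.

```python
from statistics import mean


def build_calibration_table(config, samples):
    """Summarize lab measurements into a baseline table.

    config  -- the predefined configuration from step 1, e.g. fan speed,
               rail voltages, frequencies (keys are hypothetical)
    samples -- iterable of (subsystem, inlet_c, outlet_c) tuples measured
               while the constant-utilization workload of step 2 is running
    """
    grouped = {}
    for subsystem, inlet_c, outlet_c in samples:
        # Group repeated To readings taken at the same Ti for one sub-system.
        grouped.setdefault((subsystem, inlet_c), []).append(outlet_c)

    # Averaging repeated readings is an assumption made for this sketch.
    baseline = {key: mean(values) for key, values in grouped.items()}
    return {"config": dict(config), "baseline_outlet_c": baseline}
```

A table of this kind is one plausible form for the calibrated information that is later packaged with the system; the actual format used by the method is not given in the text available here.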