
Method to determine thermal degradation early and react before thermal temperature limit triggers

IP.com Disclosure Number: IPCOM000236393D
Publication Date: 2014-Apr-24
Document File: 5 page(s) / 45K

Publishing Venue

The IP.com Prior Art Database

Abstract

Disclosed is a method in the area of system-level thermal management for identifying thermally induced degradation of integrated circuits, in order to prevent potential service disruption of computers.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 51% of the total text.



Background:

In any computer system, all the major ASICs are monitored for junction temperature and heat dissipation. Heat removal methods range from plain conduction of heat into the air to aided methods such as a heat sink, which reduces the thermal resistance of the heat path. In some places, an active heat sink with a fan mounted on top of the ASIC is also employed. In many computer systems, especially servers, a separate management controller senses the temperature at the different ASICs and adjusts fan control in a reactive as well as a predictive manner to improve power efficiency. ASICs undergo thermal cycling as workloads switch over time, which causes the components to degrade thermally. In addition, because of process issues, there are cases where certain circuits fuse and cause adjacent circuits to fail as well, producing an avalanche behavior. When this behavior shows up during the product development life cycle, the root cause can be ascertained and fixed. In a customer environment, these issues not only bring down the system but also affect business continuity.

Idea:

In any computer system with integrated circuits that consume significant power and dissipate heat, thermal degradation can be identified effectively by tabulating the temperatures of the sensors present across the ASIC/system at different power levels for that particular sub-system, at a given ambient temperature and fan speed/altitude.

Baselining this data and fusing the information into the management controller that controls the thermal integrity of the ASIC/system, using a degrees/watt standard.

Two possibilities:

Raw data comparison method

Standardizing as a degrees/watt measure for extrapolation and calculation

Sampling the system in real time for the temperatures of the various sensors and the power levels under a standard workload, and comparing the temperatures generated, and their offset, against the normal temperatures seen during baselining.

Alternatively, measure the power to the device directly on an ongoing basis and check whether there is any drastic change in temperature for that power.
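
The baselining and comparison steps above can be sketched as follows. This is a minimal Python illustration, not the disclosure's implementation: the sensor name, the baseline figures, the helper functions, and the choice of "temperature rise over ambient divided by power" as the degrees/watt measure are all assumptions made for the example.

```python
# Hypothetical baseline table: (sensor, power in W) -> temperature in deg C,
# recorded at a fixed ambient temperature and fan speed. Values are made up.
BASELINE_AMBIENT_C = 25.0
BASELINE = {
    ("asic0", 50.0): 55.0,
    ("asic0", 100.0): 85.0,
}


def thermal_resistance(temp_c, ambient_c, power_w):
    """Rise over ambient per watt (the assumed degrees/watt measure)."""
    if power_w <= 0:
        raise ValueError("power must be positive")
    return (temp_c - ambient_c) / power_w


def baseline_resistance(sensor):
    """Average baseline degrees/watt for a sensor over its recorded power levels."""
    values = [
        thermal_resistance(temp, BASELINE_AMBIENT_C, power)
        for (name, power), temp in BASELINE.items()
        if name == sensor
    ]
    if not values:
        raise KeyError(f"no baseline recorded for sensor {sensor!r}")
    return sum(values) / len(values)


def raw_offset(sensor, power_w, temp_c):
    """Raw data comparison: offset from the baseline at the same power level."""
    return temp_c - BASELINE[(sensor, power_w)]


def resistance_ratio(sensor, power_w, temp_c, ambient_c):
    """Degrees/watt comparison: live value relative to baseline (1.0 = unchanged)."""
    live = thermal_resistance(temp_c, ambient_c, power_w)
    return live / baseline_resistance(sensor)
```

The raw comparison only works at power levels that were tabulated during baselining, whereas the degrees/watt form can be evaluated at any measured power, which is what makes it suitable for extrapolation. A ratio noticeably above 1.0 would suggest that the part now runs hotter for the same power than it did at baseline.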

Trigger a caution to multiple sub-components of the system to indicate possible thermal degradation.

Reacting to the trigger by kicking off the various adjustment scripts that need to modify their margins based on the thermal degradation.


Extending beyond a specified limit, invoke a machine shutdown in a contr...
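
The escalation described in the steps above, from a caution with adjustment scripts up to a shutdown beyond a specified limit, might be sketched like this. The threshold values, the `Action` names, and the `choose_action` and `react` helpers are hypothetical; the disclosure gives no numbers, and the input is assumed here to be a temperature offset from the baseline in degrees C.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    CAUTION = "caution"    # notify sub-components and run adjustment scripts
    SHUTDOWN = "shutdown"  # offset is beyond the specified limit


# Hypothetical thresholds, in deg C above the baseline temperature.
CAUTION_OFFSET_C = 5.0
SHUTDOWN_OFFSET_C = 15.0


def choose_action(offset_c):
    """Map an offset from the baseline temperature to a reaction level."""
    if offset_c >= SHUTDOWN_OFFSET_C:
        return Action.SHUTDOWN
    if offset_c >= CAUTION_OFFSET_C:
        return Action.CAUTION
    return Action.NONE


def react(offset_c, adjustment_scripts, shutdown):
    """Run the reaction for an offset and return the action taken.

    adjustment_scripts: callables that receive the offset and adjust margins.
    shutdown: callable that performs the machine shutdown.
    """
    action = choose_action(offset_c)
    if action is Action.CAUTION:
        for script in adjustment_scripts:
            script(offset_c)
    elif action is Action.SHUTDOWN:
        shutdown()
    return action
```

Passing the adjustment scripts and the shutdown routine in as callables keeps the threshold logic separate from whatever the management controller actually does on each platform.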