
Intel CPU IERR Filtering Algorithm and Fault Isolation Methodology

IP.com Disclosure Number: IPCOM000030691D
Original Publication Date: 2004-Aug-23
Included in the Prior Art Database: 2004-Aug-23
Document File: 3 page(s) / 37K

Publishing Venue

IBM

Abstract

The internal error (IERR) signal on Intel microprocessors (CPUs) can be triggered by a number of external events in addition to an internal CPU error. An algorithm and methodology are disclosed here that filter IERR events so that unnecessary replacement of CPUs can be avoided and the actual causes of the events can be isolated.

This text was extracted from a PDF file.
At least one non-text object (such as an image or picture) has been suppressed.
This is the abbreviated version, containing approximately 49% of the total text.



     CPUs manufactured by Intel Corporation have an internal error (IERR) signal that is usually tied to an interrupt line or to an external monitor. The original purpose of this signal was to indicate an internal, unrecoverable CPU error, and the normal procedure is to replace any CPU that signals an IERR. More recently, the IERR signal has also been triggered by non-CPU faults: bus time-outs, forward-progress stalls in multi-processor configurations, evaluation versions of Windows operating systems, and other non-hardware fault triggers have all been identified as causes of an active IERR signal. On occasion, two or more CPUs signal an IERR simultaneously. Laboratory experience has shown that the CPU(s) can be restarted and the IERR often does not recur. In the field, the result is that non-faulty CPUs are replaced first, and the real source of the IERR is discovered only after multiple IERR events and one or more CPU replacements.

     The IERR Filtering Algorithm and Fault Isolation Methodology differentiates IERR events that are the result of hardware faults from IERR events due to other causes, and it enables a methodology for isolating the causes of the events that trigger the IERR signal. IERR events are filtered based on known causes, indicators, and most probable events, and the system is restarted automatically in most cases. The filtering algorithm and methodology take advantage of supplemental information, fault probabilities, and prior knowledge about IERR events to determine when a likely false CPU internal error has been signaled.

     The IERR Filtering algorithm can be implemented as shown in the following flow diagram. In this flow, the H8 is an external processor monitoring the server system.

[Flow diagram: IERR filtering algorithm. Only the box labels survive from the extracted figure; the connecting arrows and the assignment of Yes/No branch labels to decisions were lost.]

Start and wait loop:
    Start (from H8 reset)
    Reset all IERR flags
    Waiting for I-Error
    I-Error detected?

Reset sequence:
    1. H8: Reset/reboot system (with no additional CPUs held off)
    2. BIOS: Logs MC regs to MM
    3. BIOS: If 3-strike set, send msg to MM
    4. BIOS: Clears MC regs
    BIOS: Sends status to H8

Decisions and branch labels:
    H8 received ack from BIOS?
    Time out
    Multiple CPU IERR?
    Same CPU IERR flag set?
    3-strike bit set?
    Single CPU active?

Actions:
    Restart all CPUs
    Restart all but faulted CPU
    Power cycle system
    Set faulted CPU IERR flag; clear other CPU IERR flags; turn on System Information LED
    Clear CPU IERR flags; turn on System Information LED
    Log error msg #1; clear IERR flags; turn failed CPU LED on; turn on System Error LED
    Log error msg #2
    Log error msg #3
    Log error msg #4; clear IERR flags; turn on System Information LED
    Turn on System Error LED; log error msg #5; turn CPU LED on; clear IERR flags; power down

Diagram revision note: clp 10-16-2003, rev. wes 11-12-2003
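
The per-CPU IERR flag handling named in the diagram ("Multiple CPU IERR?", "Same CPU IERR flag set?", "Set faulted CPU IERR flag", "Clear other CPU IERR flags") can be illustrated with a small sketch. Because the diagram's arrows did not survive extraction, the mapping from each decision to each action below is an assumption, not the disclosed flow. The names IerrFilter, Action, and on_ierr are hypothetical, and the 3-strike bit, the BIOS acknowledgement, LEDs, and error logging are left out.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    # Action labels taken from the diagram; when each one fires is assumed.
    RESTART_ALL = "restart all CPUs"
    RESTART_WITHOUT_FAULTED = "restart all but faulted CPU"


@dataclass
class IerrFilter:
    """Hypothetical monitor-side state: CPUs whose IERR flag is currently set."""
    flags: set = field(default_factory=set)

    def on_ierr(self, signaling_cpus):
        """Choose a restart action for one IERR event (assumed branch order)."""
        cpus = set(signaling_cpus)
        if not cpus:
            raise ValueError("on_ierr needs at least one signaling CPU")
        if len(cpus) > 1:
            # Assumption: simultaneous IERRs point to a non-CPU cause,
            # so all flags are cleared and every CPU is restarted.
            self.flags.clear()
            return Action.RESTART_ALL
        (cpu,) = cpus
        if cpu in self.flags:
            # Assumption: a repeat IERR from the same CPU is treated as a
            # likely genuine fault, so that CPU is held off on restart.
            self.flags.clear()
            return Action.RESTART_WITHOUT_FAULTED
        # First IERR from this CPU: set its flag, clear the others, retry.
        self.flags = {cpu}
        return Action.RESTART_ALL
```

In this sketch a CPU is held off only after it signals an IERR twice in a row with no intervening IERR from another CPU, which mirrors the stated goal of not replacing a CPU on its first IERR.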

In the implementation shown in the flow diagram, the system runs until an IERR signal is detected by the H8. The H8 warm restarts the system. On reboot, system BIOS retrieves the contents of the machine check registers loca...