
A Method of SMP-SMP Expansion Module POST/BIOS Boot Monitoring and Failure Recovery using Independent Flash ROMs

IP.com Disclosure Number: IPCOM000020621D
Original Publication Date: 2003-Dec-04
Included in the Prior Art Database: 2003-Dec-04
Document File: 4 page(s) / 55K

Publishing Venue

IBM

Abstract

On multi-node scalable systems such as the IBM xSeries x440 and x445, a single node might fail POST/BIOS execution, which would cause the entire partition to fail to boot. Disclosed is a hardware/software solution that allows a partition to recover from such an error, maximizing system availability to the end user.


  Figure 1 below illustrates a single system composed of two Symmetric Multiprocessor (SMP) Expansion Modules, commonly called nodes. Each node has its own memory, processor(s), and POST/BIOS Flash ROM. The two nodes are connected to each other through the scalability controller and I/O controller chipset. Furthermore, the two nodes share a common LPC interface to the system Baseboard Management Controller (BMC), which is responsible for monitoring and logging system and chassis events.

  On initial power-on, both nodes fetch from their respective Flash ROMs and begin to execute POST/BIOS. As is common across IA32 systems, the POST/BIOS firmware is divided into multiple initialization and test routines, where each routine is assigned a unique "checkpoint" number. As part of the routine calling function, each node registers the checkpoint of the routine about to be invoked, along with a corresponding node number (0 for bottom, 1 for top), with the BMC. As part of this same calling routine, each node retrieves the current checkpoint of the opposite node. If the checkpoint has changed since the last time it was checked, the node generates a new timestamp for the new checkpoint; otherwise, the node determines whether a time-out threshold has been exceeded. Exceeding the threshold indicates that the opposite node has failed to return from the specific checkpoint and is considered to be in a locked or frozen state, as sketched in the monitoring routine below.
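  A minimal C sketch of this register-and-monitor logic follows. The disclosure does not give an implementation, so all names here (bmc_write_checkpoint, bmc_read_checkpoint, read_timer_ticks, reset_node, CHECKPOINT_TIMEOUT_TICKS) are hypothetical stand-ins for the BMC accessors over the shared LPC interface and for the node reset path.

  #define CHECKPOINT_TIMEOUT_TICKS  5000   /* hypothetical time-out threshold */

  /* Hypothetical BMC accessors over the shared LPC interface. */
  extern void          bmc_write_checkpoint(int node, unsigned char checkpoint);
  extern unsigned char bmc_read_checkpoint(int node);
  extern unsigned long read_timer_ticks(void);  /* free-running tick counter */
  extern void          reset_node(int node);    /* assert reset to a node */

  static unsigned char last_peer_checkpoint;
  static unsigned long last_peer_timestamp;

  /*
   * Called by the POST/BIOS routine dispatcher before each routine runs.
   * 'my_node' is 0 (bottom) or 1 (top); 'checkpoint' identifies the routine.
   */
  void checkpoint_enter(int my_node, unsigned char checkpoint)
  {
      int peer = my_node ^ 1;
      unsigned char peer_cp;

      /* Register our own progress with the BMC. */
      bmc_write_checkpoint(my_node, checkpoint);

      /* Monitor the opposite node's progress. */
      peer_cp = bmc_read_checkpoint(peer);
      if (peer_cp != last_peer_checkpoint) {
          /* Peer advanced to a new checkpoint: restart its watchdog window. */
          last_peer_checkpoint = peer_cp;
          last_peer_timestamp  = read_timer_ticks();
      } else if (read_timer_ticks() - last_peer_timestamp >
                 CHECKPOINT_TIMEOUT_TICKS) {
          /* Peer has been stuck at one checkpoint too long: treat it as
           * locked/frozen and recover it by issuing a reset, as described
           * in the recovery discussion below. */
          reset_node(peer);
          last_peer_timestamp = read_timer_ticks();
      }
  }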

  At this point, the node considered to be locked can be recovered by the opposite node, which issues a reset to the loc...