Vigil: Segregating the SMI by Node
Original Publication Date: 2003-May-07
Included in the Prior Art Database: 2003-May-07
Multinode systems such as the IBM xSeries eServer x440 are designed for massive redundancy and power. However, one drawback of the current design and defaults in a multinode system is the way that System Management Interrupt (SMI) events are processed. For example, consider four (4) interconnected nodes, each with redundant memory and a full array of processors (eight (8) physical processors per node). When an error on one of the nodes generates an SMI, all of the interconnected nodes enter the SMI. If the error is catastrophic, the SMI handler generates a machine check, forcing a reboot of all of the nodes. However, the causing event may not affect the other nodes, and a node that is not affected does not need to reboot. Stopping or rebooting unaffected nodes reduces overall system throughput. What is needed is a way to isolate errors and SMIs by node in an efficient manner.
This invention takes advantage of the current scalability chip architecture to keep the System Management Interrupt from being propagated from one node to another. With propagation prevented, the individual nodes can handle SMIs independently, without impacting the performance of the non-affected nodes. From the operating system perspective, if the system is a fully loaded two-node system (8 processors per node) and an SMI is generated by one node due to an error condition, those 8 processors go into the SMI while the other 8 processors continue to handle their tasks. This allows the SMI handler to perform error correction independently of the rest of the system, without halting the entire multinode system. However, if the SMI is not propagated to all nodes, a method is required to involve other nodes when the original node determines that their participation is needed to recover from the condition.
Disclosed is a system and method that keeps multi-node operation from being impacted by an event occurring on a single member node, unless the affected node determines that other nodes are required to process the event. Typically, in a multinode environment such as the one shown in Figure 1, an interrupt such as an SMI on a single node is propagated across the interconnections that connect the individual nodes together. As a result, each SMI causes not only the node experiencing the causing event to stop normal processing, but all of the physical nodes. To prevent propagation of the SMI from one node to all other physical nodes, a control field is available in the scalability chip in each node. At power up, POST in each node sets the control field to disabled, which prevents propagation of the SMI into other physical nodes. This makes sense in most cases, because a condition triggering an SMI is typically isolated to a single physical node, independent of the other nodes attached. An example is a local hardfile going bad.
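The effect of the control field can be illustrated with a small software model. This is only a sketch under stated assumptions: the disclosure does not give the register layout or any firmware interface, so the names node_t, smi_propagate_enabled, post_init_node and raise_smi are hypothetical, and the control field is modeled as a simple boolean per node.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_NODES 4

/* Hypothetical model of one physical node; field names are illustrative. */
typedef struct {
    bool smi_propagate_enabled; /* models the control field in the scalability chip */
    bool in_smi;                /* true while this node's processors are in the SMI handler */
} node_t;

/* Models the POST step: disable SMI propagation on this node at power up. */
void post_init_node(node_t *node) {
    node->smi_propagate_enabled = false;
    node->in_smi = false;
}

/*
 * Models an SMI raised on nodes[source]. The source node always enters the
 * SMI; the other nodes enter it only if the source node's control field
 * allows propagation. Returns the number of nodes that entered the SMI.
 */
size_t raise_smi(node_t *nodes, size_t node_count, size_t source) {
    assert(node_count <= MAX_NODES && source < node_count);
    size_t entered = 0;
    for (size_t i = 0; i < node_count; i++) {
        if (i == source || nodes[source].smi_propagate_enabled) {
            nodes[i].in_smi = true;
            entered++;
        }
    }
    return entered;
}
```

In this model, after post_init_node has run on every node, an SMI raised on one node of a four-node system is entered by that node alone, and the processors on the remaining three nodes keep running.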
However, for significant conditions that may impact multiple physical nodes, disabling propagation of the event across the nodes will cause the system to misbehave. To keep the benefits of disabling propagation of the interrupt across the interconnection, a new method is required that propagates the SMI to other physical nodes only when it is required. For example, if memory experiences an outage on one node and another node is sharing that physical memory, the second node must be warned of the outage. This can be accomplished by the SMI handler in the originating node.
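One way the originating node's SMI handler might decide which nodes to involve is sketched below. The disclosure does not describe the decision logic or the signaling mechanism, so everything here is an assumption made for illustration: the event classes, the smi_context_t structure, the sharing_nodes bitmask and the nodes_to_notify function are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_NODES 4

/* Hypothetical event classes, taken from the two examples in the text. */
typedef enum {
    EVENT_LOCAL_DISK_FAILURE,   /* isolated to the originating node */
    EVENT_SHARED_MEMORY_OUTAGE  /* may affect nodes sharing the memory */
} smi_event_t;

/* Hypothetical description of the condition that raised the SMI. */
typedef struct {
    smi_event_t type;
    uint8_t sharing_nodes; /* assumed bookkeeping: bit i set means node i uses the affected resource */
} smi_context_t;

/*
 * Returns a bitmask of the other nodes that the originating node's SMI
 * handler should involve. Isolated events return 0, so no other node is
 * interrupted. The originating node itself is never included.
 */
uint8_t nodes_to_notify(const smi_context_t *ctx, unsigned origin) {
    assert(origin < MAX_NODES);
    if (ctx->type != EVENT_SHARED_MEMORY_OUTAGE) {
        return 0;
    }
    uint8_t valid = (uint8_t)((1u << MAX_NODES) - 1u);
    uint8_t self = (uint8_t)(1u << origin);
    return (uint8_t)(ctx->sharing_nodes & valid & (uint8_t)~self);
}
```

With this sketch, a failed local hardfile yields an empty mask, while a memory outage on node 0 whose memory is also used by node 2 yields a mask naming only node 2. How the handler then signals those nodes is left open here, as it is in the text above.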