Method for Pseudo On-line Firmware Updates for Critical Subsystems on PC Servers
Original Publication Date: 2001-Dec-07
Included in the Prior Art Database: 2003-Jun-20
Disclosed is a method for performing pseudo on-line firmware updates of critical subsystems, such as storage adapters (e.g. RAID, Fibre Channel, and SCSI adapters), storage devices (e.g. hard drives and tape drives), network cards, and other devices, on industry-standard PC servers. Typically, these subsystems require any running operating systems to be shut down and the server to be restarted before and after firmware updates are applied. This method uses a Service Processor to transmit the updates to the server and coordinate the update process. It relies on the power management and hot-swap capabilities of the server to perform the updates.
To perform an update, the Service Processor first suspends the operation of all operating systems, using power management capabilities to quiesce and flush data buffers and to halt the main CPUs. Next, the Service Processor performs the actual update of the subsystem (for example, by transferring the update directly over the system bus, or by staging the update to be completed by the System BIOS on resume). If necessary, the Service Processor then resets the subsystem (for example, by using hot-plug capabilities to control power to the device). Finally, the Service Processor brings the CPUs and all operating systems back on-line using a resume command. During the resume, the System BIOS has the opportunity to perform any operations needed to complete the update before the operating systems come back on-line. As a result, the server is off-line only momentarily, no state is lost, and no operating system is shut down.
It is important that customers have a simple way to apply updates on a regular basis. However, updating critical subsystems often poses a significant challenge because customers must schedule downtime on their servers in order to perform the updates.
Scheduling downtime to reboot a server and perform firmware updates may be next to impossible on mission-critical servers, which ironically are the servers that most need to stay up to date. This method minimizes the time that the server is off-line and, most importantly, does not require a lengthy reboot of all of the operating systems running on the server.
This problem has been addressed on high-end servers (e.g. IBM 390s) using specialized, redundant hardware. In such a redundant hardware environment, a particular subsystem can be taken off-line for an update and then brought back on-line without requiring a reboot. The method disclosed here, by contrast, is a cost-effective way of providing a similar form of high availability during updates for industry-standard PC servers (which typically do not have redundancy for all subsystems). A number of automated firmware update solutions (e.g. IBM LANClient Control Manager) exist for PC servers, but all of these require the server to be rebooted in order to perform the updates or to activate the new firmware level.
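The update sequence described earlier (suspend, update, optional reset, resume) can be illustrated with a small simulation. This is only a sketch under stated assumptions: the class SimulatedServer, the function pseudo_online_update, and all method and parameter names are hypothetical and do not correspond to any real Service Processor interface. Resuming the operating systems even when the update step fails is likewise an assumption made for the sketch, not something the disclosure specifies.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SimulatedServer:
    """Hypothetical model of a server as seen by the Service Processor."""
    os_state: str = "running"        # "running" or "suspended"
    firmware: str = "1.0"            # firmware level active on the subsystem
    staged: Optional[str] = None     # update staged for the System BIOS, if any
    log: List[str] = field(default_factory=list)

    def _require_suspended(self) -> None:
        # The subsystem must not be touched while operating systems are running.
        if self.os_state != "suspended":
            raise RuntimeError("operating systems must be suspended first")

    def suspend(self) -> None:
        # Stands in for the power-management suspend: flush buffers, halt CPUs.
        self.os_state = "suspended"
        self.log.append("suspend")

    def write_firmware(self, level: str) -> None:
        # Stands in for transferring the update directly over the system bus.
        self._require_suspended()
        self.firmware = level
        self.log.append("update")

    def stage(self, level: str) -> None:
        # Stands in for staging the update for the System BIOS to finish.
        self._require_suspended()
        self.staged = level
        self.log.append("stage")

    def reset_subsystem(self) -> None:
        # Stands in for a hot-plug power cycle of the device.
        self._require_suspended()
        self.log.append("reset")

    def resume(self) -> None:
        # On resume, the System BIOS completes any staged update before
        # the operating systems come back on-line.
        if self.staged is not None:
            self.firmware = self.staged
            self.staged = None
            self.log.append("bios-complete")
        self.os_state = "running"
        self.log.append("resume")


def pseudo_online_update(server: SimulatedServer, new_level: str,
                         stage_for_bios: bool = False,
                         needs_reset: bool = True) -> List[str]:
    """Run the four steps in order and return the event log."""
    server.suspend()                       # step 1: suspend operating systems
    try:
        if stage_for_bios:                 # step 2: apply or stage the update
            server.stage(new_level)
        else:
            server.write_firmware(new_level)
        if needs_reset:                    # step 3: reset only if necessary
            server.reset_subsystem()
    finally:
        server.resume()                    # step 4: resume (assumed even on error)
    return server.log


if __name__ == "__main__":
    server = SimulatedServer()
    print(pseudo_online_update(server, "2.0"))
    print(server.firmware, server.os_state)
```

In this sketch the operating systems are never shut down: the server moves from "running" to "suspended" and back, which mirrors the claim that the server is off-line only momentarily and that no state is lost.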