A Replication Engine for providing a durable memory system which will withstand hardware and software failures
Original Publication Date: 2000-Nov-01
Included in the Prior Art Database: 2003-Jun-20
Disclosed is a Replication Engine for providing a durable memory system which will withstand hardware and software failures.
A network of interconnected computer systems has many advantages, one of the key ones being an inherent ability to provide continuous availability. A user can be connected to one or more computer systems; these are the user’s primary systems. When a primary computer system fails for any reason, hardware or software, its resources are no longer available to the user. In a network of interconnected computer systems, however, users connected to a failed primary system can be switched to a surviving, or secondary, system in the network.
Establishing the user environment and applications on the secondary computer system takes time. This latency is known as the switch-over time, and during it neither the primary nor the secondary computer system is available to the user. Several elements contribute to the switch-over time, including detecting that the primary system has indeed failed, the disk access latency of establishing the user’s working set on the secondary system, and user space initialization on the secondary system. Of these, disk access latency may be the largest contributor, because disk latencies are much greater than CPU processing times.
This invention provides a mechanism to completely eliminate disk access latency from the switch-over time. It does so by providing a memory system that will a) not fail and b) contain the user’s working set. Numerous options are available for designing memory systems with very low failure rates. An ultra-low failure rate can be achieved by replicating either a slice of the primary computer system’s memory or its entire memory in a remote node, and maintaining the user’s working set in that remote node.
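The switch-over argument can be summarized with illustrative notation that does not appear in the original disclosure. Considering only the three elements named above, $T_{\text{detect}}$, $T_{\text{disk}}$, and $T_{\text{init}}$ stand for failure detection, disk access, and user space initialization. The second line assumes the working set is already resident in the replicated memory, so no disk access is needed; the third assumes, purely for illustration, that the primary and the remote node fail independently with probabilities $p_1$ and $p_2$ over the same interval, so the working set is lost only if both copies fail.

```latex
\begin{align}
T_{\text{switch}}  &= T_{\text{detect}} + T_{\text{disk}} + T_{\text{init}} \\
T'_{\text{switch}} &= T_{\text{detect}} + T_{\text{init}}
  && \text{(working set already in replicated memory, so } T_{\text{disk}} = 0\text{)} \\
P_{\text{loss}}    &= p_1 \, p_2
  && \text{(independent failures assumed)}
\end{align}
```

With hypothetical values $p_1 = p_2 = 10^{-3}$, the last line gives $P_{\text{loss}} = 10^{-6}$, which illustrates why replication alone can yield a very low failure rate under the independence assumption.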
With this dual memory system, a variant of a Remote Direct Memory Access (R-DMA) engine, known as a Replication Engine, is required. The Replication Engine maintains an exact replicated image of the primary computer system’s memory in the secondary computer system.
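The following Python sketch illustrates the Replication Engine's core duty as described above: every store into the replicated region of the primary's memory is also applied to the secondary, so the secondary already holds the working set when switch-over begins. It is a minimal in-process model, not an R-DMA implementation. The names `MemoryNode`, `ReplicationEngine`, `write`, and `failover` are hypothetical, the remote link is modeled as a direct method call, and replication is assumed to be synchronous (a write completes only after both copies are updated). The hypothetical `base` and `length` parameters select the replicated slice; by default the whole memory is mirrored.

```python
from typing import Optional


class MemoryNode:
    """Hypothetical model of one computer system's memory as a flat byte array."""

    def __init__(self, name: str, size: int) -> None:
        self.name = name
        self.mem = bytearray(size)
        self.alive = True

    def apply(self, addr: int, data: bytes) -> None:
        """Store data at addr; a failed node accepts no further writes."""
        if not self.alive:
            raise RuntimeError(f"{self.name} has failed")
        if addr < 0 or addr + len(data) > len(self.mem):
            raise IndexError("write outside node memory")
        self.mem[addr:addr + len(data)] = data


class ReplicationEngine:
    """Hypothetical engine mirroring primary stores onto a remote replica.

    Assumption: replication is synchronous, so write() returns only after
    both the primary and the secondary have applied the store.
    """

    def __init__(self, primary: MemoryNode, secondary: MemoryNode,
                 base: int = 0, length: Optional[int] = None) -> None:
        self.primary = primary
        self.secondary = secondary
        # Replicated slice is [base, base + length); default is all memory.
        self.base = base
        self.length = len(primary.mem) - base if length is None else length

    def write(self, addr: int, data: bytes) -> None:
        """Apply a store locally, then forward the part inside the slice."""
        self.primary.apply(addr, data)
        lo = max(addr, self.base)
        hi = min(addr + len(data), self.base + self.length)
        if lo < hi:  # the store overlaps the replicated slice
            self.secondary.apply(lo, data[lo - addr:hi - addr])

    def failover(self) -> bytes:
        """Mark the primary failed and return the working set held remotely."""
        self.primary.alive = False
        return bytes(self.secondary.mem[self.base:self.base + self.length])


primary = MemoryNode("primary", 64)
secondary = MemoryNode("secondary", 64)
engine = ReplicationEngine(primary, secondary, base=0, length=32)

engine.write(0, b"user working set")  # inside the replicated slice
engine.write(40, b"scratch")          # outside it: primary only

working_set = engine.failover()
print(working_set[:16])  # b'user working set'
```

Making writes synchronous keeps the replica an exact image at every completed write, at the cost of adding the remote-link latency to each store; an asynchronous variant would reduce that cost but could lose the most recent writes if the primary fails before they are forwarded.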