Using a Fake Lane Kill to Simplify RAS and Reduce Size on a High Speed Serial Interface
Original Publication Date: 2009-Nov-24
Included in the Prior Art Database: 2009-Nov-24
Described is a method of using a fake lane kill to simplify RAS and reduce size on a high-speed serial interface.
RAS (reliability, availability, and serviceability) has become one of the most important design considerations in many ultra-high-speed interfaces. RAS recovery, however, can get very complex because many different scenarios can occur. Let's take a look at using a high-speed interface to connect to a DDR3 memory controller. In particular, let's look at Intel*'s SMI interface. In theory, the interface can see four types of errors:
- Northbound (detected by the memory controller)
  - lane kill
  - single or burst of CRC errors
- Southbound (detected by the DDR3 controller)
  - lane kill
  - single or burst of CRC errors
The memory controller is the master of this bus, and the DDR3 controller is the slave. If there is an error Northbound (to the memory controller), the DDR3 controller does not need to take action, and the memory controller is free to do any recovery it wants, including asserting fail, retrying the command, poisoning the data, etc.
If, however, the error is detected on the Southbound interface, this is a problem since the master doesn't know about it. The SMI architecture defines the error handling such that the Southbound device signals an error status to the Northbound device. However, to prevent an ordering problem, once the Southbound device sees an error, it drops all subsequent commands. So, those commands need to be reissued by the master. However, what if BOTH the Northbound and Southbound links have errors at the same time?
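The drop-and-reissue behavior described above can be sketched in a few lines. This is only an illustrative model, not the SMI protocol itself: the class `SouthboundDevice`, the `clear_error` step, and the `run_with_replay` helper are hypothetical names, and the sketch assumes the error status reaches the master intact.

```python
from dataclasses import dataclass, field


@dataclass
class SouthboundDevice:
    """Hypothetical slave model: once an error is seen, later commands are dropped."""
    error_seen: bool = False
    executed: list = field(default_factory=list)
    dropped: list = field(default_factory=list)

    def receive(self, command: str, crc_ok: bool = True) -> str:
        # A corrupted command latches the error state.
        if not crc_ok:
            self.error_seen = True
        if self.error_seen:
            # The failing command and everything after it are discarded,
            # so nothing executes out of order.
            self.dropped.append(command)
            return "ERROR"
        self.executed.append(command)
        return "OK"

    def clear_error(self) -> None:
        # Assumed recovery step performed by the master before replay.
        self.error_seen = False


def run_with_replay(device, commands, bad_index):
    """Master side: send everything, then reissue from the first ERROR status."""
    statuses = [
        device.receive(cmd, crc_ok=(i != bad_index))
        for i, cmd in enumerate(commands)
    ]
    if "ERROR" in statuses:
        first_bad = statuses.index("ERROR")
        device.clear_error()
        for cmd in commands[first_bad:]:
            device.receive(cmd)
    return device.executed
```

In this sketch the replay only works because the master sees the ERROR status; the dual-error case raised above is exactly the situation where that assumption fails.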
First let's look at the difference between a lane kill and a CRC error, from the northbound side. If a Southbound lane failed, the north memory controller would see a constant error status that it couldn't clear. If a Northbound lane failed, the north memory controller would see a constant CRC error.
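As a rough illustration of this distinction, the following sketch classifies what the memory controller observes over a window of recent frames. The `Frame` enum, the `diagnose` function, and the window size of 4 are assumptions made for the example; they are not taken from the SMI specification.

```python
from enum import Enum


class Frame(Enum):
    OK = "ok"
    CRC_ERROR = "crc_error"        # northbound frame failed its CRC check
    ERROR_STATUS = "error_status"  # southbound device reported an error


def diagnose(frames, window=4):
    """Guess the fault type from the most recent `window` observations."""
    recent = frames[-window:]
    if len(recent) < window:
        return "undetermined"
    if all(f is Frame.CRC_ERROR for f in recent):
        # Constant CRC errors point at the northbound link.
        return "suspect northbound lane kill"
    if all(f is Frame.ERROR_STATUS for f in recent):
        # An error status that never clears points at the southbound link.
        return "suspect southbound lane kill"
    return "transient or no fault"
```

A persistent symptom is only a hint of a lane kill, which is why the result strings say "suspect" rather than asserting a diagnosis.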
Typically, the constant CRC error case is handled by retraining the link after a number of CRC errors are received, which indicates that a sync command has been missed. Note that the link is retrained even without knowing for sure whether there was a lane kill. Since the SI (signal integrity) properties of the link are degrading, it's beneficial to re-center the link to a stable state.
The sequence of events that could cause a dual lane kill is as follows:
1) A Northbound lane kill occurs, causing CRC errors
2) Then, before the back-to-back CRC threshold initiates a link retrain, a southbound lane kill occurs
3) The Southbound device attempts to send an error status, but this gets corrupted going Northbound
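The timing window in the sequence above can be sketched as a small simulation. The function `simulate`, its parameters, and the threshold of 4 are hypothetical; the sketch assumes the northbound lane fails at frame 0, that every later northbound frame fails CRC, and that any error status sent over the failed link is lost.

```python
def simulate(frames_until_south_kill, crc_threshold=4, total_frames=10):
    """Return (retrain_frame, status_lost) for a northbound kill at frame 0.

    retrain_frame is the frame at which the back-to-back CRC threshold
    triggers a link retrain (None if it is never reached), and status_lost
    says whether a southbound error status was corrupted before that point.
    """
    consecutive_crc = 0
    status_lost = False
    for frame in range(total_frames):
        # Northbound lane is dead, so every northbound frame fails CRC.
        consecutive_crc += 1
        if frame >= frames_until_south_kill:
            # The southbound device reports its error, but the report
            # travels over the failed northbound link and is corrupted.
            status_lost = True
        if consecutive_crc >= crc_threshold:
            return frame, status_lost
    return None, status_lost
```

For example, `simulate(2)` models a southbound kill that lands inside the window before the retrain, while `simulate(6)` models one that would only occur after the link has already been retrained.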
The problem here is that a dual lane kill can cause a data integrity exposure if it's not dealt with. With the 'normal' solution, a data integrity error is prevented as follows: once the threshold of CRC errors is reached, the memory controller retrains the link and resends all the commands that potentially didn't reach the DIMMs. However, this is very expensive. Read commands get a positive acknowledgement that they reached the DIMMs once data is returned to the host. However,...