System and Method for Data-Driven Diagnostics and Testing Behavior in a CEC System.
Publication Date: 2015-Aug-26
The IP.com Prior Art Database
Disclosed is a method in which one or more human- and machine-readable files are used to document, review, and change the code and tests for the hardware system error registers and the RAS (Reliability, Availability, Serviceability) behavior needed at all levels of development and testing. The file or files are consumed by a diagnostics code framework in order to perform the behavior needed, and are also used for unit test, function test, hardware/firmware integration test, and service testing, in manual or automated test cases.
The design, development, and implementation of computer chips for utilization in data processing systems require firmware diagnostics behavior to be documented, implemented, and tested at a number of different levels (unit test, integration test, system test, service test, etc.). Hardware (HW) designers architect a system to have specific failure registers with specific fault bits. Hardware designers and firmware (FW) designers work together to identify the appropriate handling to be performed by firmware diagnostics for each specific fault.
FW diagnostics focuses on failure analysis of the system using registers and fault bits. These failure registers and fault bits are reviewed for a specific Reliability, Availability, Serviceability (RAS) behavior. All failure registers and fault bits must be reviewed (A1 - HW, FW, RAS architect) with the HW designers and the firmware developers in order to write code that implements the RAS behavior needed. RAS behaviors can be:
- Recoverable errors : An error that either self corrects or is firmware corrected and does not impact the system. These errors can have a threshold or count before changing RAS behavior.
RAS behavior change can be : 1) Runtime remove resources, 2) Switch failed resource for Redundant, 3) Spare in unused resources, 4) Remove on next IPL.
- Unrecoverable errors : An error that stops a critical part of the system, on RAS behavior due to sudden loss or hang of chips or sub-units on chips.
RAS behavior change can be : 1) Runtime remove resources, 2) Switch failed resource for Redundant, 3) Spare in unused resources, 4) Remove on next IPL.
- Service actions or call-outs given a specific error or failure.
This activity results in some form of documentation describing the RAS behavior.
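The RAS behaviors listed above can be captured as structured data rather than free-form text. The following is a minimal sketch only, assuming one entry per fault bit; the class names, field names, and the register name in the usage example are hypothetical and do not come from the disclosure. It models the two error classes, the four behavior changes, and the threshold that a recoverable error may have before the RAS behavior changes.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Severity(Enum):
    RECOVERABLE = "recoverable"
    UNRECOVERABLE = "unrecoverable"


class Action(Enum):
    # The four RAS behavior changes named in the text.
    RUNTIME_REMOVE = "runtime remove resources"
    SWITCH_TO_REDUNDANT = "switch failed resource for redundant"
    SPARE_IN_UNUSED = "spare in unused resources"
    REMOVE_ON_NEXT_IPL = "remove on next IPL"


@dataclass
class FaultBitEntry:
    """One documented fault bit (all field names are illustrative)."""
    register: str
    bit: int
    severity: Severity
    action: Action
    threshold: int = 1                      # occurrences before the action applies
    callouts: List[str] = field(default_factory=list)


@dataclass
class FaultTracker:
    """Counts occurrences of one fault bit and reports when to act."""
    entry: FaultBitEntry
    count: int = 0

    def record(self) -> Optional[Action]:
        """Count one occurrence; return the action once it should be taken."""
        self.count += 1
        # Assumption: an unrecoverable error triggers its action immediately.
        if self.entry.severity is Severity.UNRECOVERABLE:
            return self.entry.action
        # A recoverable error only changes behavior at its threshold.
        if self.count >= self.entry.threshold:
            return self.entry.action
        return None
```

For example, a recoverable entry with `threshold=3` returns `None` from `record()` on the first two occurrences and the configured action on the third.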
Firmware diagnostics code is developed adhering to the RAS documentation. Each specific failure bit documented is also represented in the firmware implementation along with the appropriate error handling actions. Unit tests (A2) are executed in a simulation environment, and the results are compared to the RAS documentation in order to validate the firmware implementation.
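The comparison of unit test results against the RAS documentation might look like the sketch below. It is illustrative only: the table contents, the register names, the action strings, and the `simulated_firmware_action` stand-in for the firmware running in simulation are all assumptions, not part of the disclosure.

```python
# Documented RAS behavior keyed by (register, bit). All values are hypothetical.
RAS_DOC = {
    ("CORE_FIR", 0): "recover",
    ("CORE_FIR", 1): "remove_on_next_ipl",
    ("MEM_FIR", 4): "spare_in_unused",
}


def simulated_firmware_action(register, bit):
    """Stand-in for the firmware under test running in a simulation."""
    implemented = {
        ("CORE_FIR", 0): "recover",
        ("CORE_FIR", 1): "runtime_remove",  # deliberate mismatch for the example
        ("MEM_FIR", 4): "spare_in_unused",
    }
    return implemented.get((register, bit))


def find_mismatches(doc, run):
    """Return (register, bit, expected, actual) for every documented fault bit
    whose observed action differs from the documentation."""
    mismatches = []
    for (register, bit), expected in sorted(doc.items()):
        actual = run(register, bit)
        if actual != expected:
            mismatches.append((register, bit, expected, actual))
    return mismatches


if __name__ == "__main__":
    for register, bit, expected, actual in find_mismatches(
        RAS_DOC, simulated_firmware_action
    ):
        print(f"{register} bit {bit}: expected {expected}, got {actual}")
```

Driving the check from the documented table means a fault bit that is documented but missing from the implementation also shows up as a mismatch, because the stand-in returns `None` for it.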
When the RAS behavior code is integrated into a driver, a Hardware and Firmware integration test is run. This test will take a more end-to-end validation approach, so a RAS behavior review (A3 - HW, FW, Tester) is needed. Error injects on hardware would be done either by special internal procedures that cause an error, or with an external error inject, like an open or closed circuit, to cause a fault.
When the system goes to Service Validation testing, there are more tests that focus on validating the system service. This testing focuses on whether, given an error (from a failure register bit), the correct RAS behavior happens and the service call-out for parts or procedures is correct. This requires another review (A4 - RAS architect, HW, FW, Xxxxxxx) to make sure.
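A service validation check of this kind could be sketched as follows: decode which fault bits are set in a failure register value, then confirm that both the observed RAS behavior and the service call-outs match what is expected. Everything named here is an assumption for illustration, including the register name, the part names, the `EXPECTED` table, and the choice to number bits from the least significant bit.

```python
from typing import Dict, List, NamedTuple, Set, Tuple


class Expected(NamedTuple):
    behavior: str
    callouts: Set[str]


# Hypothetical expected results keyed by (register, bit).
EXPECTED: Dict[Tuple[str, int], Expected] = {
    ("MEM_FIR", 2): Expected("spare_in_unused", {"DIMM_A"}),
    ("MEM_FIR", 5): Expected("remove_on_next_ipl", {"DIMM_A", "MEM_CTRL"}),
}


def set_bits(value: int) -> List[int]:
    """Return positions of set bits, lowest first (bit 0 = least significant)."""
    return [i for i in range(value.bit_length()) if (value >> i) & 1]


def validate_service(
    register: str, value: int, observed: Dict[int, Expected]
) -> List[str]:
    """Compare observed behavior and call-outs with EXPECTED for each set bit."""
    problems = []
    for bit in set_bits(value):
        want = EXPECTED.get((register, bit))
        got = observed.get(bit)
        if want is None:
            problems.append(f"{register}[{bit}]: not documented")
        elif got is None:
            problems.append(f"{register}[{bit}]: no observed result")
        elif got.behavior != want.behavior:
            problems.append(f"{register}[{bit}]: behavior differs")
        elif got.callouts != want.callouts:
            problems.append(f"{register}[{bit}]: callouts differ")
    return problems
```

Call-outs are compared as sets so that the order in which parts are reported does not cause a false failure; if call-out priority matters on a given system, an ordered comparison would be needed instead.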
*Translation of detailed RAS documentati...