Error Detection/Reconfiguration in Fault-Tolerant Communication Subsystem

IP.com Disclosure Number: IPCOM000036746D
Original Publication Date: 1989-Oct-01
Included in the Prior Art Database: 2005-Jan-29
Document File: 3 page(s) / 46K

Publishing Venue

IBM

Related People

Basso, C: AUTHOR [+4]

Abstract

A method is described for detecting, in a Communication System architecture with distributed microprocessors and shared memory, the fault of a given microprocessor by all the others. The shared storage is structured in pairs of banks. Also described is how to guarantee that all the microprocessors use the same side of a pair of banks in case of bank fault or bank access path fault.

This is the abbreviated version, containing approximately 53% of the total text.

The Communication System under consideration consists of Line Interfaces (LIs) that share a Packet Memory (PM). The LIs are microprocessor-based and connect lines and protocols of various types. The PM allows the LIs to communicate with one another and holds common data and programs.

The objective is for the Communication System to be fault-tolerant: a single hardware fault should not disrupt established sessions, and the system should reconfigure automatically within about 1 second. This is achieved by two methods: PM fault tolerance and LI fault tolerance.

A - PM fault tolerance: The PM consists of a set of banks organized in pairs. Each pair has one bank on side A and one bank on side B, and each pair is independent of the other pairs. A bussing system provides two independent paths between a given LI and the two sides of a pair.

A fault-tolerant PM object has two instances, one in each bank side of a pair. When an LI issues a write operation, it establishes two paths, writes the record into one bank, and then writes the record into the other bank. For a read operation, the PM side within the pair is selected at random, since both sides contain the same information.
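The write and read rules above can be illustrated with a minimal sketch. It is not the disclosed implementation: the class name BankPair, the dictionary-based banks, and the key names are all hypothetical, and bank faults and path setup are not modeled.

```python
import random


class BankPair:
    """Hypothetical model of one PM bank pair (side A and side B)."""

    def __init__(self):
        # Each side is modeled as a plain dict; real banks are hardware.
        self.sides = {"A": {}, "B": {}}

    def write(self, key, record):
        # Write the record into one bank, then into the other,
        # so both instances of the object stay identical.
        for side in ("A", "B"):
            self.sides[side][key] = record

    def read(self, key, rng=random):
        # Both sides hold the same information, so the side
        # that serves the read is picked at random.
        side = rng.choice(("A", "B"))
        return self.sides[side][key]


pair = BankPair()
pair.write("session-7", b"state")
record = pair.read("session-7")  # same record whichever side is chosen
```

Because every write reaches both sides before it completes, a read returns the same record regardless of the side chosen, which is what lets an LI keep working from the surviving side after a single bank fault.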

B - LI fault tolerance: LIs are also duplicated, and a given line enters two LIs. One is active and controls the lines; the other is the back-up and has its interface to the lines disabled. The back-up has been initialized so that it can take over if the active one fails. In addition, back-up software processes for the software processes active in a given LI are distributed among LIs other than the back-up LI itself.

1. Detection and reconfiguration for LI faults

If an unrecoverable LI fault is detected, the LI is stopped, which initiates distributed selection and reconfiguration. The distributed detection technique (Fig. 1) is as follows: The microprocessor of each operational LI issues, via a monitoring task, a special PM command called 'G...