Knowledge Base Structure for Fault Management

IP.com Disclosure Number: IPCOM000113210D
Original Publication Date: 1994-Jul-01
Included in the Prior Art Database: 2005-Mar-27
Document File: 6 page(s) / 222K

Publishing Venue

IBM

Related People

Huang, Y: AUTHOR [+4]

Abstract

Disclosed is the structure of a knowledge base (entities and relations) which is to be used by an expert system to correlate error messages, and to determine the causes of these messages in a distributed system environment.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 24% of the total text.

      The rapid growth in the size of Local Area Networks (LANs)
poses new management challenges for the system administrators of
such networks.  Consider a LAN consisting of many hardware
components of different brands, and of thousands of client and
server stations running different software products.  In such an
environment, ordinary system management tasks such as software
version control, system configuration control (both software and
hardware), and fault management become extremely complex.  This
invention disclosure describes a method for simplifying the fault
management of such large LANs.

The following are the basic terms pertaining to the tasks of fault
management:

o   Fault - A fault is the failure of a hardware component or a
    software component to perform its function.

o   Cause - A cause is the reason for the occurrence of a fault.

o   Error Message - An error message is a notification emitted by a
    software component when it detects a fault.

      For example, if a disk crashes, the fault may be the inability
to read data from the disk, the cause is the disk crash, and the
error message may be "unable to open a file system", emitted by the
file management component of the operating system.
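
      The three terms and the disk example can be written down as a
small data model.  This is only an illustrative sketch: the class and
field names are assumptions made for this example and are not taken
from the disclosure.  The links from message to fault to cause are
filled in by hand here; in practice they are what the administrator
has to infer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cause:
    """The reason for the occurrence of a fault."""
    description: str


@dataclass(frozen=True)
class Fault:
    """Failure of a hardware or software component to perform its function."""
    component: str
    description: str
    cause: Cause


@dataclass(frozen=True)
class ErrorMessage:
    """Notification emitted by a software component when it detects a fault."""
    emitted_by: str
    text: str
    fault: Fault


# The disk example from the text, expressed with the hypothetical classes above.
disk_crash = Cause("disk crash")
read_failure = Fault("disk", "unable to read data from the disk", disk_crash)
message = ErrorMessage(
    emitted_by="file management component of the operating system",
    text="unable to open a file system",
    fault=read_failure,
)

# Walk from the observed message back to the underlying cause.
print(message.fault.cause.description)  # prints "disk crash"
```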

      A software fault is normally a bug in some software service.
It may be detected when the detecting software invokes the service
through an Application Program Interface (API).

      In principle, the LAN administrator manages faults by
inspecting the error messages.  Based on these messages, the
administrator determines the causes of the faults, and corrects or
logs the corresponding faults.

      Assume a large LAN as described above.  Assume that each
hardware and software component on the LAN may fail.  Assume that
when such a failure is detected by some software module, an error
message is emitted and sent to some designated station where the LAN
administrator may inspect it.  Based on the inspected messages, the
administrator performs "fault management" as described above.
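
      The collection step assumed above can be sketched as follows.
This is a minimal illustration, not the mechanism of the disclosure:
the class names, the station names, and every message text except
"unable to open a file system" are made up for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedMessage:
    station: str    # station on which the detecting module runs (assumed field)
    component: str  # software module that detected the failure
    text: str       # text of the emitted error message


class AdminStation:
    """Designated station to which all error messages on the LAN are sent."""

    def __init__(self):
        self.inbox = []

    def receive(self, message):
        """Store an incoming message for later inspection."""
        self.inbox.append(message)

    def by_station(self):
        """Group the collected message texts by originating station."""
        grouped = defaultdict(list)
        for message in self.inbox:
            grouped[message.station].append(message.text)
        return dict(grouped)


admin = AdminStation()
# Hypothetical traffic: station names and the last two texts are invented.
admin.receive(ReportedMessage("server-1", "file manager",
                              "unable to open a file system"))
admin.receive(ReportedMessage("client-7", "network redirector",
                              "remote drive not available"))
admin.receive(ReportedMessage("server-1", "backup service",
                              "backup aborted"))

for station, texts in admin.by_station().items():
    print(station, texts)
```

      Grouping by station is only one possible view for the
administrator; it does not by itself relate messages that stem from
the same fault.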

In this scenario two facts have to be noted:

1.  Due to the distributed nature of the LAN, different software
    components, possibly in different stations, may detect and report
    the same fault.  However, each such software component may be
    exposed to a different aspect of the fault.  Thus, each will emit
    a different error message, and no single error message by itself
    will fully describe the real nature of the failure cause.

2.  Even in a single station, different software components may
    observe and report the same fault.  For the same reasons
    descr...