A method for Managing Node Failures in a Storage System with Multiple Paths from Clients to Data

IP.com Disclosure Number: IPCOM000012376D
Original Publication Date: 2003-May-01
Included in the Prior Art Database: 2003-May-01
Document File: 2 page(s) / 42K

Publishing Venue

IBM

Abstract

A method and program product are disclosed for policy-based optimization of "shared nothing" multi-node file servers. Multi-node file servers, e.g., a NAS with several computational engines capable of serving files to clients, are often structured as "shared nothing" systems: each filesystem (or volume, if block services are provided) is available at any given time through at most one node. When that node fails, the filesystems it serves become unavailable until the multi-node file server finds a node that can pick up file-serving duties (the "failover node"). The invention solves the problem of finding that node in a way that balances unavailability against performance and the possibility of secondary failures due to increased computational load (stress) on the failover node(s).

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 54% of the total text.

The invention comprises:

1. a method for detecting failures, combined with
2. a method for classifying failures, combined with
3. a method for taking action based on the classification.
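
The three components above can be sketched as a small detect, classify, act pipeline. This is a minimal illustration only, not the disclosed implementation: the `NodeStatus` fields, the function names, and in particular the actions in the `POLICY` table are assumptions made for the example, since the available text does not state which action is taken for each class. The three failure classes are the ones the disclosure uses for exposition.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class FailureClass(Enum):
    """The three expository failure classes named in the disclosure."""
    HARDWARE = auto()
    SOFTWARE_REBOOT = auto()             # repairable simply by rebooting
    SOFTWARE_REBOOT_AND_REPAIR = auto()  # reboot plus a possibly slow repair


@dataclass
class NodeStatus:
    """Hypothetical health snapshot of one node (fields are assumptions)."""
    name: str
    responsive: bool
    hardware_ok: bool = True
    needs_repair: bool = False


def detect_failure(status: NodeStatus) -> bool:
    """Step 1: detect. Here a node has failed if it stops responding."""
    return not status.responsive


def classify_failure(status: NodeStatus) -> FailureClass:
    """Step 2: classify the detected failure into one of the three classes."""
    if not status.hardware_ok:
        return FailureClass.HARDWARE
    if status.needs_repair:
        return FailureClass.SOFTWARE_REBOOT_AND_REPAIR
    return FailureClass.SOFTWARE_REBOOT


# Step 3: act. This policy table is an assumption for illustration; the
# source text does not specify the action chosen for each class.
POLICY = {
    FailureClass.HARDWARE: "move filesystems to a failover node",
    FailureClass.SOFTWARE_REBOOT: "reboot the node and wait for it to return",
    FailureClass.SOFTWARE_REBOOT_AND_REPAIR: "move filesystems, then repair the node",
}


def handle(status: NodeStatus) -> Optional[str]:
    """Run detect -> classify -> act; return None when nothing has failed."""
    if not detect_failure(status):
        return None
    return POLICY[classify_failure(status)]


if __name__ == "__main__":
    print(handle(NodeStatus("A", responsive=False, hardware_ok=False)))
    print(handle(NodeStatus("B", responsive=False)))
    print(handle(NodeStatus("C", responsive=True)))
```

Keeping the policy in a table separate from detection and classification means the action chosen for a class can be changed without touching the other two steps, which is one way to read "policy-based" in the abstract.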

For expository purposes, the invention is explained with the aid of three failure classes: 1) hardware failures; 2) software failures repairable simply by rebooting; and 3) software failures repairable by rebooting followed by a possibly time-consuming repair procedure. The example system has three nodes serving six filesystems, as follows:

Node A: serves filesystems 1 and 2
Node B: serves filesystems 3 and 4
Node C: se...