
Method for cache directory integrity checking in distributed cache storage subsystems.

IP.com Disclosure Number: IPCOM000015904D
Original Publication Date: 2002-Aug-16
Included in the Prior Art Database: 2003-Jun-21
Document File: 3 page(s) / 46K

Publishing Venue

IBM

Abstract

Disclosed is a method for validating the integrity of a Fast Write Cache in a distributed storage subsystem.

This is the abbreviated version, containing approximately 39% of the total text.



Disclosed is a method for validating the integrity of a Fast Write Cache in a distributed storage subsystem.

The Problem

     This invention relates to storage subsystems which cache write data in a "fast write cache" and which have the property that two or more independent "nodes" within the storage subsystem each keep a copy of the cache in order to guarantee no single point of data loss.

     Fast write caches are well known in the industry, and most storage subsystems contain one. A fast write cache has the property that it returns good status to the using computer system before it has transferred the written data to underlying permanent storage such as disk drives; the written data is held in memory within the storage subsystem. Typically the memory is made non-volatile by some form of battery backup, so that data is not lost in the event of a power failure. If data is lost from the fast write cache, this will typically invalidate all data within open volumes in the storage subsystem. A commonly used architecture which guards against this catastrophic event is one in which two or more independent "nodes" within the storage subsystem each keep a copy of the cache. If one node is lost due to failure, another node still contains the written data and can destage it to disk.
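The dual-copy behaviour described above can be illustrated with a minimal sketch. This is not the disclosed design; the class and member names are hypothetical, the node caches are plain dictionaries standing in for battery-backed memory, and a dictionary stands in for the disk:

```python
# Illustrative sketch of a mirrored fast write cache: every write is
# copied into two independent node caches *before* good status is
# returned, so no single node failure can lose acknowledged data.

class Node:
    def __init__(self, name):
        self.name = name
        self.cache = {}  # block address -> data (battery-backed in reality)

class MirroredFastWriteCache:
    def __init__(self, node_a, node_b, disk):
        self.nodes = [node_a, node_b]
        self.disk = disk  # stands in for permanent storage

    def write(self, addr, data):
        # Mirror to every node first; the host sees the write complete
        # without waiting for any disk I/O.
        for node in self.nodes:
            node.cache[addr] = data
        return "good"

    def destage(self):
        # Later, in the background, dirty data is written to disk and
        # the cache entries are released on every node.
        for addr, data in list(self.nodes[0].cache.items()):
            self.disk[addr] = data
            for node in self.nodes:
                node.cache.pop(addr, None)

disk = {}
a, b = Node("A"), Node("B")
fwc = MirroredFastWriteCache(a, b, disk)
status = fwc.write(0x10, b"payload")
# At this point both nodes hold the data and the disk has seen nothing.
```

After `destage()` runs, the data reaches the disk and both cache copies are released; until then, either node alone can recover the write.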

     Whilst the arrangement described is quite a simple architectural concept, the implementation is complicated by the fact that storage subsystems must handle many thousands of concurrent operations and a very large number of possible scenarios, both in the pattern of reads and writes and in error recovery.

     One common error recovery scenario is often termed failover. Simply stated, failover is the process which occurs when one node fails and the remaining node(s) take over its workload and/or stored data.

     A related scenario is failback, the process by which a previously offline node is brought back into the subsystem and assigned workload and/or data to manage.
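The failover and failback steps above can be sketched minimally. The function names are hypothetical and each node's cache copy is modelled as a plain dictionary; the essential points are that the surviving node can destage the mirrored data on its own, and that a returning node must resynchronise its copy before taking on work:

```python
# Hedged sketch: failover destages the surviving node's mirror copy so
# no acknowledged write is lost; failback rebuilds the returning node's
# cache from the survivor so both views match again.

def failover(surviving_cache, disk):
    # Destage every dirty entry held by the surviving node.
    for addr, data in surviving_cache.items():
        disk[addr] = data
    surviving_cache.clear()

def failback(returning_cache, surviving_cache):
    # Resynchronise the returning node's copy before it is assigned
    # workload, so both nodes again hold identical directories.
    returning_cache.clear()
    returning_cache.update(surviving_cache)

disk = {}
node_a = {0x10: b"dirty"}  # node A's mirror copy of unwritten data
failover(node_a, disk)     # node B has failed; A destages alone
```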

     It is critically important that the data logically stored in the subsystem as a whole does not change as a result of failover or failback. If it did, the using computer systems could potentially read back data different from that which was written, resulting in catastrophic data loss or corruption.

     In order for the data stored in the subsystem not to change during failover, it is critical that all nodes in the storage subsystem have a consistent view of the data at all times.

     Ensuring the correctness of the design of such storage subsystems is a matter of good design and implementation practice. One such practice is that of defensive design. It is good defensive design practice to ensure that critical systems are largely self-checking, or have a self-checking mode which can be activated during development....
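The disclosure's actual checking method is not shown in this abbreviated text. As one hedged illustration of such a self-checking mode, each node could keep a checksum per cache directory entry, and the nodes could periodically compare directories and flag any divergence before inconsistent data reaches the host. The function names here are hypothetical:

```python
# Illustrative directory integrity check: compute a CRC per cache
# directory entry on each node, then report every block address on
# which the two copies disagree (data differs, or entry missing).

import zlib

def directory_digest(cache):
    # Map each block address to a CRC32 of its cached data.
    return {addr: zlib.crc32(data) for addr, data in cache.items()}

def check_consistency(cache_a, cache_b):
    # Return the set of addresses where the two nodes' views diverge.
    dig_a, dig_b = directory_digest(cache_a), directory_digest(cache_b)
    return {addr for addr in dig_a.keys() | dig_b.keys()
            if dig_a.get(addr) != dig_b.get(addr)}

good = check_consistency({1: b"x"}, {1: b"x"})
bad = check_consistency({1: b"x"}, {1: b"y", 2: b"z"})
```

In a development self-checking mode, such a comparison could run after every failover or failback and assert that the divergence set is empty.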