
Computation Reliability as Part of Service Requirements

IP.com Disclosure Number: IPCOM000180190D
Original Publication Date: 2009-Mar-05
Included in the Prior Art Database: 2009-Mar-05
Document File: 4 page(s) / 98K

Publishing Venue

IBM

Abstract

Processors are not 100% reliable. Transient and permanent hardware errors do occur, and may have a significant impact in some application domains, e.g., engineering of large structures. The users of such applications must be protected from such errors as much as possible. For storage, confidence levels are sometimes specified: the authors of the Dynamo paper discuss maintaining several replicas of the data, some of which may not be up to date, in combination with a policy that dictates how many replicas have to be available, and how many have to be consistent, for information to be acceptable. However, we have not found Service Level Agreements (SLAs) or QoS (Quality of Service) definitions that include computations' confidence level.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 41% of the total text.



Until recently, methods for detecting transient and permanent hardware faults were a concern only for high-end reliable servers. But as transistor size is reduced by half every 18-24 months (Moore's law), general-purpose and low-end servers are expected to incorporate detection methods as well. Detection methods vary. Some implement TMR (Triple Modular Redundancy) or DMR (Dual Modular Redundancy), running the same process or thread more than once on at least two processor cores or on an SMT (Simultaneous Multithreading) processor. Some use temporal redundancy, replicating specific stages of the processor (e.g., the data path and the decode stage). Other detection methods have also been proposed, for example error prediction triggered when a rare scenario occurs, such as an access to a memory page that is not permitted.
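The redundancy schemes described above can be illustrated in software. The following is a minimal sketch, not a hardware implementation: the names `vote`, `run_redundant` and `RedundancyError` are hypothetical, and results are assumed to be hashable so they can be counted. With two replicas (DMR) a disagreement can only be detected; with three (TMR) a single faulty result is outvoted.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


class RedundancyError(Exception):
    """Raised when the replicas disagree and no strict majority exists."""


def vote(results: Sequence[Hashable]) -> Hashable:
    """Return the value produced by a strict majority of the replicas."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        # DMR mismatch, or a TMR run where all three results differ.
        raise RedundancyError(f"no majority among {list(results)!r}")
    return value


def run_redundant(compute: Callable[[], Hashable], replicas: int = 3) -> Hashable:
    """Run `compute` `replicas` times (sequentially here) and vote on the results."""
    return vote([compute() for _ in range(replicas)])
```

Running the replicas sequentially corresponds loosely to temporal redundancy; real DMR and TMR run them on separate cores or SMT threads, which is where the costs listed next come from.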

All detection methods have their associated costs (even when no errors occur). Most of the cost is in:
- wasted processor cores (DMR and TMR) or SMT threads (cutting throughput in half);
- interconnect bandwidth (DMR, TMR and software based methods);
- latency penalty - performance degradation (temporal redundancy, error prediction and recovery capabilities); and
- extra power consumption.

Consequently, the use of fault detection is expensive.

    It is wasteful for an enterprise to maintain an expensive computing system with fault detection if only a small part of the computations really require a very high confidence level. There seems to be a viable business opportunity for providing high confidence computation either by providing Software as a Service, or by providing computing clouds in which some compute engines provide high confidence results.

    Confidence levels are sometimes specified for storage. For example, in "Dynamo: Amazon's Highly Available Key-Value Store", by DeCandia G., Hastorun D., Jampani M., Kakulapati G., Lakshman A., Pilchin A., Sivasubramanian S., Vosshall P. and Vogels W. (SOSP '07: Proceedings of the Twenty-First ACM SIGOPS Symposium on Operating Systems Principles), the authors discuss maintaining several replicas of the data, some of which may not be up to date, in combination with a policy that dictates how many replicas have to be available, and how many have to be consistent, for information to be acceptable.
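A policy of this kind can be sketched as follows. This is a simplified illustration only, not Dynamo's actual protocol: `ReplicaPolicy`, `read_with_policy` and `PolicyNotMet` are hypothetical names, and an unavailable replica is modelled as `None`.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Hashable, Optional, Sequence


class PolicyNotMet(Exception):
    """Raised when too few replicas are available or too few agree."""


@dataclass(frozen=True)
class ReplicaPolicy:
    min_available: int   # replicas that must respond
    min_consistent: int  # responding replicas that must hold the same value


def read_with_policy(responses: Sequence[Optional[Hashable]],
                     policy: ReplicaPolicy) -> Hashable:
    """Return the agreed value, or raise PolicyNotMet if the policy fails."""
    available = [r for r in responses if r is not None]
    if len(available) < policy.min_available:
        raise PolicyNotMet("too few replicas available")
    value, count = Counter(available).most_common(1)[0]
    if count < policy.min_consistent:
        raise PolicyNotMet("too few replicas are consistent")
    return value
```

For example, with `ReplicaPolicy(min_available=2, min_consistent=2)`, the responses `["v2", "v2", "v1"]` are accepted as `"v2"` even though one replica is stale, while `["v2", None, "v1"]` is rejected.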

    However, we have not found Service Level Agreements (SLAs) or QoS (Quality of Service) definitions that include computations' confidence level.

    The core idea of this invention is to include the confidence level requirements in the interface exported by the service (such as in SOA (Service Oriented Architecture) and Grid computing).
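One way such an interface might look is sketched below. Everything here is an assumption made for illustration: the three confidence levels, the `Engine` record, the engine names and the cost figures are hypothetical and do not come from the disclosure. The caller states the confidence it requires, and the service picks the cheapest compute engine that satisfies it.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Sequence


class Confidence(IntEnum):
    """Hypothetical confidence levels a caller may require."""
    BEST_EFFORT = 0  # no fault detection
    DETECT = 1       # errors are detected (e.g. DMR)
    MASK = 2         # errors are detected and masked (e.g. TMR)


@dataclass(frozen=True)
class Engine:
    name: str
    confidence: Confidence
    cost_per_hour: float  # illustrative figure only


# Made-up pool of compute engines, as in a cloud offering mixed confidence.
ENGINES = (
    Engine("plain-node", Confidence.BEST_EFFORT, 1.0),
    Engine("dmr-node", Confidence.DETECT, 2.2),
    Engine("tmr-node", Confidence.MASK, 3.5),
)


def select_engine(engines: Sequence[Engine], required: Confidence) -> Engine:
    """Return the cheapest engine whose confidence meets the requirement."""
    eligible = [e for e in engines if e.confidence >= required]
    if not eligible:
        raise LookupError(f"no engine offers {required.name}")
    return min(eligible, key=lambda e: e.cost_per_hour)
```

Making the requirement part of the request lets only the computations that need high confidence pay for fault detection, which is the cost argument made earlier in the text.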

    This idea is further generalized to software design contracts. Currently, contracts between software interfaces are embodied using propositional logic (e.g., (counter >= 6)&&(timer < 0.3); see design by contract at http://en.wikipedia.org/wiki/Design_by_contract#Description for details). We extend the c...
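For reference, the propositional contract quoted above can be checked as a precondition along the following lines. This sketch covers only the existing style of contract, not the extension the text goes on to describe; the `requires` decorator and the `step` function are hypothetical.

```python
import functools


def requires(predicate, description):
    """Attach a precondition to a function; `predicate` sees the call arguments."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if not predicate(*args, **kwargs):
                raise AssertionError(f"precondition violated: {description}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@requires(lambda counter, timer: counter >= 6 and timer < 0.3,
          "(counter >= 6) && (timer < 0.3)")
def step(counter, timer):
    """Placeholder operation guarded by the contract."""
    return counter + 1
```

Calling `step(6, 0.1)` satisfies the contract, whereas `step(5, 0.1)` raises an `AssertionError` before the function body runs.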