
Tuning HA cluster on failover to maintain service level agreements.

IP.com Disclosure Number: IPCOM000243597D
Publication Date: 2015-Oct-05
Document File: 3 page(s) / 77K

Publishing Venue

The IP.com Prior Art Database

Abstract

A novel approach to maintaining service level agreements following a failure of one or more servers in a cluster. Workload balancing is commonly used in enterprise computing systems: workloads are distributed across multiple computing resources, aiming to optimize resource use and maximize throughput. Such systems can be simple, for example following a round-robin approach, or can use more complex algorithms. In such systems an up-front decision is made by a "load balancer" to determine an appropriate server to which a request should be sent. The invention described below aims to maintain an SLA via tuning adjustments and can work in harmony with load balancing systems.



Tuning HA cluster on failover to maintain service level agreements.

High Availability (HA) is a common business requirement in which near-continuous application availability is provided through both planned and unplanned outages. There are a number of factors which must be considered to achieve this goal. One such factor is eliminating single points of failure by adding redundancy and reliable transfer of workload to alternative systems. Maximizing usage of available hardware is also desirable, thus it is common to have multiple servers processing workload in a normal mode of operation; following failure of one (or more) server(s), the remaining server(s) support the operational workload whilst the failed server is brought back online.

    During normal operation it is important to tune the system to achieve optimal performance and also to ensure that any dependent systems are not overloaded. So, for example, consider the following simple configuration (a code sketch of this tuning follows the example):

    Server A and Server B support an HA configuration with an application deployed to each and workload shared in normal operation.

    The system has a dependency on a back-end system, e.g. a database.

    Server A and Server B have their workload tuned using a threadpool of 10 threads (20 in total) to enable concurrency and to meet SLAs, but also to limit workload to avoid overloading the back-end system.

    Server B goes down and Server A (assuming available capacity) is expected to pick up the additional workload. However, with Server A tuned for normal operation, i.e. with throughput restricted, it is unlikely that SLAs will be met.
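
    As a concrete illustration of the normal-mode tuning in this example, the following is a minimal sketch of a per-server fixed threadpool capped at 10 threads, written here in Java using java.util.concurrent. The class and method names (NormalModeTuning, handleRequest) are illustrative assumptions, not part of the original disclosure.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Minimal sketch of the normal-mode tuning described above: each server caps
// concurrent work at 10 threads, so the two servers together never drive more
// than 20 concurrent requests into the shared back-end system.
// Class and method names are illustrative, not taken from the disclosure.
public class NormalModeTuning {

    // Per-server pool size chosen to meet SLAs without overloading the back end.
    static final int NORMAL_POOL_SIZE = 10;

    private final ExecutorService workerPool =
            Executors.newFixedThreadPool(NORMAL_POOL_SIZE);

    // Incoming requests queue on the fixed pool, limiting concurrency against
    // the back-end system to the configured pool size.
    public void handleRequest(Runnable request) {
        workerPool.submit(request);
    }
}
```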

    In a scenario as described above, SLAs could still be achieved through modified tuning of the available hardware. Once failure of one (or more) server(s) in the system is detected, the remaining servers in the group are automatically tuned such that the total number of threads available to process incoming work is the same as when running in a normal mode of operation. So, in the above example, Server A would have the size of its threadpool increased to 20 threads so that the SLA can still be met. Back-end systems are unaffected; their total workload is unchanged following the server failure. In a more complex configuration, where multiple servers are available in a cluster and the available resources on each of these servers differ, negotiation takes place between the servers to determine available resource such that the total capacity is maintained.
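
    The retuning step might be realized along the following lines. This is a hedged sketch rather than the disclosed implementation: the FailoverTuner class, its method names, and the proportional split shown for the multi-server negotiation are assumptions layered on the description above. It resizes a ThreadPoolExecutor so that the cluster-wide thread total (and therefore the back-end load) is preserved, and restores the original size once the failed server returns.

```java
import java.util.Map;
import java.util.concurrent.ThreadPoolExecutor;

// Hypothetical failover tuner: when peers fail, the local pool is resized so
// that the cluster-wide thread total stays the same as in normal operation.
// Class and method names are illustrative, not taken from the disclosure.
public class FailoverTuner {

    private final ThreadPoolExecutor localPool;
    private final int normalLocalSize;   // e.g. 10 threads in normal operation
    private final int clusterTotal;      // e.g. 20 threads across the cluster

    public FailoverTuner(ThreadPoolExecutor localPool, int normalLocalSize, int clusterTotal) {
        this.localPool = localPool;
        this.normalLocalSize = normalLocalSize;
        this.clusterTotal = clusterTotal;
    }

    // Simple two-server case: the survivor absorbs the failed server's share,
    // e.g. 10 -> 20 threads, leaving total back-end concurrency unchanged.
    public void onPeerFailure(int failedPeerThreads) {
        resize(localPool.getMaximumPoolSize() + failedPeerThreads);
    }

    // Multi-server case: survivors negotiate and split the lost capacity in
    // proportion to their spare resources (values supplied by the cluster).
    public void onPeerFailure(int failedPeerThreads,
                              Map<String, Integer> spareCapacityByServer,
                              String localServerId) {
        int totalSpare = spareCapacityByServer.values().stream()
                .mapToInt(Integer::intValue).sum();
        int localShare = (int) Math.round(
                failedPeerThreads * (double) spareCapacityByServer.get(localServerId) / totalSpare);
        resize(localPool.getMaximumPoolSize() + localShare);
    }

    // When the failed server rejoins, revert to the original tuning.
    public void onPeerRecovered() {
        resize(normalLocalSize);
    }

    private void resize(int newSize) {
        int capped = Math.min(newSize, clusterTotal);  // never exceed the normal cluster total
        // Order of the two setters matters when growing vs. shrinking the pool.
        if (capped >= localPool.getCorePoolSize()) {
            localPool.setMaximumPoolSize(capped);
            localPool.setCorePoolSize(capped);
        } else {
            localPool.setCorePoolSize(capped);
            localPool.setMaximumPoolSize(capped);
        }
    }
}
```

    Capping the resized pool at the normal cluster total keeps the aggregate back-end workload unchanged, which is the property the tuning adjustment relies on.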

    Once the failed server(s) are brought back online, the tuning of the remaining servers within the group is returned to the original configuration.

    Detection of server failure is well understood. At its simplest, it could mean responding to a timeout event or implementing a heartbeat mechanism. In a simple scenario, servers can communicate directly with sibling servers to detect failures. In a more complex system, a 'cluster manager' can detect failure of cluster members. Complex HA products are also commercially available.
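
    For the heartbeat option mentioned above, a minimal detector might look like the following sketch. The timeout and scan interval values, the class name, and the callback wiring are assumptions, and the heartbeat transport between sibling servers is not shown.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Minimal heartbeat-based failure detector, assuming each sibling server
// periodically reports a heartbeat (transport not shown). A peer not heard
// from within the timeout is treated as failed, which can then trigger the
// threadpool retuning sketched earlier.
public class HeartbeatDetector {

    private static final Duration TIMEOUT = Duration.ofSeconds(15);   // assumed value
    private final Map<String, Instant> lastHeartbeat = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    // Called whenever a heartbeat message arrives from a sibling server.
    public void recordHeartbeat(String serverId) {
        lastHeartbeat.put(serverId, Instant.now());
    }

    // Periodically scan for peers whose heartbeat has expired.
    public void start(Consumer<String> onServerFailed) {
        scheduler.scheduleAtFixedRate(() -> {
            Instant now = Instant.now();
            lastHeartbeat.forEach((serverId, seen) -> {
                if (Duration.between(seen, now).compareTo(TIMEOUT) > 0) {
                    lastHeartbeat.remove(serverId);     // report each failure once
                    onServerFailed.accept(serverId);    // e.g. kick off the retuning step
                }
            });
        }, 5, 5, TimeUnit.SECONDS);
    }
}
```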

    Once a server failure has been detect...