Recovering a partition on a server outage using cluster aware Virtual Input-Output Servers.
Publication Date: 2016-Jun-02
The IP.com Prior Art Database
Disclosed is a mechanism for restarting, or performing disaster recovery of, a partition on another server when the hosting server goes down. In this mechanism, any configuration change to a partition is collected and stored in a Management Logical Unit (MLU) created in a cluster. A status is stored with the collected data to indicate whether the configuration data is up to date. The stored partition configuration data can be used by another management console or orchestrator to restart the partition. The MLUs can also store any other bookkeeping information related to the partitions, acting as centralized storage for partition data.
In a virtualized server environment, a host server may be managed out of band by more than one instance of management software in active-active mode, with partitions configured so that they can be restarted on another host server upon an outage of the source host server. IO virtualization is performed by a special IO Hypervisor, in IBM's case the Virtual Input/Output Server (VIOS), which is cluster aware (storage is shared between multiple systems). When the host server is down, a management console needs to orchestrate the restart operation, and the restart can be performed from any of the management software instances. In such a case, there are problems with storing and retrieving the configuration information and with coordinating between the management software instances to make sure that a partition is not restarted in two different places, which can cause data corruption. Also, if the data is persisted locally on one management software instance and that instance becomes inactive, restart upon host server outage might not be possible.
Consider another scenario in which a virtualized server environment is managed by in-band management software, meaning that the management software runs as, or in, a partition or virtual machine within the host server. In this case, on a host server outage, the recovery or restart of the partitions cannot be done by the management software managing the host server, since it was running within that server. So there is a need to store the configuration information, to retrieve it and provide it to another management console that recovers the partitions or virtual machines, and to coordinate with other management software.
The idea is to create a special Logical Unit (Management Logical Unit, MLU) within a Storage Cluster (Shared Storage Pool) that can be accessed by the IO Hypervisor on both the source system (where the outage happens) and the target system (where the partitions will be restarted). Configuration data for each partition is collected and stored in the MLU. Each partition's information is stored along with status information that indicates the restart status of that partition. Any orchestrator performing a restart of the partition reads the status and configuration information from the MLU of the cluster and sets or resets the status accordingly.
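The read-status, set-status and reset-status cycle described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the MLU is modeled as an in-memory dictionary, a local lock stands in for whatever cluster-wide serialization the shared storage pool provides, and all names (ManagementLogicalUnit, RestartStatus, claim_restart, finish_restart) and status values are hypothetical.

```python
import threading
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class RestartStatus(Enum):
    # Hypothetical status values; the disclosure only says a restart
    # status is kept per partition.
    IDLE = "idle"              # no restart in progress
    RESTARTING = "restarting"  # an orchestrator has claimed the restart
    RESTARTED = "restarted"    # restart completed on a target system


@dataclass
class PartitionRecord:
    config: dict
    status: RestartStatus = RestartStatus.IDLE
    owner: Optional[str] = None  # orchestrator currently holding the claim


class ManagementLogicalUnit:
    """In-memory stand-in for an MLU (assumption: a real MLU lives in
    the shared storage pool and is reached through the IO Hypervisor)."""

    def __init__(self):
        self._records = {}
        # Stand-in for cluster-wide serialization of MLU updates.
        self._lock = threading.Lock()

    def store_config(self, partition_id, config):
        """Store or replace the configuration data for a partition."""
        with self._lock:
            record = self._records.get(partition_id)
            if record is None:
                self._records[partition_id] = PartitionRecord(dict(config))
            else:
                record.config = dict(config)

    def claim_restart(self, partition_id, orchestrator):
        """Return the stored config if this orchestrator wins the claim,
        or None if the partition is unknown or already claimed."""
        with self._lock:
            record = self._records.get(partition_id)
            if record is None or record.status is not RestartStatus.IDLE:
                return None
            record.status = RestartStatus.RESTARTING
            record.owner = orchestrator
            return dict(record.config)

    def finish_restart(self, partition_id, orchestrator, success):
        """Set the status after a restart attempt; reset it on failure
        so that another orchestrator can retry."""
        with self._lock:
            record = self._records[partition_id]
            if record.owner != orchestrator:
                raise PermissionError("restart is owned by another orchestrator")
            record.status = (
                RestartStatus.RESTARTED if success else RestartStatus.IDLE
            )
            record.owner = None
```

Because the status check and the status update happen under one lock, only the first orchestrator to call claim_restart for a given partition receives the configuration; a second caller gets None and does not start the partition a second time. A failed attempt resets the status so that another management console can claim the restart.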
On any configuration change to a partition, the required information is collected and stored in the MLU created in the cluster. The status is updated according to whether the collected data is up to date. In the case of out-of-band management, each management software instance collects and stores data to the MLU via the IO Hypervisor (for instance, VIOS in the case of PowerVM) if the configuration change is triggered from that management software. A partition's configuration is identified by a combination of the Partition Unique Id and the System Model-Serial Num. In case of in-band manageme...