Assessing Impact of an Unscheduled Database Outage
Original Publication Date: 2002-Dec-20
Included in the Prior Art Database: 2003-Jun-21
Disclosed is an automated and repeatable method for assessing the impact of an unscheduled database outage. The method involves measuring the restart time of a database after an outage is forced in a test environment that mimics the production environment. The user can tune the environment after each iteration and then observe the effect on restart time.
One of the largest problems database users face in an e-commerce environment today is assessing the length and the cost of unplanned downtime. The cost of an outage correlates directly with its length. For instance, database downtime for a popular on-line store could cost the business tens of thousands of dollars an hour, erode customer confidence, hurt the reputation of the business in the marketplace, and cause the loss of future sales. A cost assessment alone is not enough. A repeatable and controllable process is also needed, so that the database administrator can make changes to the system (hardware upgrades, DB2 configuration parameter tuning, etc.) and then rerun the assessment to confirm that the changes actually reduced downtime.
The solution is to write software that starts up a database product, such as DB2, and runs against it a suite of applications that simulates the load that may be encountered in a production environment. After a preset amount of time, an abnormal termination is forced, which causes the database to crash (for example, by issuing kill -9 against the db2sysc process). A timer is then started and the database is brought back up. The timer records how long the restart takes. This process is repeated a number of times and the values are stored. From the stored values, the average restart time is computed, along with the minimum, maximum, and median. The result is the expected time for the database to be restarted, with its range, from which a cost estimate can be derived.
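The measurement loop described above can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the disclosed implementation: the names measure_restart_times and summarize are hypothetical, and the three callables (start_workload, force_crash, restart_database) are placeholders for whatever site-specific commands start the load, force the abnormal termination, and bring the database back up. The sketch assumes restart_database returns only once the database is usable again.

```python
import statistics
import time
from typing import Callable, Dict, List


def measure_restart_times(
    start_workload: Callable[[], None],
    force_crash: Callable[[], None],
    restart_database: Callable[[], None],
    iterations: int = 5,
    run_seconds: float = 0.0,
) -> List[float]:
    """Run the crash/restart cycle and return one restart time per iteration."""
    durations = []
    for _ in range(iterations):
        start_workload()             # hypothetical: start the database and its load
        time.sleep(run_seconds)      # let the simulated load run for the preset time
        force_crash()                # hypothetical: force the abnormal termination
        began = time.perf_counter()  # timer starts once the crash has been forced
        restart_database()           # hypothetical: assumed to block until usable
        durations.append(time.perf_counter() - began)
    return durations


def summarize(durations: List[float]) -> Dict[str, float]:
    """Reduce the stored restart times to the statistics named in the text."""
    return {
        "min": min(durations),
        "max": max(durations),
        "mean": statistics.mean(durations),
        "median": statistics.median(durations),
    }
```

Passing the three steps in as callables keeps the harness independent of any particular database product, so the same loop can be rerun unchanged after each tuning change.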
These measurements yield a statement that maintaining the status quo will incur a cost of a x b for an outage (where a is the average duration of the downtime and b is the estimated cost per unit of downtime). With this baseline data, a database administrator can run experiments and assess the benefit of making changes to the production system. For instance, the administrator can upgrade the memory in one machine and see whether restart time decreases. If it does, a cost savings estimate can be made and used to decide whether an investment in new hardware is warranted (i.e. the cost of buying new hardware is less than the cost of the difference in downtime, or put differently, the cost of buying new hardware is offset by improvements in system availability). Using the same baseline, database configuration parameters can be changed and their effect on restart time evaluated before they are used in production. A cost savings statement is produced in this case as well.
This solution also allows users to build a predictive model for determining the amount of downtime by capturing database activity, such as the number and types of statements, the number of applications, etc.
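The cost comparison above reduces to simple arithmetic, sketched below in Python. The function names are hypothetical, and the expected_outages parameter is an assumption that does not appear in the original text: it scales the per-outage saving by the number of outages expected over the period in which the hardware cost is to be recovered, and it defaults to a single outage.

```python
def outage_cost(avg_downtime_hours: float, cost_per_hour: float) -> float:
    """Cost of one outage: a x b, with a in hours and b in cost per hour."""
    return avg_downtime_hours * cost_per_hour


def savings_per_outage(
    baseline_hours: float, tuned_hours: float, cost_per_hour: float
) -> float:
    """Cost difference between the baseline and the tuned restart time."""
    return outage_cost(baseline_hours, cost_per_hour) - outage_cost(
        tuned_hours, cost_per_hour
    )


def upgrade_is_warranted(
    baseline_hours: float,
    tuned_hours: float,
    cost_per_hour: float,
    upgrade_cost: float,
    expected_outages: int = 1,  # assumption: not part of the original text
) -> bool:
    """True when the upgrade costs less than the downtime cost it avoids."""
    saved = savings_per_outage(baseline_hours, tuned_hours, cost_per_hour)
    return upgrade_cost < saved * expected_outages
```

With illustrative figures, an average restart time that drops from 0.5 hours to 0.25 hours at 20,000 per hour saves 5,000 per outage, so a 4,000 upgrade pays for itself after one outage while a 6,000 upgrade needs two.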