
Technique to prevent software failures from occurring in a runtime setup
Disclosure Number: IPCOM000238013D
Publication Date: 2014-Jul-25

Publishing Venue

The Prior Art Database


Today's software products are complex enough that bugs inevitably creep in. In some cases a bug exists in a version of the product and the support team has already fixed it with a patch. The end customer, however, may not yet know about the fix, or may choose not to apply the patch because the error symptoms have not appeared in their environment. The bug can still lead to a failure if the customer's usage pattern matches the error scenario, so the same error ends up causing breakdowns at multiple customer locations. This gives the support organization an opportunity to prevent such impending failures. This article describes techniques that enhance the end-user experience of software products by helping customers identify possible breakdowns caused by dormant bugs in the code. The bug can be patched, and the error thus prevented, before it occurs in an end user's runtime environment. If the error does occur in production, the technique reduces the time needed to identify a fix. To identify the most appropriate fix, the proposed technique uses the configuration information together with the sequence of events that leads to the failure, captured as logs, the software components traversed, and other product data generated at runtime. The fix thus identified can either be applied directly or presented to the user.

This is the abbreviated version, containing approximately 29% of the total text.




Software failures are a reality. Some bugs in a software product always evade detection by the product development team, and they are exposed when customers start using the product in their own environments. Quite often, customers hit problems with middleware products while their solution is live (in production). Middleware failures exposed on live systems are the costliest, because they may cause unplanned outages of the hosted applications or permanent loss of application data.

Whenever a software failure occurs in an end customer's environment, there are a few ways to search for an available fix. One is to look up the product website and rely on the information there to identify patches on your own. Another is to contact a support representative who can determine whether a patch is already available and obtain it.

In a live production setup, time is of the essence. Detecting that there is a bug in the product code (as opposed to a configuration mistake), isolating it to one product within the software stack, and contacting the support team or searching for a solution all take precious time, during which the live application may be down. The live server could, for instance, be handling transactions for a bank or a stock market, so downtime leads to lost revenue.

Some of these software bugs are new and have to go through a cycle of investigation by product support. However, there are several instances in which the bug has already been fixed by the product development or support team and the fix has been published. In spite of this, the failure occurs because the patch was not applied, since the problem had not been seen on the customer's production server. This is the norm, as people are averse to modifying production machines as long as they are working fine.

Even when a problem has occurred in some customer environments, other customers with similar configurations or situations may not yet have faced it.

There are other situations where a fix is in a common code path, so customers with different configurations of the product may hit the same problem. This may not be obvious from reading the related fix description.
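The common-code-path case can be illustrated with a minimal sketch. It assumes that each published fix records the components it patches and that the runtime can report the components a request traversed. The `Fix` record, the component names, and the fix identifiers are all hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fix:
    """Hypothetical record for a published patch (not a real product schema)."""
    fix_id: str
    description: str
    patched_components: frozenset  # components whose code the patch changes


def fixes_on_code_path(traversed, fixes):
    """Return the fixes that patch at least one component the runtime traversed."""
    traversed = set(traversed)
    return [fix for fix in fixes if fix.patched_components & traversed]


# Illustrative catalog; descriptions mention one setup but the patch is in shared code.
FIXES = [
    Fix("FIX-1", "Hang with clustered messaging setup", frozenset({"connection_pool"})),
    Fix("FIX-2", "Wrong locale in admin console", frozenset({"admin_ui"})),
]

# Two customers with different configurations both pass through the connection pool.
customer_a_trace = ["http_listener", "connection_pool", "jdbc_driver"]
customer_b_trace = ["jms_bridge", "connection_pool"]

matches_a = [f.fix_id for f in fixes_on_code_path(customer_a_trace, FIXES)]
matches_b = [f.fix_id for f in fixes_on_code_path(customer_b_trace, FIXES)]
print(matches_a, matches_b)
```

Matching on traversed components rather than on the fix description is what surfaces FIX-1 for the second customer, whose configuration the description never mentions.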

Another important requirement for such systems is that only relevant fixes are applied on production servers. Fixes are usually interim in nature, and there is never enough time to run a complete test (such as an SVT) on them. Updating a production server with all available fixes may therefore introduce side effects, that is, failures that were not seen before. Hence the decision on whether to apply a fix is an important one and is taken with the utmost care when it comes to live servers.
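One way to picture "only relevant fixes" is a filter that keeps a fix only when both the server's configuration and its recent event sequence match what the fix targets. The sketch below is an assumption-laden illustration, not the method of the disclosure. The catalog layout, the configuration keys, the version strings, and the event names are hypothetical.

```python
def is_subsequence(pattern, events):
    """True if every item of pattern appears in events in the same order."""
    remaining = iter(events)
    # 'step in remaining' advances the iterator, so order is enforced.
    return all(step in remaining for step in pattern)


def relevant_fixes(config, events, catalog):
    """Keep only fixes whose configuration and event pattern both match."""
    selected = []
    for fix in catalog:
        config_ok = all(config.get(k) == v for k, v in fix["config"].items())
        if config_ok and is_subsequence(fix["event_pattern"], events):
            selected.append(fix["id"])
    return selected


# Hypothetical fix catalog: each entry names the setup and events it applies to.
CATALOG = [
    {"id": "FIX-10", "config": {"version": "8.5"},
     "event_pattern": ["pool_exhausted", "txn_timeout"]},
    {"id": "FIX-11", "config": {"version": "8.5", "cluster": True},
     "event_pattern": ["node_join", "cache_sync_error"]},
    {"id": "FIX-12", "config": {"version": "7.0"},
     "event_pattern": ["pool_exhausted"]},
]

server_config = {"version": "8.5", "cluster": False}
server_events = ["startup", "pool_exhausted", "gc_pause", "txn_timeout"]

print(relevant_fixes(server_config, server_events, CATALOG))
```

Here only FIX-10 is selected. The other two are left off the production server because their configuration requirements do not match, which avoids installing untested interim patches that the server does not need.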


As discussed above, it would be great if the bug never occurs - i.e the...