Design of General Message-Passing Watchdog Process
Original Publication Date: 2005-Oct-06
Included in the Prior Art Database: 2005-Oct-06
As client/server software architectures proliferate, there is an increasing need for each component to know the state and availability of other components. There is currently no good way to do so. The conventional approach is initiator driven: whenever the need arises to learn the state of another entity, the initiating entity must probe it. For example, when the Resource Manager tries to migrate 10,000 documents (or batches of documents, whatever the unit is) to Tivoli Storage Manager (TSM), it probes TSM 10,000 times to see if it is available. As you can see, this is impractical. If TSM does crash, the worst case is that a timeout occurs during the file transfer; that timeout would likely occur sooner than 10,000 probes could complete, and it consumes fewer resources than probing TSM 10,000 times.
The other, more unconventional, approach is "destination" driven: the destination machine to which the file transfer is supposed to go (such as TSM) notifies the source machine (the Resource Manager) of its state. This has one big inherent flaw: if TSM is down, it cannot notify the source machine that it is down. The remedy is to have the destination repeatedly notify the source machine that it is available; then, if the destination machine's process crashes, the source machine stops getting pinged and can infer that the client is not available. But this amounts to polling, which is inherently wasteful and can lead to resource (network, CPU, etc.) problems, especially when there is a large number of clients. With such a downside, it is no surprise that this approach is unpopular and unconventional.
Another approach is to do nothing: assume the machine you are trying to connect to is available at all times and simply attempt the file transfer. If the machine is in fact down, a timeout will occur.
The downside of this approach is that users must wait until the timeout occurs, which takes on the appearance of a hang. Moreover, initiating a file transfer that we know is doomed from the start is pointless and wastes resources. Consider a user who needs to import 10 files. For each file the user must wait out the full timeout value (say, 5 minutes) to discover that the resource manager is down; thus 50 minutes are gone before he realizes that none of the files can be imported. Imagine, too, a server needing to send files to multiple clients while assuming all of them are up. This is therefore another bad solution.
Another, more familiar approach is that of the library server monitor process. It is essentially the same as the first approach described above, except that it polls periodically instead of on demand. Moreover, this monitor process resides on one central server and probes all Resource Managers. This design works, but it is not optimal: the library server hosting the process is a bottleneck by definition. All clients need to connect to it, which leads to resource constraints on the machine, so running another essential process there is not a good idea, especially because that process is not lightweight. It must connect to multiple Resource Managers (suppose there are, say, 50) and then make a DB2 connection for each to update its status. If this is the top-down approach, ours is the bottom-up approach, which is better, as we shall explain in the later sections.
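To make the cost concrete, the worst case of the do-nothing approach can be written out as simple arithmetic (using the figures from the example above; the function name is illustrative only):

```python
def wasted_minutes(num_files, timeout_minutes):
    # Worst case for the "do nothing" approach: every transfer
    # attempt runs to its full timeout before failing.
    return num_files * timeout_minutes

# 10 file imports, each hitting a 5-minute timeout
total = wasted_minutes(10, 5)
```

The cost grows linearly with the number of doomed transfer attempts, which is what makes this approach scale so poorly.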
Our invention is an approach to communicating "states" between entities. One of our objectives is to make it as generic as possible so that it will apply to any other product. What we utilize is a watchdog process whose implementation is trivial in Java, C++, or any other language. It is placed on the machine that needs to be monitored and takes as input a list of processes to monitor, along with any process-specific information needed to verify that each process is accepting requests, such as a port number, connection information, etc.
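The watchdog's input could take many forms; as a sketch, one hypothetical representation of the monitored-process list (the entry names, field names, and port are illustrative assumptions, not part of this publication):

```python
# Hypothetical input: one entry per process to monitor, plus whatever
# the chosen check needs (port number, connection information, etc.).
MONITOR_CONFIG = [
    {"name": "ResourceManager", "check": "connect",
     "host": "localhost", "port": 8100},
    {"name": "db2sysc", "check": "process"},  # only verify it is running
]

def parse_config(entries):
    """Validate the entries and index them by process name."""
    by_name = {}
    for entry in entries:
        if "name" not in entry or "check" not in entry:
            raise ValueError("each entry needs 'name' and 'check'")
        by_name[entry["name"]] = entry
    return by_name
```

Keeping the check type ("connect" vs. "process") per entry is what lets the same watchdog binary serve any product.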
Then, periodically (at a user-configurable interval), the watchdog checks whether the monitored process(es) are up, either by checking that the process is running or by making an actual connection to it. In the process, it can even gather statistics, such as network throughput. If and only if the state of a particular process changes does the watchdog send a message to the remote machine that needs the process, in a form the remote machine can retrieve whenever it needs to. All previous messages received about the monitored process on the remote machine can be purged periodically to save space. The message could take the form of an email, a file, etc., so that the remote machine can retrieve it whenever it wants. This approach is good because the remote machine acts only when it needs to, thereby freeing it from the burden of responding to every machine it is connected to whenever it is pinged (polling).
For example, suppose I have a content management system with one central library server and 50 remote Resource Managers. I place this generic watchdog process on each of the remote Resource Managers. Every 5 minutes, each watchdog process checks for a change in the status of its Resource Manager. If the state changes, it sends an email (or whatever) to the library server machine notifying it of the RM's latest state/statistics. Then, only if the library server needs access to a particular RM, say RM 23, does the library server check the email from RM 23. As you can see, in this approach, the communication cost on each RM as well as on...
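One pass of the watchdog cycle described above can be sketched as follows. This is a minimal illustration, assuming a TCP-connect check and a local file as the "outbox" message channel; the function names and message format are inventions for this sketch, and a real deployment might use email or any other retrievable medium, as the text notes:

```python
import json
import socket
import time

def check_process(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def watchdog_cycle(targets, last_states, outbox):
    """One periodic pass: probe each target and emit a message
    only when its state changed since the last pass."""
    for name, (host, port) in targets.items():
        up = check_process(host, port)
        if last_states.get(name) != up:
            last_states[name] = up
            # The message channel is an assumption: here, a line of JSON
            # appended to a local file that the remote machine can read
            # (and purge) whenever it chooses.
            with open(outbox, "a") as f:
                f.write(json.dumps({"process": name, "up": up,
                                    "time": time.time()}) + "\n")
    return last_states
```

Because a message is written only on a state change, a stable Resource Manager generates no traffic at all between its first "up" report and its next outage, which is the key difference from polling.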