Method for distributing work in a grid computing environment for better grid efficiency
Publication Date: 2010-Aug-26
The IP.com Prior Art Database
Described is a method for distributing work in a grid computing environment for better grid efficiency.
When running a volunteer computing grid such as the World Community Grid* project, one of the concerns is to use the donor computers as efficiently as possible: that is, not to distribute redundant copies of the same work needlessly, and not to have to throw away returned results because they arrived much too late. In existing grid projects using software such as the MetaProcessor** from United Devices** (no longer offered) and BOINC (Berkeley Open Infrastructure for Network Computing, an open-source software package), the exact same research work is typically sent to more than one member machine for two reasons. First, when multiple copies of the results are returned, they can be compared for equality, eliminating errant results or those that have been tampered with in some manner. For example, a public grid can have nearly 0.5% of its donor computers miscalculating scientific computations. Second, multiple copies are sent in case one of the member machines is no longer contributing, so that other machines will still return a result in a timely manner.

A longer wait for results is undesirable because of the large amount of extra storage required at the grid servers to keep track of the work, its input data, and the results retrieved so far. This storage can add up to terabytes, and thus can be costly. Furthermore, many slow donor machines, or machines that do not contribute a sufficient fraction of the time, may never return their results within the desired deadline. Grid.org, which was run by United Devices, sent as many as ten copies of the same work to donor machines in order to get prompt results that could be voted on for validity. This represents a huge (10x) loss of potential compute power in a grid. The World Community Grid project tries to keep the number of copies to two or three at most, but even this represents a large reduction of total compute power and thus slows down valuable research.
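The voting scheme described above can be sketched as a simple quorum check over the redundant copies of a result. This is an illustrative sketch, not the actual MetaProcessor or BOINC validator logic; the function name, the `quorum` parameter, and the idea of representing each returned result as a comparable value (e.g. a hash of the output file) are assumptions for the example.

```python
from collections import Counter

def validate_by_quorum(results, quorum=2):
    """Hypothetical grid-server check: given the result values returned by
    the replicas of one work unit, accept the value that at least `quorum`
    replicas agree on; return None if no value reaches the quorum, meaning
    the work unit must be re-sent to more donor machines."""
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    return value if count >= quorum else None

# Two of three donors agree, so the work unit is validated:
accepted = validate_by_quorum(["a3f9", "a3f9", "0b17"], quorum=2)
```

The efficiency cost mentioned in the text follows directly: every extra replica needed to reach the quorum is compute time not spent on new work.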
Many grid projects take the form of Monte Carlo simulations. That is, the solution space is so enormous that it is impossible to exhaustively compute the optimal result. Instead, using Monte Carlo methods, many pseudo-random variations or choices are introduced during the simulation to obtain a widely scattered sampling of the solution space. Afterward, clustering methods are used to find the true, or most likely, optimal solution to the problem. Typically, a certain minimum number of such simulations is required for the clustering method to have a sufficient chance of finding a suitable, if not optimal, solution. This is accomplished by having many donor machines on the grid each compute some number of simulations such that the entire grid ends up producing the total number of desired simulations. The clustering approach eliminates the need for sending out duplicates of the same work on the grid because, presumably, errant results will not form clusters in the solution space. Yet some sort of validation of the results may be...
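The cluster-based validation idea above can be illustrated with a toy example. This is a minimal sketch under assumed conditions: honest donors return noisy samples near a (hypothetical) optimum, roughly 0.5% of donors return garbage (matching the errant-machine rate cited earlier), and clustering is approximated by simple binning; real grid projects would use problem-specific simulations and proper clustering algorithms.

```python
import random
from collections import defaultdict

def donor_simulation(seed):
    """Hypothetical per-donor Monte Carlo run: an honest donor returns a
    noisy sample near the (unknown) optimum at x = 3.0, while ~0.5% of
    donors are errant and return an arbitrary value."""
    rng = random.Random(seed)
    if rng.random() < 0.005:            # errant machine
        return rng.uniform(-100.0, 100.0)
    return rng.gauss(3.0, 0.1)          # honest noisy sample

def cluster_estimate(samples, bin_width=0.5):
    """Crude stand-in for a clustering step: bin the samples, take the
    densest bin, and average its members. Errant results scatter widely
    and so never form a dense cluster; they are ignored automatically,
    with no need for redundant copies of each work unit."""
    bins = defaultdict(list)
    for s in samples:
        bins[round(s / bin_width)].append(s)
    densest = max(bins.values(), key=len)
    return sum(densest) / len(densest)

samples = [donor_simulation(seed) for seed in range(10_000)]
estimate = cluster_estimate(samples)    # close to the true optimum, 3.0
```

Note that each work unit here is sent only once: the validation happens collectively, at clustering time, rather than per result.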