Improving Heuristic Correlation using Time-based Algorithms to Predict Relevance
Original Publication Date: 2004-Jun-08
Included in the Prior Art Database: 2004-Jun-08
This article describes a set of algorithms and procedures that improve the ability to heuristically correlate information between multiple software systems. Heuristic correlation uses implied relationships between the data exchanged between products to correlate the processing of the exchanged data. This idea focuses on making the correlation as deterministic as possible without requiring shared or unique identifiers within the data exchanged.
Heuristic correlation uses implied relationships within the data contained in the requests exchanged between products. It is typically not deterministic (i.e. unique in space and time), because the data itself is not unique. For example, a system can correlate the HTTP requests sent from a browser to a web server using the URL contained in the request, but a web server typically receives many requests for the same URL, from multiple browsers and at different times. The request is therefore unique neither in space (multiple browsers can use the same URL) nor in time (the same browser can request the same URL many times during the day).
A typical interaction between two software products when processing a request is shown below:
Most existing correlation systems require the products to assign unique identifiers to all requests exchanged between the products, and to capture those identifiers when logging the send and receive events. These systems make three assumptions: that unique identifiers can be assigned to requests, that the software products making up the system can include and recognize these identifiers in the requests they send and receive, and that the products capture the identifiers when logging the operations performed while processing a request. Most existing products either do not insert unique identifiers in requests, or handle the identifiers inconsistently. Furthermore, many of the interfaces between products limit (or prevent) the modification of the data exchanged, including the addition of any unique identifiers.
The challenge is correlating the sending of the request (by Product A) with the reception of the request (by Product B), i.e. correlating the "request received" event captured by Product B with the "request sent" event captured by Product A. The cornerstone of the proposal is an algorithm that uses time stamps and temporal estimations to estimate when a request sent by one product is received by another product, using the equation T_rcv-range = T_sent + T_travel-range + T_skew. This equation provides a reasonable estimate of the time window in which the sent request should be received.
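The time-window estimate can be illustrated with a minimal sketch. It rests on assumptions that the text above does not spell out: the travel-time range is treated as a (minimum, maximum) pair, the skew as a single signed offset (receiver clock minus sender clock), and all times as integer milliseconds. The names Event, receive_window and candidates_in_window are hypothetical and chosen only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """A logged event (hypothetical record layout, for illustration only)."""
    product: str
    kind: str          # "sent" or "received"
    timestamp_ms: int  # time on the logging product's own clock
    url: str


def receive_window(t_sent_ms, travel_min_ms, travel_max_ms, skew_ms):
    """Apply T_rcv-range = T_sent + T_travel-range + T_skew.

    Assumption: the travel range is a (min, max) pair and the skew is
    one signed offset, so the result is an (earliest, latest) pair on
    the receiver's clock.
    """
    earliest = t_sent_ms + travel_min_ms + skew_ms
    latest = t_sent_ms + travel_max_ms + skew_ms
    return earliest, latest


def candidates_in_window(sent, logged, travel_min_ms, travel_max_ms, skew_ms):
    """Keep only the 'received' events that fall inside the window."""
    earliest, latest = receive_window(
        sent.timestamp_ms, travel_min_ms, travel_max_ms, skew_ms
    )
    return [
        e for e in logged
        if e.kind == "received" and earliest <= e.timestamp_ms <= latest
    ]


# Example: Product A logs a send at 100000 ms; travel time is assumed to
# be 10..250 ms and Product B's clock is assumed to run 2000 ms ahead.
sent = Event("A", "sent", 100_000, "/index.html")
logged_by_b = [
    Event("B", "received", 102_005, "/index.html"),  # too early
    Event("B", "received", 102_100, "/index.html"),  # inside the window
    Event("B", "received", 102_300, "/index.html"),  # too late
]
window = receive_window(sent.timestamp_ms, 10, 250, 2_000)
matches = candidates_in_window(sent, logged_by_b, 10, 250, 2_000)
```

With these assumed values the window is 102010 to 102250 ms on Product B's clock, so only the event logged at 102100 ms remains as a candidate. The filter deliberately ignores the request data; several events for the same URL can still fall inside one window, and further narrowing would be needed to separate them.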
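[Figure: Product A and Product B each log an event for the request. Product A logs a "request sent" event and Product B logs a "request rcv'ed" event, each with its associated request information.]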
The following methodology can then be used to narrow the list of possible request received events down to the event that most probably corresponds to the desired request sent event:
1. Narrow the list of possible events to only those captured in the specified time window, using the equation T_rcv-range = T_sent + T_travel-range + T_skew.
2. Apply heuristic algorithms on the request information captured in the events to determine which events represent requests containing the same data (e.g. only choose tho...