Web Analytics: Using Data Throughput to Determine Cluster Capacity for Analyzing HTTP Log Records
Original Publication Date: 2002-Sep-14
Included in the Prior Art Database: 2003-Jun-21
Processing an HTTP log file can follow the traditional method of reading the file from start to end, in which case a single unit of work (a thread or process) generally handles the entire file. Distributing the processing over multiple units of work (threads, processes, or EJBs), however, can be a more efficient approach.

To implement this approach, place the analytics logic that processes the log entries inside EJBs, and rely on an application server's clustering (such as that of IBM WebSphere Application Server) to provide continual performance improvements. Read the log file and split it into chunks, where each chunk can be a single record or a collection of records. Pass each chunk to one of the EJBs, where it is processed and stored in the database. As more EJBs are made available through application server clustering, faster performance can be achieved.

For optimum performance, one must determine how many EJBs to use. The optimal number depends on the number of nodes in the application server cluster, the capacity of each node, the capacity of the database used for data storage, and the speed of communication between the nodes and the database. All of these factors are expensive to measure, difficult to interpret, and would need to be continually refreshed to reflect the changing state of the system.

Instead of trying to calculate the optimal number of EJBs up front, use data throughput to adjust the number of EJBs during processing. This concept applies equally to EJBs and threads, depending on the unit of work used to process the log records. With either, keep adjusting the number of threads or EJBs throughout the life of the processing, so that a misleading adjustment, for example one triggered by a transient spike in operating system memory usage or CPU utilization, is corrected by later measurements rather than locked in.
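The adaptive scheme above can be sketched with plain Java threads standing in for clustered EJBs. The class name `ThroughputTunedDispatcher`, the `parseAndStore` placeholder, and the one-worker-at-a-time adjustment policy are illustrative assumptions, not part of the original disclosure; the disclosure only specifies that the unit-of-work count is tuned by observed throughput.

```java
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: split log records into chunks, hand each chunk to a
// resizable worker pool, and grow or shrink the pool based on measured
// throughput (records processed per unit time) rather than a precomputed
// capacity model.
public class ThroughputTunedDispatcher {
    private final ThreadPoolExecutor pool;
    private final AtomicLong processed = new AtomicLong();

    public ThroughputTunedDispatcher(int initialWorkers) {
        pool = new ThreadPoolExecutor(initialWorkers, initialWorkers,
                60, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
    }

    // Submit one chunk (a single record or a collection of records).
    public void submitChunk(List<String> chunk) {
        pool.submit(() -> {
            for (String record : chunk) {
                parseAndStore(record);  // stand-in for the analytics + DB write
            }
            processed.addAndGet(chunk.size());
        });
    }

    // Placeholder for the per-record analytics logic described in the article.
    private void parseAndStore(String record) { /* parse fields, store in DB */ }

    // Called periodically with the throughput measured before and after the
    // last pool-size change: keep moving in the same direction while
    // throughput improves, back off one worker when it degrades.
    public void adjustPoolSize(double previousRate, double currentRate) {
        int size = pool.getCorePoolSize();
        if (currentRate >= previousRate) {
            pool.setMaximumPoolSize(size + 1);  // raise max first, then core
            pool.setCorePoolSize(size + 1);
        } else if (size > 1) {
            pool.setCorePoolSize(size - 1);     // lower core first, then max
            pool.setMaximumPoolSize(size - 1);
        }
    }

    public long processedCount() { return processed.get(); }

    public int workerCount() { return pool.getCorePoolSize(); }

    public void shutdown() throws InterruptedException {
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```

Because the adjustment runs repeatedly against fresh throughput samples, a step taken during a transient CPU or memory spike is undone on a later cycle, which is the self-correcting behavior the disclosure calls for.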