Web Analytics: HTTP Log Import Through EJBs
Original Publication Date: 2002-Sep-14
Included in the Prior Art Database: 2003-Jun-21
As the World Wide Web has grown in popularity, traffic to web sites has grown steadily with it. Numerous techniques have been adopted to cope with this increased traffic. Web sites have been optimized to reduce the amount of data transmitted. Caches have been developed to take some of the load off web servers. Web servers themselves are load balanced and clustered. What appears to an end user as one web site may be tens or even hundreds of machines, all needed to ensure that every visitor gets a prompt response. All of these techniques optimize the end user's experience on the web site, and high-traffic web sites are perfectly capable of serving millions of hits per day. However, this amount of traffic generates a vast amount of logged data about visitors, and traditional methods of web analytics are becoming unable to cope with it all.

The traditional method for processing HTTP log files is to process each file from start to end. A single unit of work (thread or process) generally handles the entire file: it opens the file and begins reading, processing the file's data as it goes along. From an analysis viewpoint, there is a relationship between the records of a session. Processing a file from start to end within a single unit of work allows the analysis process to easily accumulate records into a session.

The process handling the log file can only process log data at a fixed rate. To increase the amount of data processed in a given time, faster hardware is required. However, load balancing can increase the amount of log data to be processed far beyond the capabilities of the fastest hardware. There is simply no way a single machine can keep up with the data generated by a cluster of tens or hundreds of machines.
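The traditional start-to-end approach described above can be illustrated with a minimal Java sketch. It is not taken from the original disclosure. The class and method names are hypothetical, the input is assumed to be in Common Log Format, and the client host is assumed to serve as the session key (a real analyzer would more likely use cookies or an inactivity timeout).

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of single-unit-of-work log processing.
final class SequentialLogImport {

    // Reads the log from start to end in the calling thread and groups the
    // requested paths by client host. Treating the host as the session key
    // is an assumption made for this sketch only.
    static Map<String, List<String>> accumulateSessions(Reader source) {
        Map<String, List<String>> sessions = new LinkedHashMap<>();
        Iterator<String> lines = new BufferedReader(source).lines().iterator();
        while (lines.hasNext()) {
            String[] fields = lines.next().split(" ");
            if (fields.length < 7) {
                continue; // skip lines that do not look like Common Log Format
            }
            String host = fields[0]; // first field: client host
            String path = fields[6]; // seventh field: requested path
            sessions.computeIfAbsent(host, k -> new ArrayList<>()).add(path);
        }
        return sessions;
    }

    public static void main(String[] args) {
        String sample =
            "10.0.0.1 - - [14/Sep/2002:10:00:00 -0400] \"GET /index.html HTTP/1.0\" 200 1043\n"
          + "10.0.0.2 - - [14/Sep/2002:10:00:01 -0400] \"GET /about.html HTTP/1.0\" 200 512\n"
          + "10.0.0.1 - - [14/Sep/2002:10:00:05 -0400] \"GET /cart HTTP/1.0\" 200 2048\n";
        // Each entry maps a client host to the paths it requested, in file order.
        System.out.println(accumulateSessions(new StringReader(sample)));
    }
}
```

Because every record passes through the same loop in file order, building sessions is trivial. The same property is also the limitation: throughput is bounded by what one thread on one machine can read and parse.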