Web Analytics: HTTP Log Import Through EJBs

IP.com Disclosure Number: IPCOM000016219D
Original Publication Date: 2002-Sep-14
Included in the Prior Art Database: 2003-Jun-21
Document File: 2 page(s) / 42K

Publishing Venue

IBM

Abstract

As the World Wide Web has grown in popularity, traffic to web sites has grown steadily. Numerous techniques have been adopted to cope with this increased traffic. Web sites have been optimized to reduce the amount of data transmitted. Caches have been developed to take some of the load off web servers. Web servers themselves are being load balanced and clustered. What appears to an end user as one web site may be tens or even hundreds of machines, all needed to ensure that every visitor gets a prompt response. All of these techniques optimize the end user's experience on the web site, and high-traffic web sites are perfectly capable of serving millions of hits per day. However, this amount of traffic generates a vast amount of logged data about visitors, and traditional methods of web analytics are becoming unable to cope with it all. The traditional method for processing HTTP log files is to process each file from start to end. A single unit of work (thread or process) generally handles the entire file: it opens the file and begins reading, processing the file's data as it goes along. From an analysis viewpoint, there is a relationship between records: a session. Processing a file from start to end within a single unit of work allows the analysis process to easily accumulate records into a session. The process handling the log file can process log data only at a fixed rate. To increase the amount of data processed in a given time, faster hardware is required. However, load balancing can increase the amount of log data to be processed far beyond the capabilities of the fastest hardware; a single machine simply cannot keep up with the data generated by a cluster of tens or hundreds of machines.

This is the abbreviated version, containing approximately 52% of the total text.

      As the World Wide Web has grown in popularity, traffic to web sites has grown steadily. Numerous techniques have been adopted to cope with this increased traffic. Web sites have been optimized to reduce the amount of data transmitted. Caches have been developed to take some of the load off web servers. Web servers themselves are being load balanced and clustered. What appears to an end user as one web site may be tens or even hundreds of machines, all needed to ensure that every visitor gets a prompt response.

      All of these techniques optimize the end user's experience on the web site, and high-traffic web sites are perfectly capable of serving millions of hits per day. However, this amount of traffic generates a vast amount of logged data about visitors, and traditional methods of web analytics are becoming unable to cope with it all.

      The traditional method for processing HTTP log files is to process each file from start to end. A single unit of work (thread or process) generally handles the entire file: it opens the file and begins reading, processing the file's data as it goes along. From an analysis viewpoint, there is a relationship between records: a session. Processing a file from start to end within a single unit of work allows the analysis process to easily accumulate records into a session.
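
      The start-to-end approach can be illustrated with a minimal Python sketch. It is language-neutral illustration only, not code from the disclosure. The `LogRecord` fields, the `sessionize` name, and the 30-minute idle timeout are all assumptions introduced here; the text does not say how a session is delimited.

```python
from typing import NamedTuple


class LogRecord(NamedTuple):
    # Hypothetical record shape; real log formats carry more fields.
    visitor: str    # assumed visitor key, e.g. client IP or cookie value
    timestamp: int  # seconds since the epoch
    path: str       # requested URL path


def sessionize(records, timeout=1800):
    """Group records into sessions in a single start-to-end pass.

    Assumes records arrive in timestamp order, as in one log file.
    A visitor idle for more than `timeout` seconds (an assumed
    convention, not taken from the source) starts a new session.
    """
    open_sessions = {}  # visitor -> records of that visitor's current session
    finished = []
    for rec in records:
        current = open_sessions.get(rec.visitor)
        if current and rec.timestamp - current[-1].timestamp > timeout:
            finished.append(current)  # idle gap exceeded: close the session
            current = None
        if current is None:
            current = []
            open_sessions[rec.visitor] = current
        current.append(rec)
    finished.extend(open_sessions.values())  # close whatever is still open
    return finished
```

      Because one unit of work sees every record in order, the only state needed is the per-visitor dictionary. That simplicity is what is lost once the file is split across machines.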

      The process handling the log file can process log data only at a fixed rate. To increase the amount of data processed in a given time, faster hardware is required. However, load balancing can increase the amount of log data to be processed far beyond the capabilities of the fastest hardware; a single machine simply cannot keep up with the data generated by a cluster of tens or hundreds of machines.

      Instead of maximizing performance on one CPU, employ an alternate method of processing log files. Just as a cluster of web servers is responsible for serving HTTP content, employ a cluster of machines to analyze the log data. As the amount of data to be analyzed increases, the number of machines in the cluster can be increased to handle the additional data.
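
      One way to spread log records over a cluster while keeping each visitor's records together is to route by a stable hash of a visitor key. The Python sketch below is an assumption made for illustration: the visible text does not say how work is divided among machines, and the names `assign_worker` and `partition` are hypothetical.

```python
import zlib
from collections import defaultdict


def assign_worker(visitor, num_workers):
    """Pick a worker index for a visitor key using a stable hash."""
    # zlib.crc32 gives the same value on every run and every machine,
    # unlike Python's built-in hash() for strings.
    return zlib.crc32(visitor.encode("utf-8")) % num_workers


def partition(records, num_workers):
    """Split (visitor, line) pairs into per-worker batches.

    All lines for one visitor land in the same batch, so a worker can
    still accumulate that visitor's records into sessions.
    """
    batches = defaultdict(list)
    for visitor, line in records:
        batches[assign_worker(visitor, num_workers)].append(line)
    return batches
```

      With simple modulo routing, changing the number of workers changes most assignments, so a real deployment that grows the cluster would need to account for sessions that are open at the time of the change.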

      To employ this method of log file processing, consider the process of a...