Web Analytics: Avoiding Redundancy in Analysis of Log Data Disclosure Number: IPCOM000016218D
Original Publication Date: 2002-Sep-14
Included in the Prior Art Database: 2003-Jun-21

Publishing Venue



For accurate web traffic analysis, it is essential to avoid redundant data collection. Raw web traffic data is often collected and analyzed by reading an HTTP server's log file. Redundancy can occur, for example, when multiple copies of an HTTP log file are imported, or when the same log file is imported more than once. HTTP servers typically build these log files over time, appending new records to the end of each file. Suppose you collect and analyze data from an HTTP log file at one point in time, and later, after more records have been appended, you want to collect and analyze from the same file again, picking up only the new records. During this second pass, you do not want to duplicate the data from the first part of the file.

An alternative source of HTTP data is a set of database tables, used when a web server logs raw web traffic data to a database. In this case the data collection and analysis process reads its raw data from the database rather than from HTTP log files. As traffic continues on the web site, new records are constantly added to the same set of tables, so the same redundancy problem exists: when re-analyzing the tables, you do not want to re-analyze records that were analyzed before; you want to analyze only the new records.

The following algorithms describe one method of avoiding data redundancy with minimal impact on performance. The method can be applied to analyzing web traffic data, or more generally to reading and processing any type of data file or database input.

For data import from files: