Web Analytics: Avoiding Redundancy in Analysis of Log Data

IP.com Disclosure Number: IPCOM000016218D
Original Publication Date: 2002-Sep-14
Included in the Prior Art Database: 2003-Jun-21
Document File: 2 page(s) / 46K

Publishing Venue

IBM

Abstract

For accurate web traffic analysis, it is essential to avoid redundant data collection. Often, raw web traffic data is collected and analyzed by reading an HTTP server's log file. Redundant data can be introduced, for example, by importing multiple copies of an HTTP log file, or by importing the same HTTP log file multiple times. HTTP servers typically construct these log files over time, appending records to the end of each file. Suppose you collect and analyze data from an HTTP log file at one point in time, and later, after more records have been appended to the file, you want to collect and analyze from the file again, picking up only the new records. During this second collection and analysis pass, you do not want to duplicate the data from the first part of the file.


An alternative source of HTTP data is a set of database tables. This occurs when a web server logs raw web traffic data to database tables; in this case the data collection and analysis process reads its raw data from the database rather than from HTTP log files. As traffic continues on the web site, new records are constantly added to the same set of database tables. The same redundancy problem exists: when re-analyzing the same set of database tables, you do not want to re-analyze records that you analyzed before. You want to analyze only the new records.

The following algorithms describe one method to avoid data redundancy with minimal impact on performance. The method can be applied to analyzing web traffic data, or used more generally when reading and processing any type of data file or database input.

For data import from files:

After a file is processed, save the line number of the last record processed. The next time a file is to be processed, check whether the file has already been processed. If it has, retrieve the line number of the last record that was processed before. If the file currently contains more records than were processed in the last import of the file, process only the newly appended records.

This algorithm is based on the following assumptions:

Records are always appended to the end of the log file. The file is not manually modified. In the case of web traffic log files, multiple web servers do not generate the exact same log record.
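
The following is a minimal sketch, in Python, of how this file-based bookkeeping might be implemented. The state-file name, function names, and record-processing callback are illustrative assumptions and are not part of the original disclosure.

    import os

    # Hypothetical state file recording, for each log file, how many records
    # (lines) have already been processed; the name is illustrative only.
    STATE_FILE = "import_state.txt"

    def load_state():
        """Return a {log_path: lines_processed} mapping from the state file."""
        state = {}
        if os.path.exists(STATE_FILE):
            with open(STATE_FILE) as f:
                for line in f:
                    path, count = line.rstrip("\n").rsplit("\t", 1)
                    state[path] = int(count)
        return state

    def save_state(state):
        """Write the {log_path: lines_processed} mapping back to the state file."""
        with open(STATE_FILE, "w") as f:
            for path, count in state.items():
                f.write(f"{path}\t{count}\n")

    def import_new_records(log_path, process_record):
        """Process only records appended to log_path since the last import."""
        state = load_state()
        already_done = state.get(log_path, 0)   # 0 if never processed before
        last_line = already_done
        with open(log_path) as f:
            for line_no, line in enumerate(f, start=1):
                if line_no <= already_done:
                    continue    # skip records processed in an earlier import
                process_record(line)
                last_line = line_no
        # Save the line number of the last record processed for next time.
        state[log_path] = last_line
        save_state(state)

Calling import_new_records a second time on the same log file, after the server has appended more lines, processes only the newly appended records.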

For data import from a database:

After the database table is processed, save the date and time of the latest record processed (label it 'a'). The next time a database table is to be processed, check whether the table has already been processed. If it has, retrieve the date/time of the latest record that was processed before (a). Get the date/time of the latest log record currently in the database table (label it 'b'). If (b) is later than (a), then import only the records which have a later date/time than (a).

This algorithm is based on the following assumptions:

Records in the database are not manually modified. Database table i...
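
A minimal sketch of the database variant follows, again in Python (using SQLite for a self-contained example). The table name weblog, the timestamp column log_time, and the bookkeeping table import_state are illustrative assumptions; the disclosure does not specify a schema.

    import sqlite3

    def import_new_db_records(conn, process_record):
        # Bookkeeping table: one row per source table, holding the date/time
        # of the latest record processed so far. Assumes table_name is the
        # primary key so INSERT OR REPLACE updates the existing row.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS import_state "
            "(table_name TEXT PRIMARY KEY, last_time TEXT)"
        )
        cur = conn.cursor()

        # (a) = date/time of the latest record processed in the previous run,
        # or None if the table has never been processed.
        cur.execute("SELECT last_time FROM import_state WHERE table_name = 'weblog'")
        row = cur.fetchone()
        a = row[0] if row else None

        # (b) = date/time of the latest log record currently in the table.
        cur.execute("SELECT MAX(log_time) FROM weblog")
        b = cur.fetchone()[0]

        if b is None or (a is not None and b <= a):
            return  # no new records since the last import

        # Import only records with a later date/time than (a).
        if a is None:
            cur.execute("SELECT * FROM weblog ORDER BY log_time")
        else:
            cur.execute("SELECT * FROM weblog WHERE log_time > ? ORDER BY log_time", (a,))
        for record in cur.fetchall():
            process_record(record)

        # Remember (b) as the high-water mark for the next import.
        cur.execute(
            "INSERT OR REPLACE INTO import_state (table_name, last_time) "
            "VALUES ('weblog', ?)",
            (b,),
        )
        conn.commit()

The high-water mark (b) is stored only after the new records have been processed, so a failed run does not advance the mark and the records are picked up again on the next import.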