
Method and system for faster processing of very large semi-structured files through methodical record sampling for dates

IP.com Disclosure Number: IPCOM000193231D
Original Publication Date: 2010-Feb-15
Included in the Prior Art Database: 2010-Feb-15

Publishing Venue

IBM

Abstract

Problem determination processes may involve a variety of procedures and resources, but one of the most commonly used is some form of log file analysis. A support engineer, more often than not, will need to know what the systems (software or hardware) in question were doing up to the point of failure. A serviceable system will generate log information that allows the engineer to trace the steps leading up to the failure and, ideally, identify a cause and a solution.

When the amount of log information (usually contained within a text file) is small, the engineer may open the file in a text viewer or editor, locate critical log messages and keywords that identify what caused the failure, and then draw on experience to suggest a solution. There are also tools available that can assist in identifying problem indicators or symptoms within a log file and can even provide a solution. The first approach (e.g. searching with grep) is a tedious, hit-and-miss form of problem determination in which the engineer works through a list of known problem keywords one by one, searching for each in the log file; experienced engineers are naturally more efficient at locating the problem this way. The second approach, using a log analysis tool that provides some level of assistance, removes much of the repetitive work the engineer must otherwise carry out.

When very large amounts of log information need to be analyzed, both of these approaches become difficult to carry out. Even problem determination tools have memory and CPU limitations that become increasingly problematic as log information grows in size. What we propose here is a method to overcome the problems encountered while performing problem determination with large log files. It is important to state that while the problem and examples given pertain to the processing of log information, the methods covered in this document can be applied to any source of chronologically ordered semi-structured information.




Disclosed is a method and system that allows fast processing of semi-structured files using record sampling for dates.

When a log file is picked up for processing by a tool, the first step is to determine an appropriate chunk size. How this is calculated is left to the implementation, chiefly because it will vary drastically from system to system. For example, for a Java tool, an appropriate chunk size could be anything between 5 and 10 percent of the available heap memory.
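As an illustration, a minimal Java sketch of such a heuristic might look like the following. The class and method names are hypothetical, 7.5 percent is simply a midpoint of the 5 to 10 percent range mentioned above, and the 64 MB cap is an arbitrary safeguard.

    /*
     * Hypothetical sketch of a chunk-size heuristic based on available
     * heap, assuming the 5-10 percent rule described above.
     */
    public final class ChunkSizer {

        // Cap so that a very large heap does not produce unwieldy chunks.
        private static final long MAX_CHUNK_BYTES = 64L * 1024 * 1024;

        public static long chooseChunkSize() {
            Runtime rt = Runtime.getRuntime();
            // Heap still available: configured maximum minus what is in use.
            long used = rt.totalMemory() - rt.freeMemory();
            long available = rt.maxMemory() - used;
            // Take roughly 7.5 percent, the middle of the 5-10 percent range.
            long chunk = (long) (available * 0.075);
            return Math.min(chunk, MAX_CHUNK_BYTES);
        }
    }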

Once the chunk size has been calculated, we need to locate the points at which the file should be split.

For example in a 45 megabyte file and a maximum chunk size of 10 megabytes, the file becomes 5 chunks with the last chunk at 5 megabytes. We need to traverse to the end of first chunk i.e. seek forward 10 megabytes in the file. From this point onwards, the start position of the next record needs to be located. If the record starts with a date, then the date-time format predictor can be used locate the next date of known or even an unknown format. Otherwise a regular expression could also be used to locate the start. In this way the entire file is broken up into smaller and more manageable chunks.
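A sketch of this boundary search in Java could look like the following. The ISO-style timestamp pattern is an assumption about the record format and stands in for the date-time format predictor; the class and method names are hypothetical.

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.regex.Pattern;

    /*
     * Hypothetical sketch: find the next record boundary at or after a
     * raw byte offset. A simple ISO-style timestamp regex stands in for
     * the date-time format predictor described in this disclosure.
     */
    public final class ChunkSplitter {

        private static final Pattern RECORD_START =
                Pattern.compile("^\\d{4}-\\d{2}-\\d{2}[ T]\\d{2}:\\d{2}:\\d{2}.*");

        public static long nextRecordStart(RandomAccessFile file, long offset)
                throws IOException {
            file.seek(offset);
            file.readLine(); // skip the (probably partial) line at the seek point
            long lineStart = file.getFilePointer();
            String line;
            while ((line = file.readLine()) != null) {
                if (RECORD_START.matcher(line).matches()) {
                    return lineStart; // a record starting with a date begins here
                }
                lineStart = file.getFilePointer();
            }
            return file.length(); // no further record start: chunk runs to EOF
        }
    }

A caller would invoke nextRecordStart once per multiple of the chunk size and collect the returned offsets as the split points.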



How the chunks are realized depends on the implementation; one way would be to write the information from each chunk into a new file. Once the file has been broken into chunks, we can filter the content based on date.
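For instance, a chunk covering a given byte range could be copied into its own file along the following lines. This is a sketch only; the FileChannel-based copy is one possible implementation, not one prescribed by the disclosure.

    import java.io.IOException;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    /*
     * Hypothetical sketch: materialize one chunk (a byte range of the
     * source log) as its own file.
     */
    public final class ChunkWriter {

        public static void writeChunk(Path source, Path target,
                                      long start, long end) throws IOException {
            try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ);
                 FileChannel out = FileChannel.open(target,
                         StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
                long position = start;
                long remaining = end - start;
                // transferTo may copy fewer bytes than requested, so loop.
                while (remaining > 0) {
                    long copied = in.transferTo(position, remaining, out);
                    position += copied;
                    remaining -= copied;
                }
            }
        }
    }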

Filtering based on date essentially means creating a time window (a start and an end date) between which we want information from the file.
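Such a filter can be sketched in Java as follows, again with a fixed timestamp format assumed in place of the date-time format predictor.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.time.LocalDateTime;
    import java.time.format.DateTimeFormatter;
    import java.time.format.DateTimeParseException;

    /*
     * Hypothetical sketch: keep only lines whose leading timestamp falls
     * inside the [start, end] window. A fixed "yyyy-MM-dd HH:mm:ss"
     * format is assumed in place of the date-time format predictor.
     */
    public final class TimeWindowFilter {

        private static final DateTimeFormatter FORMAT =
                DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

        public static void filter(BufferedReader in, PrintWriter out,
                                  LocalDateTime start, LocalDateTime end)
                throws IOException {
            String line;
            while ((line = in.readLine()) != null) {
                LocalDateTime stamp = parseLeadingDate(line);
                // Continuation lines (no date) are skipped in this sketch; a
                // real tool would attach them to the preceding record.
                if (stamp != null && !stamp.isBefore(start) && !stamp.isAfter(end)) {
                    out.println(line);
                }
            }
        }

        static LocalDateTime parseLeadingDate(String line) {
            if (line.length() < 19) {
                return null; // too short to carry the assumed timestamp
            }
            try {
                return LocalDateTime.parse(line.substring(0, 19), FORMAT);
            } catch (DateTimeParseException e) {
                return null;
            }
        }
    }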




The steps below can be followed to realize this invention.

Filter File Content


The input to this step is a start date and an end date; the two dates represent the time window within which we want to retrieve information from the file.

Read the file one line at a time
    While more data is available
        Read a line from the file
        Use the date-time predictor to locate a date on the current line
        If a date is found, assign this to the file time window start date
            Additionally, track a window start file pointer as 0
    If a time window start date is not found, then exit
Take a portion from the end of the file (e.g. 10 kb)
    Read from this portion one line at a time
    While more data is available
        Read a line from the file
        Use the date-time predictor to locate a date on the current...
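As far as the excerpted steps go, they amount to scanning forward from the start of the file for the first dated line, then scanning a small portion at the end (e.g. 10 kb) for the last one, which together yield the file's own time window. A minimal Java sketch of that procedure, under the same assumed timestamp pattern as the earlier sketches, follows.

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.regex.Pattern;

    /*
     * Hypothetical sketch of the steps above: find the file's time window
     * by scanning forward for the first dated line, then scanning a tail
     * portion for the last dated line. The regex is an assumed stand-in
     * for the date-time predictor.
     */
    public final class FileTimeWindow {

        private static final int TAIL_BYTES = 10 * 1024; // "e.g. 10 kb"

        private static final Pattern DATED_LINE =
                Pattern.compile("^\\d{4}-\\d{2}-\\d{2}[ T]\\d{2}:\\d{2}:\\d{2}.*");

        /** Returns {windowStartLine, windowEndLine}, or null if no date exists. */
        public static String[] findWindow(RandomAccessFile file) throws IOException {
            // Forward scan: the first dated line opens the window; its offset
            // would be tracked as the window start file pointer.
            file.seek(0);
            String start = null;
            String line;
            while ((line = file.readLine()) != null) {
                if (DATED_LINE.matcher(line).matches()) {
                    start = line;
                    break;
                }
            }
            if (start == null) {
                return null; // no time window start date found: exit
            }
            // Tail scan: the last dated line in the final portion closes it.
            file.seek(Math.max(0, file.length() - TAIL_BYTES));
            file.readLine(); // discard a probably partial first line
            String end = start;
            while ((line = file.readLine()) != null) {
                if (DATED_LINE.matcher(line).matches()) {
                    end = line;
                }
            }
            return new String[] { start, end };
        }
    }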