Technique for iterative refinement of parsing for semi structured logs

IP.com Disclosure Number: IPCOM000232137D
Publication Date: 2013-Oct-21
Document File: 4 page(s) / 98K

Publishing Venue

The IP.com Prior Art Database

Abstract

This disclosure describes a technique that builds domain-specific parsers for semi-structured log data by using an off-the-shelf tokenizer and a library of concept annotators. The proposed technique is generic and extensible – it can be applied to any scenario, but can be improved by progressively adding more domain knowledge in the form of custom concept annotators, as and when they become available.


In a distributed computing system, logs are collected from a large number of end points. A log search & analysis system enables these logs to be parsed, annotated, and indexed in an efficient data store and used for troubleshooting and root cause analysis. When application or infrastructure problems occur, these systems provide a single interface for IT administrators or application support engineers to search across all the different logs that have been ingested into the system.

Parsing is the process of breaking the input stream of text, in this case application and infrastructure logs, into meaningful groups or blocks of characters, which could be words, phrases, etc., as determined by the semantics of the domain or application. For semi-structured log data, parsing can also extract attribute-value pairs according to domain- or application-specific semantics. This is an important task in the overall processing of the text, and the output of this stage feeds into subsequent stages such as mining of the text and providing faceted search and navigation over the log data.

Grouping text into tokens can be done using an out-of-the-box tokenizer. However, off-the-shelf tokenizers are often very generic and rely on heuristics to determine the tokens. For example, all contiguous strings of alphanumeric characters are grouped into a single token, while all non-alphanumeric characters such as punctuation marks and whitespace act as delimiters.

This approach can cause problems downstream, especially when the tokenizer breaks a composite entry into multiple tokens and discards the intervening delimiters (the usual behavior). It then becomes impossible to reconstruct the composite entry from the multiple tokens at a later stage. Thus, it is important to tokenize intelligently.
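To make the failure concrete, here is a minimal sketch (not part of the disclosure) of such a generic tokenizer applied to a hypothetical log line. The splitting rule mirrors the heuristic described above: contiguous alphanumeric runs become tokens, everything else is a discarded delimiter.

```python
import re

def naive_tokenize(text):
    # Split on any run of non-alphanumeric characters and drop the
    # delimiters, mimicking a generic off-the-shelf tokenizer.
    return [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]

# Hypothetical log line: the date and hostname are composite entries.
line = "2013-10-21 ERROR host=db01.example.com disk full"
print(naive_tokenize(line))
# The date "2013-10-21" is shattered into '2013', '10', '21' and the
# hostname into 'db01', 'example', 'com'; the '-' and '.' delimiters
# are gone, so the composites cannot be reassembled downstream.
```
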

Extracting attribute-value pairs is also not straightforward. For example, consider the simple case of <attribute> <separator> <value> appearing in the data, where <separator> is a common separator like '=', ':' etc. When the attribute and/or the value is multi-word or has intervening delimiters, a simple pattern-based extractor will make mistakes. For example, consider the entry " Date = Thu, March 21, 2013 " within the input text. The aforementioned extractor will extract (attribute = Date, value = Thu) instead of (attribute = Date, value = Thu, March 21, 2013), since it has no way of knowing that the entire text on the right-hand side is, together, a meaningful entity.

This disclosure describes a technique that builds domain-specific parsers for semi-structured log data by using an off-the-shelf tokenizer and a library of concept annotators. While it is possible to build custom parsers from scratch for each log source's semantics and log format, this would not be scalable as new log sources and formats get added. The proposed technique is generic and extensible – it can be applied to any scenario, but can be improved by progressively adding more domain knowledge in the form of custom concept annotators, as and when they become available.
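The annotator-library idea can be sketched as follows. This is a hypothetical illustration under assumed annotator patterns, not the disclosure's code: each concept annotator is a named regular expression, spans it recognizes survive as single tokens, and only the leftover text falls through to the generic tokenizer. Adding a new annotator to the library refines the parse without rewriting the parser.

```python
import re

# Hypothetical library of concept annotators; each new entry adds
# domain knowledge and iteratively refines the parse.
ANNOTATORS = [
    ("date", re.compile(r"\w{3}, \w+ \d{1,2}, \d{4}")),
    ("ip",   re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")),
]

def generic_tokenize(text):
    # Generic fallback: split on non-alphanumerics, drop delimiters.
    return [("token", t) for t in re.split(r"[^A-Za-z0-9]+", text) if t]

def parse(text):
    # Collect all concept spans found by the annotator library.
    spans = []
    for name, pat in ANNOTATORS:
        for m in pat.finditer(text):
            spans.append((m.start(), m.end(), name, m.group()))
    spans.sort()
    tokens, pos = [], 0
    for start, end, name, matched in spans:
        if start < pos:
            continue  # overlaps an earlier concept span; skip it
        tokens.extend(generic_tokenize(text[pos:start]))
        tokens.append((name, matched))  # composite entry kept intact
        pos = end
    tokens.extend(generic_tokenize(text[pos:]))
    return tokens

print(parse("Date = Thu, March 21, 2013 from 192.168.0.1"))
```

With the date annotator in place, "Thu, March 21, 2013" is emitted as one typed token instead of being shattered; if a new log format introduces, say, UUIDs, one more annotator handles it without touching the rest of the pipeline.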