Design and implementation of efficient, textual data processing algorithms and correlation methods

Dr. Varga Pál
Department of Telecommunications and Media Informatics

Unstructured logging is still dominant in today's IT systems. This approach wastes information: the information in logs is barely utilized unless efficient extraction methods are available. The aim of this thesis is to design and implement efficient log parsing algorithms and to present a format that describes correlation relationships between different event types, thereby providing efficient data enrichment methods for logs.

The thesis first surveys current log parsing and correlation technologies. Then two algorithms based on suffix arrays and suffix trees are presented and benchmarked. The results are compared with the performance of a reference implementation of a regular expression based parser. Furthermore, a log parsing library that uses these algorithms is designed and implemented. The thesis describes this library and defines a format capable of representing relationships between event types. A reference implementation using this format is presented; it collects related events into groups and generates artificial events when certain conditions are met.
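The grouping step described above can be illustrated with a minimal sketch. The rule layout (`key`, `requires`, `emit`), the event fields and the function name `correlate` are hypothetical and chosen only for illustration; they are not the format defined in the thesis.

```python
from collections import defaultdict

# Hypothetical correlation rule (an assumption, not the thesis's format):
# events sharing the same "session" value belong to one group, and an
# artificial event is emitted once both required types have been seen.
RULE = {
    "key": "session",
    "requires": {"login", "logout"},
    "emit": "session_complete",
}


def correlate(events, rule):
    """Group events by the rule's key field and emit one artificial event
    per group as soon as every required event type has occurred in it."""
    groups = defaultdict(list)
    done = set()       # keys that have already produced an artificial event
    artificial = []
    for event in events:
        key = event.get(rule["key"])
        if key is None:
            continue   # event does not take part in this correlation
        groups[key].append(event)
        seen = {e["type"] for e in groups[key]}
        if key not in done and rule["requires"] <= seen:
            done.add(key)
            artificial.append(
                {"type": rule["emit"], rule["key"]: key, "count": len(groups[key])}
            )
    return dict(groups), artificial
```

Calling `correlate` on a stream in which session `a` logs in and out, while session `b` only logs in, would yield a single `session_complete` event for `a` and leave `b` as an open group.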

The results showed that purely regular expression based parsers did not scale well: as the number of patterns increased, their performance decreased linearly. By contrast, the presented suffix array and suffix tree based parsers did not suffer from this weakness: their measured performance was hardly affected by the number of patterns. In some measurements the newly designed parser library parsed more than 1 million log messages per second, 38 times faster than the analyzed regular expression based parser. To demonstrate its practical applicability, the parser library was integrated into syslog-ng, an open source logging application.
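The scaling difference can be made concrete with a small sketch. It does not reproduce the suffix array or suffix tree algorithms of the thesis; a simple token trie is swapped in here only to show why a shared index avoids the per-pattern cost of trying regular expressions one after another. The pattern syntax (`*` as a single-token wildcard), the pattern names and the function names are assumptions made for this example.

```python
import re

# Hypothetical patterns: literal tokens, with "*" matching exactly one token.
PATTERNS = {
    "ssh_accept": "Accepted password for * from *",
    "ssh_fail": "Failed password for * from *",
    "conn_closed": "Connection closed by *",
}


def to_regex(pattern):
    """Compile a pattern into an anchored regex; "*" captures one token."""
    parts = [r"(\S+)" if tok == "*" else re.escape(tok) for tok in pattern.split()]
    return re.compile(r"^" + r"\s+".join(parts) + r"$")


REGEXES = [(name, to_regex(p)) for name, p in PATTERNS.items()]


def parse_linear(message):
    """Try every regex in turn: the work grows with the number of patterns."""
    for name, rx in REGEXES:
        m = rx.match(message)
        if m:
            return name, list(m.groups())
    return None


def build_trie(patterns):
    """Index all patterns in one token trie; the key None marks a leaf."""
    root = {}
    for name, pattern in patterns.items():
        node = root
        for tok in pattern.split():
            node = node.setdefault(tok, {})
        node[None] = name
    return root


TRIE = build_trie(PATTERNS)


def parse_trie(message, trie=TRIE):
    """Walk the trie once per message: the work grows with message length,
    not with the number of patterns. Simplification: no backtracking, and a
    literal child is preferred over a wildcard child."""
    node, captured = trie, []
    for tok in message.split():
        if tok != "*" and tok in node:
            node = node[tok]
        elif "*" in node:
            node = node["*"]
            captured.append(tok)
        else:
            return None
    name = node.get(None)
    return (name, captured) if name is not None else None
```

Both functions return the pattern name and the captured tokens for a matching message, and `None` otherwise. Adding more patterns lengthens the loop in `parse_linear`, whereas `parse_trie` still performs one dictionary lookup per token.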

In conclusion, the thesis demonstrates that pattern based log parsing can be very efficient. However, manual pattern creation is expensive, so automated pattern generation algorithms are needed.

