Efficient Pattern Matching in Long Strings

OData support
Dr. Juhász Sándor
Department of Automation and Applied Informatics

Suffix arrays and suffix trees are well-known for their capability of efficiently solving different string processing problems including exact string matching, which has many uses in a variety of fields like computational molecular biology and search engines. In this thesis I present a way to use hash tables for exact string matching as well as my detailed comparison of the different approaches, throughout carefully selected test suites ranging from proteins to English texts. As an improvement to the basic construction algorithm I propose a worst-case O(n) method and the idea of directly storing the sole suffixes in the table. To make the pattern matching more efficient I suggest improvements to the basic matching algorithm and an idea to compress similar consecutive suffixes. The experimental results show that in many areas the hash table based version outperforms even the best known suffix array and suffix tree based solutions, thus indicate that this approach is not only of theoretical interest.


Please sign in to download the files of this thesis.