Implementations for approximation algorithms of the shortest common superstring problem

OData support
Dr. Csima Judit
Department of Computer Science and Information Theory

DNA sequencing is an important field of bioinformatics. In order of efficient sequencing, the new generation sequencing techniques make use of precise lab procedures and the ever increasing computing power of the modern computers.

One of the most prominent DNA sequencing method is the so-called shotgun sequencing. Sequencing a long DNA molecule is prone to errors, so the main idea behind shotgun sequencing is to break the molecule into smaller, readable fragments. Determining the original sequence from these fragments is called the shortest common superstring problem, which is an NP-hard problem, so approximation algorithms are used for solving.

The greedy algorithm and the so-called cycle-cover algorithm are two of those approximations. Both of these algorithms require the suffix-prefix lengths for all fragment pairs, whose naive implementation is inefficient, so the preprocessing of the strings and the usage of special data structures are needed, like suffix tree, suffix array and prefix tree.

The aim of this thesis is to present an overview and an implementation of these approximation algorithms with the aforementioned data structures, along with a comparison of their effectiveness.


Please sign in to download the files of this thesis.