Development of a graph analytics algorithm in Apache Spark

OData support
Prekopcsák Zoltán
Department of Telecommunications and Media Informatics

The aim of this thesis work is to provide a distributed solution to a graph problem, namely the maximum clique problem. Providing a scalable parallel implementation is a crucial goal, therefore the solution is based on the Apache Spark framework. Core functionality of Spark contains several elements which can be used as the basis of the implementation, while its graph processing component, GraphX supports development with graph specific functions.

There are several published algorithms dealing with the problem in question, which are thoroughly examined in this thesis. These include helper procedures beside exact and approximate solution algorithms.

Most of the presented algorithms are sequential, therefore the main objective of this work is to parallelize them in a way which is suitable for Spark. To achieve this, many alternatives are presented. The main solutions differ in the traversal of the search space and the distribution of data among worker nodes. In addition, minor optimizations are presented, which further improve the performance of the solutions. Part of these optimizations have theoretical background, which is independent of the concrete technology used, while others are Spark-related technological refinements.

Solutions are validated by performance tests, which were performed in a cluster. Running times were measured with input graphs differing in size and kind, which guarantees the comprehensive comparison of the solutions.

As it can be seen in this thesis, the architecture of Spark is not a perfect fit for the maximum clique problem. Nevertheless, it is suitable for the implementation of a scalable solution, which performs well on a wide range of input graphs.


Please sign in to download the files of this thesis.