Comparison of distributed decision tree algorithms in Apache Spark

Prekopcsák Zoltán
Department of Telecommunications and Media Informatics

Our world depends more and more on digitized information: we store large quantities of it and want to process it quickly and accurately. Meeting these requirements calls for Big Data technology, which is rapidly gaining popularity. Big Data tools are engineered to process information in a distributed manner, and their applications are varied, including distributed classification problems. Binary classification can be performed with decision trees, a family of machine learning algorithms. Two distinct implementations of the same decision tree algorithm can produce results of different precision and with different run times. In a production environment, where even small differences in run time can have a large impact, it is crucial to select the best one.

One of the most popular Big Data toolsets is Apache Spark. Spark has its own machine learning library and can optionally use H2O, an external machine learning library. The main goal of my thesis is to compare these algorithms by precision, scalability and run time. I designed and implemented a framework in Scala that builds different models from externally supplied parameters and saves the measurements to a file for later processing.

In this document, I compare Spark’s built-in decision-tree-based algorithm, Gradient Boosted Trees, with the H2O implementation of the same algorithm. The comparison is intended to help the reader decide which implementation to use for a classification problem. In the measurements, with the presented parameters and data sets, H2O proved faster and more precise than Spark’s built-in implementation.

