Data mining techniques in Hadoop environment

Dr. Ekler Péter
Department of Automation and Applied Informatics

In my thesis I implemented clustering and outlier detection algorithms for Apache Hadoop, specifically k-Means and Local Outlier Factor (LOF). The primary goal was scalability, so that these applications can be used to analyze datasets that are difficult to handle with traditional methods because of the size of the data and the computationally intensive nature of the algorithms. The MapReduce LOF, Spark LOF and MapReduce k-Means programs performed well on the test data, which indicates that they scale. I verified the correctness of the implementations, so they can be regarded as generally reliable. Visualizing the results showed the additional information expected from the examined algorithms, for both clustering and outlier detection. As for the Big Data technologies used for the implementation, Apache MapReduce proved to be a very reliable and mature framework, while Apache Spark appeared to me to be an easy-to-use Big Data tool that has lately been developing very rapidly.
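To illustrate how k-Means fits the MapReduce model, the following is a minimal single-machine sketch in Python of one iteration, split into a map step and a reduce step. It is not the thesis implementation: the function names (map_point, reduce_cluster, kmeans_iteration) are hypothetical, the in-memory dictionary only stands in for Hadoop's shuffle phase, and Euclidean distance is assumed.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def map_point(point, centroids):
    """Map step (hypothetical name): emit (index of nearest centroid, point)."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, point


def reduce_cluster(points):
    """Reduce step (hypothetical name): average the points of one cluster."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans_iteration(points, centroids):
    """Run one k-Means iteration and return the updated centroids."""
    groups = defaultdict(list)  # stands in for the shuffle phase
    for p in points:
        key, value = map_point(p, centroids)
        groups[key].append(value)
    # A centroid with no assigned points keeps its previous position.
    return [
        reduce_cluster(groups[i]) if i in groups else centroids[i]
        for i in range(len(centroids))
    ]


points = [(0, 0), (0, 2), (10, 10), (10, 12)]
centroids = [(0, 0), (10, 10)]
new_centroids = kmeans_iteration(points, centroids)
```

In a real cluster the map step would run in parallel over partitions of the input, and the iteration would be repeated until the centroids stop moving; the sketch only shows how the work divides between the two steps.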

