Solving classification problems in a distributed environment

Supervisor:
Prekopcsák Zoltán
Department of Telecommunications and Media Informatics

As the amount of data available in the world increases, the data sets that need to be analyzed are growing rapidly, and computing and storage capacities are struggling to keep pace. In recent years the demand for analyzing these data sets has also grown rapidly, but current tools such as RapidMiner, SAS and IBM SPSS Modeler either cannot handle them or require workarounds. Many distributed computing frameworks have been developed to tackle these problems, but their limitations and possible applications are not yet well understood.

This thesis aims to evaluate the classification algorithms of the Apache Mahout framework with respect to their current state and their suitability for data mining applications. In the course of my work I surveyed the classification algorithms currently in use and studied the Hadoop distributed system. I designed and developed an easy-to-use test environment for evaluating some of the classification algorithms. Using this test environment, I performed measurements to compare the selected Mahout algorithms with those already available in RapidMiner.
