Scalable methods for measuring data mining model performance

Supervisor:
Prekopcsák Zoltán
Department of Telecommunications and Media Informatics

Measuring model performance is an essential step in any successful data analysis project. Performance metrics make alternative solutions comparable to each other, letting us choose the right one, whether that means the most appropriate model or the optimal parametrization. They also make it easier to communicate the decisions made over the course of a data mining project.

Calculating some of these performance indicators can be hard, however, because of their algorithmic complexity. In many cases the root cause is data growth, since conventional algorithms tend to fail on large data sets. The lack of alternative toolkits that return the desired results within a tolerable amount of time makes the issue even more pressing. In this work I present the problem of binary classification and the performance measurement of classifier models, in particular the Area Under Curve (AUC) metric and the algorithms for calculating it. These algorithms are suitable for evaluating models built on extremely large data sets within acceptable execution time, provided that a suitable computational architecture is available.
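To make the Area Under Curve metric concrete, the following is a minimal single-node sketch of one standard way to compute it exactly: the rank-sum (Mann-Whitney U) identity, which needs one sort of the scores. It is an illustration only, not necessarily one of the algorithms evaluated in the thesis; the function name `auc_by_ranks` and the convention that label 1 marks the positive class are assumptions made for this example.

```python
from typing import Sequence


def auc_by_ranks(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Exact AUC via the rank-sum identity (hypothetical helper, O(n log n)).

    Assumption: label 1 is the positive class, anything else is negative.
    Tied scores receive the average of the ranks they occupy.
    """
    n = len(labels)
    if n != len(scores):
        raise ValueError("labels and scores must have the same length")

    # Indices of the examples, ordered by ascending score.
    order = sorted(range(n), key=lambda i: scores[i])

    rank_sum_pos = 0.0
    n_pos = 0
    i = 0
    while i < n:
        # Find the end of the group of examples sharing the same score.
        j = i
        while j < n and scores[order[j]] == scores[order[i]]:
            j += 1
        # The group occupies 1-based ranks i+1 .. j; use their average.
        avg_rank = (i + 1 + j) / 2.0
        pos_in_group = sum(1 for k in range(i, j) if labels[order[k]] == 1)
        rank_sum_pos += avg_rank * pos_in_group
        n_pos += pos_in_group
        i = j

    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC is undefined without both classes present")

    # Mann-Whitney U statistic of the positive class, normalised to [0, 1].
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)
```

The sort dominates the cost, and it is this step that becomes the bottleneck once the scored data set no longer fits on a single machine.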

The approaches mentioned above were implemented in Python for a single-node environment, and some of them were also adapted to clustered computation using a corresponding member of the Hadoop ecosystem. I also give a detailed comparison of the measured results. My final goal was to provide a rich set of alternative solutions from which the most appropriate approach can always be easily selected.
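As an illustration of how an AUC calculation can be split across cluster nodes, the sketch below uses a histogram approximation: each data partition is reduced to per-bin counts of negatives and positives, the partial histograms are summed, and the AUC is estimated from the merged counts. This is a generic technique chosen for the example, not a description of the implementations in the thesis. All names (`partial_histogram`, `merge_histograms`, `auc_from_histogram`) are hypothetical, scores are assumed to lie in [0, 1], and label 1 is assumed to mark the positive class. No Hadoop API is used; the map and reduce steps are plain functions.

```python
from typing import Iterable, List, Tuple

Histogram = List[List[int]]  # one [negatives, positives] pair per score bin


def partial_histogram(rows: Iterable[Tuple[int, float]],
                      n_bins: int = 1000) -> Histogram:
    """'Map' step: summarise one partition of (label, score) rows.

    Assumption: scores lie in [0, 1]; out-of-range scores are clamped.
    """
    hist = [[0, 0] for _ in range(n_bins)]
    for label, score in rows:
        b = max(0, min(int(score * n_bins), n_bins - 1))
        hist[b][1 if label == 1 else 0] += 1
    return hist


def merge_histograms(a: Histogram, b: Histogram) -> Histogram:
    """'Reduce' step: bin-wise sum of two histograms of equal length."""
    return [[x[0] + y[0], x[1] + y[1]] for x, y in zip(a, b)]


def auc_from_histogram(hist: Histogram) -> float:
    """Approximate AUC from merged counts.

    Pairs falling into the same bin are counted as ties (weight 0.5),
    so the result is exact only when no bin mixes distinct scores of
    both classes.
    """
    n_neg = sum(neg for neg, _ in hist)
    n_pos = sum(pos for _, pos in hist)
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC is undefined without both classes present")

    neg_below = 0   # negatives in strictly lower bins
    pairs = 0.0     # correctly ordered (positive, negative) pairs
    for neg, pos in hist:  # bins are in ascending score order
        pairs += pos * (neg_below + 0.5 * neg)
        neg_below += neg
    return pairs / (n_pos * n_neg)
```

Because the merged histogram has a fixed size regardless of the number of rows, only the per-partition summaries have to travel over the network; the price is an approximation error that shrinks as the number of bins grows.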
