In recent years the scope and amount of data collected by companies have increased significantly. Handling this Big Data has required the development of new storage and data analysis methods. Clustering is a widely used tool for data analysis. Classical clustering algorithms do not use any prior knowledge. In practice, however, some a priori information is often available, and it can be used to improve the clustering result. Semi-supervised clustering algorithms exploit this advantage.
One of the most significant platforms for Big Data is the Apache Hadoop framework, which provides almost unlimited scalability. On this platform, Apache Spark is one of the fastest-growing programming environments.
The aim of my thesis is to implement and validate a semi-supervised clustering algorithm. The most important aspect of the implementation is scalability.
After a theoretical overview of clustering and semi-supervised clustering and a description of some specific clustering algorithms, I briefly introduce the Spark technology and the Scala language. I then explain the details of the implementation and the experiments performed to validate the performance of the algorithm.