Semi-supervised learning techniques

Gáspár Csaba
Department of Telecommunications and Media Informatics

Thanks to technical development, it is now possible to record and process huge amounts of data. There is a growing need for data mining to reveal hidden features and discover relationships in the data. One group of data mining algorithms, supervised learning techniques, solves the following problem: predicting a target attribute using a previously trained model. For example, in the finance sector, one task in the process of offering credit is to predict whether the client will be able to pay the credit back.

Training such models requires a sufficient amount of well-structured data in order to achieve high performance. Semi-supervised learning algorithms can be used when the target variable is missing or when obtaining it is not worthwhile.

In my thesis, my goal was to build a semi-supervised learning system that handles missing values of the target variable efficiently. I designed the process so that it uses existing supervised data mining models and then, in an iterative cycle, processes the results and feeds them back, gradually approaching the best available performance. During processing I used a graph-based representation of the data with missing values in the target variable. I implemented the algorithm using the RapidMiner data mining software and Java technologies.
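The iterative feedback idea described above can be illustrated with a minimal self-training sketch. This is not the thesis's algorithm: the graph-based representation is not reproduced here, and the nearest-centroid base model, the confidence measure, the `threshold` value and all function names are assumptions chosen only to keep the example short.

```python
from statistics import mean

def train_centroids(points, labels):
    """Hypothetical base supervised model: one centroid (mean value) per class."""
    return {c: mean(p for p, l in zip(points, labels) if l == c)
            for c in set(labels)}

def predict(centroids, x):
    """Return (label, confidence) for a 1-D point x.

    Confidence is 1 minus the nearest distance's share of all distances
    (an assumption made for this sketch, not a standard definition)."""
    dists = {c: abs(x - m) for c, m in centroids.items()}
    label = min(dists, key=dists.get)
    total = sum(dists.values())
    confidence = 1.0 - dists[label] / total if total > 0 else 1.0
    return label, confidence

def self_train(labeled, unlabeled, threshold=0.75, max_rounds=10):
    """Retrain repeatedly, feeding confident predictions back as labels."""
    points = [x for x, _ in labeled]
    labels = [y for _, y in labeled]
    pool = list(unlabeled)
    for _ in range(max_rounds):
        model = train_centroids(points, labels)
        scored = [(x, *predict(model, x)) for x in pool]
        confident = [(x, y) for x, y, conf in scored if conf >= threshold]
        if not confident:
            break  # nothing new to feed back, so stop iterating
        for x, y in confident:
            points.append(x)
            labels.append(y)
        pool = [x for x, _, conf in scored if conf < threshold]
    return train_centroids(points, labels), pool

# Toy data: two labeled records, five records with a missing target value.
model, still_unlabeled = self_train(
    labeled=[(0.0, 0), (10.0, 1)],
    unlabeled=[1.0, 2.0, 8.0, 9.0, 5.0],
)
```

In this toy run the four points close to a class centroid receive labels in the first round, while the ambiguous point 5.0 stays unlabeled because its confidence never reaches the threshold.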

I evaluated the implemented algorithm with a test system built for this purpose. I used financial, insurance and scientific databases and a wide range of input parameters to optimize the performance of the procedure. For comparison, I also evaluated a common supervised learning algorithm under the same conditions. Finally, I compared the results of the two models.

The semi-supervised algorithm performed better when only a small amount of training data was available and the proportion of missing values in the target variable was high. In this case I achieved an average AUC increase of 0.032 across the tested databases. Based on these results and conclusions, I identified directions for further improvement of the algorithm.
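For readers unfamiliar with the metric, the sketch below shows one standard way to compute AUC and the difference between two models on the same test records. The scores and labels are invented toy values, not results from the thesis, and the function name `auc` is an assumption.

```python
def auc(scores, labels):
    """AUC as the probability that a randomly chosen positive example
    is scored higher than a randomly chosen negative one (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented toy data: true labels and the scores of two hypothetical models.
labels = [1, 1, 0, 0]
supervised_scores = [0.9, 0.4, 0.5, 0.1]
semi_supervised_scores = [0.9, 0.6, 0.5, 0.1]

auc_gain = auc(semi_supervised_scores, labels) - auc(supervised_scores, labels)
```

An average gain such as the one reported above would be obtained by computing this difference for each database and averaging the results.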

