Semi-supervised learning plays an important role in data mining problems where the target variable, also known as the label, is missing from most of the available data. Rather than discarding the unlabeled instances, it aims to exploit their known attributes alongside the labeled data. The data mining literature documents several semi-supervised methods, but the most widely used data mining software suites do not yet offer this functionality.
In the course of my work I surveyed the different semi-supervised learning methods and selected two, Self-training and Co-training, for further investigation. I implemented both methods in Java and deployed them as an extension of the open-source RapidMiner data mining software suite.
RapidMiner’s existing validation tools were not applicable to the evaluation of these methods, so I developed my own cross-validation operator to measure the effectiveness of semi-supervised learning. I then tested the two learning methods on several freely available datasets and compared them with a baseline method that does not use the unlabeled data.