Clustering Based Classification Requiring Minimal Labeled Data

Dr. Szűcs Gábor
Department of Telecommunications and Media Informatics

In many real-world data mining problems, labeled instances are hard to obtain: labeling can be expensive or time-consuming, or the required data may simply be unavailable. In such cases only a small labeled data set exists. Supervised learning methods are trained on labeled data, so they often perform poorly when the labeled set is too small. Semi-supervised learning algorithms, by contrast, can learn from both labeled and unlabeled instances. Clustering based classification is a semi-supervised learning technique that first clusters the labeled and unlabeled data together, guided by the labeled instances, and then classifies the data set. Active learning builds on the idea that a learning algorithm which can choose the instances it learns from may achieve more accurate results with fewer labeled instances. In my thesis I examined clustering based classification and active learning together, because they approach the same problem from different directions. I designed and implemented a clustering based classification system with active learning and tested it on a given data set. The method is as follows: when only a small labeled data set can be obtained, use a semi-supervised learning technique and query the labels of the instances that are most informative for the classification algorithm. I compared the results of the system with those of supervised learning and clustering based classification. Using cost-benefit analysis, I calculated the savings and additional income resulting from the use of the implemented system.
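The two ideas in the abstract can be illustrated with a minimal sketch. This is not the system implemented in the thesis. It assumes a seeded k-means variant for the clustering step (one cluster per class, with labeled instances pinned to their class's cluster) and a distance-margin rule for choosing the next instance to label. The function names `seeded_kmeans` and `most_uncertain`, the Euclidean distance, and the toy data in the usage note are all illustrative assumptions.

```python
import math


def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def seeded_kmeans(X, labeled, iterations=20):
    """Cluster X with the guidance of a few labeled instances.

    X is a list of feature tuples; labeled maps an index in X to its class.
    Assumption: one cluster per class, and every class has at least one seed.
    Returns the predicted class of every instance and the final centroids.
    """
    classes = sorted(set(labeled.values()))
    # Initialise each centroid from the labeled seeds of its class.
    centroids = {
        c: mean([X[i] for i, y in labeled.items() if y == c]) for c in classes
    }
    assign = []
    for _ in range(iterations):
        # Labeled instances keep their class; the rest join the nearest centroid.
        assign = [
            labeled.get(i, min(classes, key=lambda c: math.dist(x, centroids[c])))
            for i, x in enumerate(X)
        ]
        updated = {
            c: mean([x for x, a in zip(X, assign) if a == c]) for c in classes
        }
        if updated == centroids:  # converged
            break
        centroids = updated
    return assign, centroids


def most_uncertain(X, labeled, centroids):
    """Pick the unlabeled instance whose two nearest centroids are closest in distance.

    A small margin means the instance lies near a cluster boundary, which is
    used here as a stand-in for "most informative". Needs at least two classes.
    """
    def margin(x):
        d = sorted(math.dist(x, c) for c in centroids.values())
        return d[1] - d[0]

    candidates = [i for i in range(len(X)) if i not in labeled]
    return min(candidates, key=lambda i: margin(X[i]))
```

A possible usage, on made-up two-dimensional points: call `seeded_kmeans(X, {0: "a", 3: "b"})` to obtain class predictions for all instances, then pass the returned centroids to `most_uncertain` to get the index of the instance whose label should be requested next. After adding that label to the dictionary, the two steps can be repeated until the labeling budget is spent.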

