Data representation for multimedia classification by unsupervised learning

Dr. Szűcs Gábor
Department of Telecommunications and Media Informatics

This essay focuses on dimensionality reduction techniques for real datasets derived from multimedia data. The aim of my work was to reduce the very large number of features in these datasets, which can increase the efficiency of classifying unlabeled data (assuming a linear classifier).

The reduction methods were run in SPSS Clementine 12 and Weka 3.6. Predictions were ranked by the Area under the Learning Curve score, which is based on the Area Under the ROC Curve. The test data for comparing the different methods were provided by the Unsupervised and Transfer Learning Challenge: data sets from various domains such as handwriting recognition, text processing, video processing, ecology, and image recognition, with between 100 and 47236 features.
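To make the scoring concrete, the following is a minimal sketch in Python rather than the challenge's official scoring code. It assumes that the Area Under the ROC Curve is computed as the fraction of positive/negative pairs ranked correctly, and that the learning curve plots this value against the number of labeled examples on a log2 axis. The function names and the sample numbers are hypothetical.

```python
import math

def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def area_under_learning_curve(n_labeled, auc_values):
    """Trapezoidal area of AUC against log2(number of labeled examples),
    divided by the width of the x-axis so the result stays between 0 and 1.
    The log2 axis and this normalization are assumptions of the sketch."""
    xs = [math.log2(n) for n in n_labeled]
    area = sum(
        (xs[i + 1] - xs[i]) * (auc_values[i] + auc_values[i + 1]) / 2
        for i in range(len(xs) - 1)
    )
    return area / (xs[-1] - xs[0])

# Hypothetical example: four test items, then a four-point learning curve.
example_auc = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
example_alc = area_under_learning_curve([1, 2, 4, 8], [0.5, 0.6, 0.7, 0.8])
```

With this kind of score, a representation is rewarded when the classifier reaches a high AUC already with few labeled examples.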

Different extraction methods of factor analysis were applied to the available data in order to compare their efficiency. I examined how they behaved with varying features. Moreover, I studied the effect of leaving out the all-zero features and the effect of normalization.
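The two preprocessing steps mentioned above can be sketched as follows. This is only an illustration in plain Python, not the Clementine or Weka workflow used in the essay. It assumes that "normalization" means scaling each feature to zero mean and unit variance; the function names and the toy data are hypothetical.

```python
from statistics import mean, pstdev

def drop_all_zero_features(rows):
    """Remove every column that is zero in all rows; also return the kept column indices."""
    keep = [j for j in range(len(rows[0])) if any(row[j] != 0 for row in rows)]
    return [[row[j] for j in keep] for row in rows], keep

def standardize(rows):
    """Scale each column to zero mean and unit variance (z-score, an assumed choice)."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # A constant column has zero deviation; divide by 1.0 instead to avoid an error.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Toy data: columns 0 and 2 carry no information.
data = [
    [0, 2.0, 0, 1.0],
    [0, 4.0, 0, 3.0],
    [0, 6.0, 0, 5.0],
]
reduced, kept = drop_all_zero_features(data)
scaled = standardize(reduced)
```

Dropping the all-zero columns first keeps the later factor analysis from handling features that cannot contribute to any factor.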

My essay gives an overview of data dimensionality reduction methods and also highlights their shortcomings. It shows how important proper data preparation has become now that billions of bytes of data appear each day and valuable knowledge has to be extracted from them.

