Qualifying cluster analysis methods applied to scriptinformatic database

Dr. Hosszú Gábor
Department of Electron Devices

Data mining is a powerful tool for discovering meaningful patterns in large data sets, using methods from machine learning, statistics and database systems. In this thesis, data mining tools are applied to a scriptinformatics data set. Scriptinformatics is a field of computer science that deals with the evolution of writing systems. In particular, cluster analysis is used to partition the data set into clusters and thereby reveal relations between different scripts. The scriptinformatics data set was imported into MATLAB, and the essential types of cluster analysis algorithms were run iteratively with different clustering methods. The results reveal significant properties of the connections between scripts. Based on the pairwise distances and similarities, the procedure formed clusters in which the most similar scripts were grouped together. For this data set, two or three clusters were found to be optimal. The results were then validated with the elbow method and the silhouette method, both of which confirmed this optimal number of clusters. Finally, the quality of the clustering was assessed with the cophenetic correlation coefficient, one of the most widely used measures of clustering quality.
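The validation workflow described above can be illustrated with a minimal sketch. It is not the thesis implementation: the thesis used MATLAB, whereas this sketch uses Python with NumPy and SciPy, and the feature matrix is synthetic placeholder data (two artificial groups), not the scriptinformatics data set. Average linkage, the Euclidean metric and the helper name `silhouette_mean` are assumptions made for illustration only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet, fcluster
from scipy.spatial.distance import pdist, squareform


def silhouette_mean(dist_matrix, labels):
    """Mean silhouette width computed from a square distance matrix."""
    scores = []
    for i in range(len(labels)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself from its own cluster
        if not same.any():
            scores.append(0.0)  # singleton cluster: silhouette taken as 0
            continue
        a = dist_matrix[i, same].mean()  # mean distance within own cluster
        b = min(dist_matrix[i, labels == other].mean()  # nearest other cluster
                for other in np.unique(labels) if other != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))


# Synthetic stand-in for a script feature matrix (assumption, not real data):
# 20 "scripts" with 4 numeric features, drawn from two well-separated groups.
rng = np.random.default_rng(0)
features = np.vstack([rng.normal(0.0, 0.5, (10, 4)),
                      rng.normal(5.0, 0.5, (10, 4))])

condensed = pdist(features, metric="euclidean")  # pairwise distances
tree = linkage(condensed, method="average")      # hierarchical clustering

# Cophenetic correlation: agreement between dendrogram and original distances.
ccc, _ = cophenet(tree, condensed)

# Silhouette method: cut the tree into k clusters and compare mean widths.
square = squareform(condensed)
scores = {k: silhouette_mean(square, fcluster(tree, t=k, criterion="maxclust"))
          for k in range(2, 6)}
best_k = max(scores, key=scores.get)

print(f"cophenetic correlation: {ccc:.3f}")
print(f"silhouette-optimal number of clusters: {best_k}")
```

A cophenetic correlation close to 1 indicates that the dendrogram preserves the original pairwise distances well, and the number of clusters with the highest mean silhouette width is taken as the preferred cut. The elbow method would be applied analogously, by plotting a within-cluster dispersion measure against the number of clusters.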

