A classic data mining task is to analyze and classify texts by topic. The ever-growing textual content of the Internet spans billions of websites that could serve as input for such classification. The student's task is to build a large-scale classifier that can identify the topic of a website based on textual or other data found on the website itself or obtained from other sources.
The goal of this thesis is to build a multiclass classifier that can identify websites based on their textual content. In the first part of the thesis, I present the pre-processing steps: data collection, data processing and data cleaning. By the end of this part, the database is ready for use. In the second part, I describe several classification algorithms, such as Naive Bayes, Random Forest, Gradient Boosting and Support Vector Machine. I also present some measures for evaluating the performance of classification methods. Finally, I apply the classifiers to the completed database, evaluate the results, and select the algorithm that gives the best result for this task.
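
To illustrate the kind of evaluation measure meant here, the following minimal sketch computes accuracy and macro-averaged precision, recall and F1 for a multiclass prediction. The function name `macro_scores`, the example topic labels and the choice of macro averaging are assumptions made for illustration only; the text above does not state which measures are actually used.

```python
from collections import Counter


def macro_scores(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall and F1.

    Hypothetical helper for illustration; not taken from the thesis.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one label is required")

    # Per-class counts of true positives, false positives, false negatives.
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1
            fn[truth] += 1

    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        predicted = tp[label] + fp[label]  # times this class was predicted
        actual = tp[label] + fn[label]     # times this class really occurred
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "macro_precision": sum(precisions) / n,
        "macro_recall": sum(recalls) / n,
        "macro_f1": sum(f1s) / n,
    }


if __name__ == "__main__":
    # Made-up topic labels for six websites (illustrative data only).
    truth = ["news", "sport", "shop", "news", "sport", "shop"]
    guess = ["news", "sport", "news", "news", "shop", "shop"]
    for name, value in macro_scores(truth, guess).items():
        print(f"{name}: {value:.3f}")
```

Macro averaging gives every class the same weight regardless of how many websites belong to it, which is one reasonable choice when topic classes are unbalanced; a weighted or micro average would be an equally valid alternative.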