In this document I explore the possibility of classifying documents by their author.
First, I introduce the basics of data mining, and of text mining in particular, starting with clustering: I demonstrate probability-based clustering, then present a widely used data model, the topic model, and an algorithm built on it, known as Latent Dirichlet Allocation.
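To make the topic model concrete before the full discussion, the following is a minimal sketch of collapsed Gibbs sampling for Latent Dirichlet Allocation, written in plain Python. It is purely illustrative: the function name, hyperparameter values, and the toy representation of documents as token lists are my own assumptions, not part of this document's later experiments.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Illustrative collapsed Gibbs sampler for LDA over tokenised docs."""
    rng = random.Random(seed)
    vocab = {w for doc in docs for w in doc}
    V = len(vocab)
    ndk = [[0] * n_topics for _ in docs]               # doc-topic counts
    nkw = [defaultdict(int) for _ in range(n_topics)]  # topic-word counts
    nk = [0] * n_topics                                # tokens per topic
    z = []                                             # topic of each token
    for d, doc in enumerate(docs):                     # random initialisation
        zd = []
        for w in doc:
            k = rng.randrange(n_topics)
            zd.append(k)
            ndk[d][k] += 1
            nkw[k][w] += 1
            nk[k] += 1
        z.append(zd)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                            # remove token's count
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # Resample its topic from the collapsed posterior.
                weights = [
                    (ndk[d][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + V * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k                            # add the count back
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return z, ndk
```

The returned doc-topic counts `ndk` give, after normalisation, the per-document topic mixture that the topic model is built around.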
After that I describe what classification is and how it differs from clustering. I introduce one of the most widely used text classification algorithms, the Naive Bayes algorithm, and show its model and how it works. I then present an iterative technique for improving the model and the precision of Naive Bayes, called Expectation-Maximization, which builds its probability model not only from the labelled data but from the unlabelled data as well.
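The idea of combining Naive Bayes with Expectation-Maximization can be sketched in a few dozen lines: train a multinomial Naive Bayes model on the labelled documents, use it to assign soft class memberships to the unlabelled documents (E-step), then re-estimate the model from labelled and fractionally-labelled documents together (M-step), and iterate. This is a minimal sketch under my own assumptions (documents as token lists, Laplace smoothing, fixed iteration count), not the exact procedure evaluated later in the document.

```python
import math
from collections import Counter

def train_nb(docs, soft_labels, n_classes, vocab):
    """M-step: estimate priors and word probabilities from (possibly
    fractional) class memberships, with Laplace smoothing."""
    prior = [1.0] * n_classes
    word_counts = [Counter() for _ in range(n_classes)]
    for doc, probs in zip(docs, soft_labels):
        for c, p in enumerate(probs):
            prior[c] += p
            for w in doc:
                word_counts[c][w] += p
    total = sum(prior)
    log_prior = [math.log(p / total) for p in prior]
    log_pw = []
    for c in range(n_classes):
        denom = sum(word_counts[c].values()) + len(vocab)
        log_pw.append({w: math.log((word_counts[c][w] + 1) / denom)
                       for w in vocab})
    return log_prior, log_pw

def posterior(doc, log_prior, log_pw):
    """E-step for one document: P(class | doc) under the NB model.
    Words outside the training vocabulary are simply ignored."""
    scores = [lp + sum(pw.get(w, 0.0) for w in doc)
              for lp, pw in zip(log_prior, log_pw)]
    m = max(scores)
    exp = [math.exp(s - m) for s in scores]
    s = sum(exp)
    return [e / s for e in exp]

def nb_em(labelled, labels, unlabelled, n_classes, n_iter=10):
    """Semi-supervised Naive Bayes: alternate E- and M-steps."""
    vocab = {w for d in labelled + unlabelled for w in d}
    hard = [[1.0 if c == y else 0.0 for c in range(n_classes)]
            for y in labels]
    soft = [[1.0 / n_classes] * n_classes for _ in unlabelled]
    for _ in range(n_iter):
        model = train_nb(labelled + unlabelled, hard + soft,
                         n_classes, vocab)
        soft = [posterior(d, *model) for d in unlabelled]
    return model
```

The key point the sketch shows is that the unlabelled documents enter the M-step with fractional counts, so they shape the probability model without ever receiving a hard label.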
Finally, I introduce the basic problems that arise during text classification and demonstrate the impact of different solutions to them, focusing mostly on frequency-based feature selection, the size of the training set, and the use of semantic fields. In these examples the data consists of online news articles.
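Frequency-based selection can be illustrated with a short filter that keeps only terms whose document frequency falls inside a given band: very rare terms are mostly noise, while near-ubiquitous terms (articles, common function words) carry little information about the class. The function name and thresholds below are illustrative assumptions, not the settings used in the experiments.

```python
from collections import Counter

def select_by_frequency(docs, min_df=2, max_df_ratio=0.9):
    """Keep terms appearing in at least `min_df` documents but in at
    most `max_df_ratio` of all documents (document frequency band)."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # count each term once per document
    limit = max_df_ratio * len(docs)
    return {w for w, c in df.items() if min_df <= c <= limit}

def filter_docs(docs, vocab):
    """Project documents onto the selected vocabulary."""
    return [[w for w in doc if w in vocab] for doc in docs]
```

In a pipeline, the selected vocabulary would be computed on the training set only and then applied unchanged to the test documents.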