Text analysis has become one of the most important problems of recent years. Its applications are widespread, and the rapidly growing volume of text written in natural languages makes them essential.
Stylometry is an important and increasingly popular field within text analysis that focuses on the style-based categorization of texts. It is well suited to identifying the author of a text, but it can be used in a variety of other ways. Two problems stand in the way, however: choosing the algorithm and its parameterization for an analysis is very difficult, and the available tools are, with a few exceptions, poorly suited to researchers from non-IT fields because they require installation and programming knowledge.
I have implemented software that addresses both problems, building on an existing library. It lets the user upload texts to a database server through a web client and conduct experiments on them, with all calculations performed on the server. Using it does not require any installation.
I have also given a solution to the problem of parameterization. I have developed two methods that can build on each other's results. The first is a heuristic analyser that tries to guess the optimal parameters from various properties of the corpus. The second is more sophisticated but considerably slower; it is based on a well-known, widely used technique, a local search algorithm.
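To make the second method concrete, the following is a minimal sketch of one common local search, hill climbing over a discrete parameter grid. It is an illustration under stated assumptions, not the software's actual implementation: the parameter names (`n_features`, `ngram_size`), the grid, and the toy scoring function are all hypothetical, and starting the search from the heuristic analyser's guess is only one plausible way for the two methods to build on each other.

```python
from typing import Callable, Dict, Iterator, List

Params = Dict[str, int]
Grid = Dict[str, List[int]]


def neighbours(params: Params, grid: Grid) -> Iterator[Params]:
    """Yield parameter sets that differ from `params` by one grid step in one parameter."""
    for name, values in grid.items():
        i = values.index(params[name])
        for j in (i - 1, i + 1):
            if 0 <= j < len(values):
                yield {**params, name: values[j]}


def hill_climb(start: Params, grid: Grid, score: Callable[[Params], float]) -> Params:
    """Move to the best-scoring neighbour until no neighbour improves the score."""
    current, best = start, score(start)
    while True:
        candidates = [(score(p), p) for p in neighbours(current, grid)]
        if not candidates:
            return current
        top_score, top = max(candidates, key=lambda c: c[0])
        if top_score <= best:
            return current  # local optimum reached
        current, best = top, top_score


# Hypothetical grid and toy score; a real score would be, for example,
# the classification accuracy of a stylometric analysis run with `p`.
grid: Grid = {"n_features": [100, 200, 300, 400, 500], "ngram_size": [1, 2, 3, 4]}


def toy_score(p: Params) -> float:
    return -(abs(p["n_features"] - 300) + 100 * abs(p["ngram_size"] - 3))


heuristic_guess: Params = {"n_features": 100, "ngram_size": 1}  # assumed starting point
result = hill_climb(heuristic_guess, grid, toy_score)
```

Each step evaluates every neighbouring parameter set, which is why such a search is much slower than a one-shot heuristic when a single evaluation means running a full analysis. Like any local search, it can stop at a local optimum, so a good starting point matters.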
I have created a wizard that leads the user through all the parameters step by step, with tips and explanations for each of them. It also lets the user export and import the parameter settings of an analysis for later reuse.
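As an illustration of the export and import step, parameter settings could be serialised to a text format such as JSON. This is a sketch only: the file format the software actually uses is not stated here, and the function names and parameter names below are hypothetical.

```python
import json
from typing import Any, Dict


def export_params(params: Dict[str, Any]) -> str:
    """Serialise parameter settings to a JSON string with a stable key order."""
    return json.dumps(params, indent=2, sort_keys=True)


def import_params(text: str) -> Dict[str, Any]:
    """Parse previously exported settings, rejecting anything that is not a JSON object."""
    params = json.loads(text)
    if not isinstance(params, dict):
        raise ValueError("expected a JSON object mapping parameter names to values")
    return params


# Hypothetical settings, round-tripped through the text form.
settings = {"n_features": 300, "ngram_size": 3, "distance": "cosine"}
restored = import_params(export_params(settings))
```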
I have conducted an experiment in which a user with little experience in stylometry and its tools tried to set the parameters by hand, and I compared the outcome with the software's performance. In a second experiment, a published experiment by one of the leading researchers in the field served as the reference point, and the software achieved results close to it.