The Faculty of Electrical Engineering and Informatics (VIK) at the Budapest University of Technology and Economics (BME) has made all theses publicly available since 2010. Besides providing free access to information, this has also opened the door to automated processing of the documents.
As a two-person team, we have created a solution that efficiently processes the BME VIK dataset and analyzes the similarities between the theses. The solution can also compare these theses to well-known external sources, which we demonstrate by comparing our data to the Wikipedia dataset.
In this thesis I present the functional principles of our framework and algorithms. To process large text databases in a timely manner, we used several special implementation techniques, which I also cover in this thesis. Using our algorithm, we have already identified several suspicious overlaps between existing theses. By extending the set of reference documents, we will be able to obtain a reliable plagiarism detector for use in higher education.