Print media is steadily losing importance worldwide, while news websites attract millions of visitors every day. This trend can also be observed in Hungary: according to estimated rankings, Hungarian news portals are among the most visited websites in the country.
There is a wide range of Hungarian news websites, so the articles published on them can differ in several respects: how thoroughly the topic is covered, the language and style of the article, and the political viewpoint. It can be interesting to compare how different news websites report the same story; however, manually searching for articles on the same subject is not efficient.
My thesis aims to implement a system that extracts news articles from Hungarian news websites, finds articles that cover the same topic by analyzing their content, and finally presents them to the user for comparison in the browser. I have developed a complex system to solve these problems; in my thesis I describe its architecture, its modules, and the details of its behaviour.