Development of a Knowledge-based Text Mining System

Dr. Mészáros Tamás Csaba
Department of Measurement and Information Systems

Digital humanities is an interdisciplinary field at the intersection of computer science and the humanities. Topics such as the creation and management of digital archives, the computer-aided analysis of artistic artifacts, and the tracking and forecasting of cultural trends all belong to its domain. Interpreted more narrowly, as the scientific background of my work, it concerns the analysis of digitized historical written texts and the construction of corpora from such works.

The aim of my thesis was to design and implement a system that can serve as a tool for digital humanities research: a knowledge-based text mining system for historical works that makes use of domain-specific knowledge.

Within the context of the system, a textual work comprises a whole range of inputs: the source text itself, the authorial dictionary, the critical notes associated with the text, and the list of named entities it contains. From these inputs a model is created that allows formalized knowledge to be attached to the work in question. To this end, the system uses the supplied domain-specific knowledge (e.g. the form variants or parts of speech of entries in the authorial dictionary, or the critical notes associated with given text segments) as well as external knowledge sources such as DBpedia, from which it attempts to extract facts about named entities (places and persons).

Since the created model contains supplementary knowledge about the text, methods of analysis that would otherwise not be possible can be applied to it: dictionary-based text normalization, part-of-speech tagging of historical texts, query expansion through critical annotations, and the use of person- or place-related knowledge when filtering model entities. In addition, the system supports the execution of statistical (e.g. stylometric) queries.
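As an illustration of what dictionary-based text normalization could look like, the following is a minimal Python sketch and not the implementation described in the thesis. It assumes the authorial dictionary can be reduced to a mapping from a modern headword to its historical form variants; the dictionary entries, the function names and the tokenization rule are all hypothetical and chosen only for the example.

```python
import re

# Hypothetical authorial dictionary: modern headword -> historical form variants.
# The entries are invented for illustration and are not taken from the thesis data.
AUTHORIAL_DICTIONARY = {
    "király": ["királly", "kiraly"],
    "császár": ["tsászár", "császar"],
}

# A token is a run of word characters (Unicode-aware for Python 3 strings).
TOKEN_PATTERN = re.compile(r"\w+")


def build_variant_index(dictionary):
    """Invert the dictionary so every known form points to its headword."""
    index = {}
    for headword, variants in dictionary.items():
        index[headword.lower()] = headword
        for variant in variants:
            index[variant.lower()] = headword
    return index


def normalize(text, index):
    """Replace each known form variant with its headword.

    Unknown tokens are left untouched. Matched tokens are replaced by the
    headword exactly as stored, so their original capitalization is lost.
    """
    def replace(match):
        token = match.group(0)
        return index.get(token.lower(), token)

    return TOKEN_PATTERN.sub(replace, text)


variant_index = build_variant_index(AUTHORIAL_DICTIONARY)
normalized = normalize("A tsászár és a királly", variant_index)
```

A lookup table keeps the sketch simple, but it cannot tell apart variants whose normalized form depends on context; a real system would need additional knowledge, such as the parts of speech mentioned above, to resolve those cases.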

In my thesis I present the analytical capabilities of the implemented system, the related design choices, and its functional components. Kelemen Mikes's Letters from Turkey, extended with the critical annotations of literary historian Lajos Hopp, is used as an example throughout the document. The stylometric capabilities are demonstrated on a corpus assembled from the most notable representatives of Hungarian baroque literature, Kelemen Mikes among them.

While formulating the system's feature requirements I received help from the researchers at the Institute for Literary Studies of the Hungarian Academy of Sciences, located at the Research Center for the Humanities, as well as from my supervisor, Dr. Tamás Mészáros, who maintained contact with them. Supplementary data related to the Letters were also provided by the Institute's associates.

