Various tools are available for processing natural language texts, but only a few of them scale out properly when processing large datasets.
This thesis was written as part of a project aiming to develop intelligent content-creation software. My task was to process the entire article base of the Hungarian and English Wikipedia as a corpus. During this work I gained knowledge of language identification, lemmatization, part-of-speech tagging, tf-idf measures, and co-occurrence (collocation).
As a solution I developed a data-processing toolchain based on the UIMA framework, which parses wiki markup, performs lemmatization and part-of-speech tagging (for both Hungarian and English), and stores the analysis results in indices provided by Apache Lucene.
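The stages of such a toolchain can be pictured as a chain of annotators, each reading and enriching a shared document representation (the role UIMA's CAS plays in the real system). The following Python sketch is purely conceptual: the stage names, the tiny markup-stripping regexes, and the toy lemmatizer are illustrative stand-ins, not the actual UIMA components.

```python
# Conceptual sketch of an annotator chain, analogous to a UIMA aggregate
# analysis engine: each stage enriches a shared document dict (standing
# in for UIMA's CAS). All names and rules here are illustrative.
import re

def parse_markup(doc):
    # Strip a tiny subset of wiki markup: [[link|label]] -> label, [[link]] -> link
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]+)\]\]", r"\1", doc["raw"])
    doc["text"] = re.sub(r"'{2,}", "", text)  # drop bold/italic quote runs
    return doc

def tokenize(doc):
    doc["tokens"] = re.findall(r"\w+", doc["text"].lower())
    return doc

def lemmatize(doc):
    # Toy English lemmatizer (strip a plural "s"); the real chain
    # used proper Hungarian and English analyzers.
    doc["lemmas"] = [t[:-1] if t.endswith("s") and len(t) > 3 else t
                     for t in doc["tokens"]]
    return doc

def run_chain(raw, stages=(parse_markup, tokenize, lemmatize)):
    doc = {"raw": raw}
    for stage in stages:
        doc = stage(doc)
    return doc

doc = run_chain("The '''quick''' [[fox|foxes]] jump over [[dogs]].")
```

In the real toolchain each stage is a UIMA analysis engine with its own descriptor, and the chain is assembled declaratively rather than in code.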
The constructed UIMA CPE processing chains were deployed to production servers, and their performance was measured with a multi-threaded processing job.
The indices are built by a batch processing job based on the periodic database dumps made by the Wikimedia Foundation. This is a one-time but lengthy operation. To keep the indexed data up to date, the toolchain incorporates a reader that listens to the IRC channels where modified articles are automatically announced by the Wikipedia servers. Using this information, the reader retrieves the modified articles through the Wikipedia API.
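The update path described above boils down to two steps: extract the article title from an IRC change notification, then fetch that article from the MediaWiki API. A hedged sketch follows; the sample message is illustrative (the real feed on irc.wikimedia.org wraps titles in mIRC color codes, which the regex below strips), and the API parameters shown are one common way to request an article's current wikitext.

```python
# Sketch: extract the changed article title from a Wikimedia
# recent-changes IRC message and build the corresponding API request.
# The sample line below is illustrative, not a verbatim feed message.
import re
from urllib.parse import urlencode

COLOR_CODES = re.compile(r"\x03\d{0,2}|\x02|\x1f")  # mIRC formatting codes
TITLE = re.compile(r"\[\[(.*?)\]\]")

def parse_change(irc_line):
    """Return the article title announced in an IRC change notification."""
    plain = COLOR_CODES.sub("", irc_line)
    match = TITLE.search(plain)
    return match.group(1) if match else None

def api_url(title, lang="en"):
    """Build a MediaWiki API URL requesting the article's current wikitext."""
    query = urlencode({
        "action": "query", "prop": "revisions", "rvprop": "content",
        "format": "json", "titles": title,
    })
    return "https://%s.wikipedia.org/w/api.php?%s" % (lang, query)

sample = "\x0314[[\x0307Budapest\x0314]]\x034 edit summary"
title = parse_change(sample)
```

The actual reader keeps a persistent IRC connection per language channel and hands each retrieved article back into the same processing chain used for the batch job.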
Indexing the results provides fast search capability. The indices also make it possible to compute tf-idf measures for words and multiword expressions. Furthermore, with stored term vectors it is possible to provide a "more like this" document search. This function helps editors by offering category and template names from similar documents as probably good matches for the currently edited article.
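The idea behind the "more like this" suggestion can be sketched independently of Lucene: weight each document's terms by tf-idf, rank documents by the cosine similarity of their term vectors, and offer the category names of the nearest match. The toy corpus, category lists, and the +1 idf smoothing below are illustrative assumptions; the real system relies on Lucene's stored term vectors and its MoreLikeThis-style query mechanism.

```python
# Sketch of tf-idf weighting and a "more like this" lookup: documents
# are compared by cosine similarity of tf-idf vectors, and the nearest
# document's category names are offered as suggestions. The corpus and
# categories are illustrative toy data.
import math
from collections import Counter

def tfidf(tokens, df, n_docs):
    """tf-idf vector; +1 smoothing keeps unseen terms from dividing by zero."""
    counts = Counter(tokens)
    return {t: c * math.log((n_docs + 1) / (df.get(t, 0) + 1))
            for t, c in counts.items()}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def more_like_this(query_tokens, corpus, categories):
    """Return the most similar document and its category names."""
    df = Counter(t for toks in corpus.values() for t in set(toks))
    n = len(corpus)
    q = tfidf(query_tokens, df, n)
    best = max(corpus, key=lambda d: cosine(q, tfidf(corpus[d], df, n)))
    return best, categories[best]

corpus = {
    "Danube": ["river", "europe", "hungary", "water"],
    "Paprika": ["spice", "hungary", "food", "pepper"],
}
categories = {"Danube": ["Rivers of Europe"], "Paprika": ["Hungarian cuisine"]}
doc, cats = more_like_this(["river", "water", "europe"], corpus, categories)
```

Note how the shared term "hungary" contributes nothing to the ranking: it occurs in every document, so its idf (and thus its weight) is near zero, which is exactly why tf-idf favors discriminative terms.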