Linguistic and Semantic Preprocessor for Open Access Scientific Papers

Kovács Ferenc
Department of Automation and Applied Informatics

The number of Open Access (OA) journals and articles has increased rapidly over the last decade. It grew tenfold between 1999 and 2009, and the trend continues.

The full text of OA journals and articles can be freely read and downloaded, because publication is funded through means other than subscriptions. This opens new horizons for applying linguistic, statistical and machine learning techniques to articles. For instance, automated content analysis, trend analysis and sentiment analysis can be performed on more articles than ever before.

My software contributes to these goals by enriching Journal Article Tag Suite (JATS) articles with the following Natural Language Processing (NLP) information: tokens, sentence boundaries, part-of-speech tags and named entities (person, organisation and location). The enriched XML has a well-defined format suitable for further processing and still conforms to JATS.
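To make the idea of enrichment concrete, the following is a minimal sketch of sentence and token annotation. It is an illustration under stated assumptions, not the thesis implementation: it uses the JDK's `java.text.BreakIterator` as a stand-in for Apache OpenNLP, and the class name `EnrichSketch` and the element names `<s>` and `<w>` are hypothetical and not taken from JATS or from the thesis. Part-of-speech tagging and named entity recognition are omitted because they need trained models.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration: sentence and token boundaries via the JDK,
// standing in for the Apache OpenNLP components used in the thesis.
public class EnrichSketch {

    // Collects the non-blank segments produced by a BreakIterator.
    private static List<String> segments(BreakIterator it, String text) {
        List<String> out = new ArrayList<>();
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end).trim();
            if (!piece.isEmpty()) {
                out.add(piece);
            }
        }
        return out;
    }

    // Splits text into sentences.
    static List<String> sentences(String text) {
        return segments(BreakIterator.getSentenceInstance(Locale.ENGLISH), text);
    }

    // Splits one sentence into tokens (words and punctuation marks).
    static List<String> tokens(String sentence) {
        return segments(BreakIterator.getWordInstance(Locale.ENGLISH), sentence);
    }

    // Escapes the characters that are unsafe in XML text content.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Wraps sentences in <s> and tokens in <w>; both element names are assumptions.
    static String annotate(String text) {
        StringBuilder sb = new StringBuilder();
        for (String sentence : sentences(text)) {
            sb.append("<s>");
            for (String token : tokens(sentence)) {
                sb.append("<w>").append(escape(token)).append("</w>");
            }
            sb.append("</s>");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(annotate("Open Access is growing. Tools must keep up."));
    }
}
```

A real pipeline would replace the two `BreakIterator` calls with model-based detectors and attach the additional tag and entity information to each token.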

My thesis introduces the reader to NLP and gives a brief overview of solutions to its standard problems, including those mentioned in the paragraph above. I then survey the currently available and supported software solutions to these problems, with particular attention to Apache OpenNLP, which I used for the implementation. Finally, I present the two iterations of the software development, detailing design, implementation, testing and verification for each.

The software is written in Java and relies on several libraries, most importantly Apache OpenNLP, SZTAKI annotare, Jetty and Apache Commons. It communicates through an HTTP API. Besides functional testing, I also ran load tests in each iteration to measure performance, which gives the reader a complete picture of the system.

