Converting Open Access papers into JATS XML format

OData support
Dr. Lengyel László
Department of Automation and Applied Informatics

The number of articles published on the Internet under open access is growing at an ever-accelerating pace, which makes processing them automatically an increasingly viable option. Unfortunately, only a small fraction of these publications is available in a format suited to such processing; in most cases they can only be downloaded as PDF files.

Because PDF was created to present content in a standardized form, it poses no problems for human readers, but it makes programmatic data extraction considerably more difficult. This motivates converting the publications into a format that supports automated access to the data they contain.

In this thesis, I explore the solutions currently available for this problem. I also propose a practical solution of my own: converting PDF articles into XML files that conform to the well-structured Journal Archiving and Interchange Tag Set 1.0 standard.


