The number of articles published on the Internet with unrestricted access is growing at an ever-accelerating pace, which makes processing them automatically an increasingly viable option. Unfortunately, only a small fraction of these publications are available in a format suited to such processing; in most cases they can only be downloaded as PDF files.
Since this format was created to present content in a standardized form, it poses no problems for human readers, but it makes programmatic data extraction considerably more difficult. This motivates converting the publications to a format that supports automated access to the data they contain.
In this thesis, I explore the solutions presently available for this problem. Furthermore, I propose a practical solution of my own, which converts PDF articles into XML files conforming to the well-structured Journal Archiving and Interchange Tag Set 1.0 standard.