Diacritic restoration in Hungarian

Ács Judit
Department of Automation and Applied Informatics

In the early days of computing, keyboards contained only English characters. Keyboard layouts have evolved greatly since then, but application-level support for non-English characters has always lagged behind. A great deal of legacy data is stored without diacritics, and text without diacritics is still being produced today.

The spread of personal devices, such as mobile phones, has caused the same process to take place again. Typing text with proper diacritics on smartphones and tablets is still laborious and time-consuming, so there is a real need for converting text without diacritics into text with diacritics. In addition, people who work with a keyboard layout that differs from that of their native language often avoid diacritics altogether. For example, programmers tend to use an English keyboard layout to work faster, and so they write emails to each other without diacritics.

There are several approaches to automated diacritic insertion, and several attempts have been made worldwide, although few have been made for the Hungarian language. This thesis builds on Gépi ékezés (Andras Kornai, 1997).

This thesis first presents statistics on the Hungarian language in comparison with other, similar languages, along with the typical metrics used for counting errors. It then presents statistical machine translation (SMT) methods, such as the dictionary-based and grapheme-based diacritic restoration methods, together with their implementations and results. Finally, it compares the methods and discusses their advantages and disadvantages.


