Intelligent multilingual dictionary building

Dr. Csorba Kristóf
Department of Automation and Applied Informatics

Vocabularies of human languages differ widely, but linguists, psychologists, cognitive scientists, and AI researchers have all made efforts to identify a common core vocabulary - see Ogden (1923); West (1953); Swadesh (1955); Schank (1972); Wierzbicka (1996); Goddard (2002); Boguraev et al. (1989); Mitchell (2008). A basic semantic dictionary unifying all these and similar efforts, containing some 3,000 items, was created as part of András Kornai's Semantic language technologies OTKA research, which provides the theoretical background and the institutional support for the intelligent multilingual dictionary building work proposed in this thesis.

For now, the semantic dictionary has bindings for four languages: English, Hungarian, Polish, and Latin. The aim of the work is to extend this to at least forty languages. Since not all language processing tools, such as stemmers, spellcheckers, and morphological analyzers, are available for every language, we took the top 50 languages of the world by adjusted Wikipedia size (see Kornai 2012) to obtain a comparable list of 40.

The languages considered are: Arabic, Azerbaijani, Basque, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Finnish, French, Galician, Georgian, German, Greek, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Kazakh, Korean, Latin, Limburgish, Lithuanian, Macedonian, Malagasy, Malay, Norwegian, Occitan, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Thai, Turkish, Ukrainian, Vietnamese.

An initial database, populated from Wiktionary, is unlikely to have more than 30-50% coverage. Intelligent dictionary building encompasses two related sets of techniques: (i) those required for automatically identifying translation candidates and (ii) those required for verifying these. State-of-the-art dictionary extraction (Melamed 2000; Saralegi et al. 2012) takes either parallel texts or machine-readable dictionaries as input, but for the problem at hand neither of these resources is available at the requisite depth for going beyond Wiktionary. We have created our own parallel corpora (based on widely translated documents such as the Bible or the Book of Mormon, see Halácsy et al. 2008) and exploit the cross-linking structure of Wikipedias to obtain near-parallel texts.
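The cross-linking idea above can be illustrated with a small sketch: given interlanguage link data (which a real pipeline would extract from Wikipedia dumps), article title pairs become translation candidates, and requiring the link to be reciprocal serves as a cheap first verification step. All data and names below are hypothetical stand-ins, not the actual system.

```python
# Sketch: mining translation candidates from Wikipedia interlanguage links.
# INTERLANGUAGE_LINKS and REVERSE_LINKS are hypothetical, hard-coded stand-ins
# for link tables extracted from Wikipedia dumps.

# English article title -> {language code: foreign article title}
INTERLANGUAGE_LINKS = {
    "dog":   {"hu": "kutya", "pl": "pies", "la": "canis"},
    "water": {"hu": "víz",   "pl": "woda"},
    "house": {"hu": "ház",   "pl": "dom",  "la": "domus"},
}

# (language code, foreign title) -> English title, from the foreign Wikipedias;
# used to check that the interlanguage link is reciprocal.
REVERSE_LINKS = {
    ("hu", "kutya"): "dog",
    ("pl", "pies"):  "dog",
    ("la", "canis"): "dog",
    ("hu", "víz"):   "water",
    ("pl", "woda"):  "water",
    ("hu", "ház"):   "house",
    ("pl", "dom"):   "house",
    # ("la", "domus") deliberately absent: a one-way link, rejected below.
}

def translation_candidates(target_lang):
    """Yield (english, foreign) pairs whose interlanguage links are reciprocal."""
    for en_title, links in INTERLANGUAGE_LINKS.items():
        foreign = links.get(target_lang)
        if foreign is None:
            continue
        # Keep the candidate only if the foreign article links back to the
        # same English article - a simple symmetry-based verification.
        if REVERSE_LINKS.get((target_lang, foreign)) == en_title:
            yield en_title, foreign

print(sorted(translation_candidates("hu")))
# [('dog', 'kutya'), ('house', 'ház'), ('water', 'víz')]
print(sorted(translation_candidates("la")))
# [('dog', 'canis')]  - 'domus' dropped for lack of a reciprocal link
```

In practice the reciprocity check is only one weak signal; the thesis's verification step would combine it with evidence from the parallel and near-parallel corpora mentioned above.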
