Building a global dictionary for semantic technologies

Recski Gábor András
Department of Automation and Applied Informatics

Computer-driven natural language processing plays an increasingly important role in our everyday life. In our digital world, using natural language for human-machine communication has become a basic requirement. Meeting this requirement makes the semantic analysis of human languages essential.

Nowadays, state-of-the-art systems represent word meaning with high-dimensional vectors, i.e. word embeddings. Within the field of computational semantics, a new research direction focuses on finding mappings between the embeddings of different languages (Mikolov et al., 2013b; Smith et al., 2017; Conneau et al., 2017).

This thesis work proposes a novel method for finding linear mappings between word vectors of various languages. In contrast to previous approaches, this method does not learn translation matrices between two specific languages, but between a given language and a shared, universal space. As input, the system requires pre-trained word embeddings and a word translation dictionary for the given languages.
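The idea of mapping every language into one shared space can be sketched as follows: for each translation pair, penalize the distance between the two mapped vectors and update the per-language matrices by gradient descent. This is a minimal illustrative sketch, not the thesis's actual training procedure; the function name, the SGD update, and the choice of fixing one language's map to the identity (to anchor the shared space and rule out the trivial all-zero solution) are assumptions.

```python
import numpy as np

def train_shared_mappings(embeddings, pairs, anchor, epochs=200, lr=0.05):
    """Sketch: learn one linear map per language into a shared space.

    embeddings: dict lang -> (n_words, dim) array of word vectors
    pairs:      list of ((lang_a, i), (lang_b, j)) translation pairs
    anchor:     language whose map stays the identity, pinning the
                shared space (an assumption made for this sketch)
    """
    dim = next(iter(embeddings.values())).shape[1]
    maps = {lang: np.eye(dim) for lang in embeddings}
    for _ in range(epochs):
        for (la, i), (lb, j) in pairs:
            x, y = embeddings[la][i], embeddings[lb][j]
            # mismatch of the two translations in the shared space
            diff = maps[la] @ x - maps[lb] @ y
            # SGD step on ||W_a x - W_b y||^2 for both maps
            if la != anchor:
                maps[la] -= lr * np.outer(diff, x)
            if lb != anchor:
                maps[lb] += lr * np.outer(diff, y)
    return maps
```

Because every language is mapped into the same space, adding a new language only requires dictionaries linking it to any already-covered language, not to every other language separately.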

For the experiments, the fastText embeddings (Conneau et al., 2017) were used. First, the system was trained on two languages using two different training sets: Dinu's English-Italian benchmark data (Dinu et al., 2014), and English-Italian translation pairs extracted from the PanLex database (Kamholz et al., 2014). Thereafter, the system was trained on three languages (English, Italian, and Spanish) at the same time, using multilingual translation pairs extracted from the PanLex database.

With the best setting, the system performs significantly better on English-Italian than the baseline system of Mikolov et al. (2013b), and its performance is comparable to the more sophisticated systems of Faruqui and Dyer (2014) and Dinu et al. (2014). Current state-of-the-art systems, however, still outperform the proposed one by a wide margin. Training the system on three languages at the same time gives worse results than training it on the languages pairwise. By exploiting the richness of the PanLex database, the proposed method makes it possible to learn linear mappings between arbitrary language pairs.
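Once per-language maps into the shared space exist, translating between any pair of languages reduces to nearest-neighbour search: map the source vector into the shared space and return the target word whose mapped vector is most similar. The sketch below uses cosine similarity; the function name and interface are illustrative assumptions, not taken from the thesis.

```python
import numpy as np

def nearest_translation(x, W_src, W_tgt, tgt_matrix, tgt_words):
    """Sketch: translate one source vector via the shared space.

    x:          (dim,) source-language word vector
    W_src:      (dim, dim) map from the source language to the shared space
    W_tgt:      (dim, dim) map from the target language to the shared space
    tgt_matrix: (n_words, dim) target-language embedding matrix
    tgt_words:  list of n_words target-language words
    """
    q = W_src @ x                      # source vector in the shared space
    T = tgt_matrix @ W_tgt.T           # all target vectors in the shared space
    sims = (T @ q) / (np.linalg.norm(T, axis=1) * np.linalg.norm(q) + 1e-9)
    return tgt_words[int(np.argmax(sims))]
```

With n languages, this scheme needs only n maps rather than one matrix per ordered language pair, which is what allows arbitrary pairs to be served from the same trained model.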

