Syntactic and Semantic Analysis of Text Based Representations in Agglutinating Hungarian

Dr. Szaszák György József
Department of Telecommunications and Media Informatics

In Natural Language Processing, there is a need for word representations which capture not only the form but also the semantics of a particular word, and which are processable by computers. According to the distributional hypothesis, the meaning of a word can be determined by looking at the words that frequently occur in its context. Embedding-based methods, among others, make use of this idea. Embeddings are relatively low-dimensional vectors, generated using deep neural networks. For English text, vectors are usually generated separately for each word form. Hungarian is a morphologically very rich, free word order language, so generating an embedding vector for each token present in a text may not be as efficient and usable as it is for English. Generating embeddings for smaller units – morphemes or morphs – is possible, but doing so may negatively affect the semantic and syntactic performance of the models.
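The distributional hypothesis mentioned above can be illustrated with a minimal, hand-made sketch (not the neural models used in the thesis): counting each word's context words over a toy corpus and comparing the resulting count vectors with cosine similarity. The corpus, window size, and vocabulary below are assumptions for illustration only.

```python
# Toy illustration of the distributional hypothesis: words appearing in
# similar contexts receive similar co-occurrence count vectors.
from collections import Counter
from math import sqrt

corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the dog",
]

window = 1  # symmetric context window of one word (assumed for this sketch)
contexts = {}
for sentence in corpus:
    tokens = sentence.split()
    for i, w in enumerate(tokens):
        ctx = contexts.setdefault(w, Counter())
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                ctx[tokens[j]] += 1

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b.get(k, 0) for k in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

# "cat" and "dog" share contexts ("the", "drinks"), so their vectors are
# much more similar than, say, "cat" and "milk".
print(cosine(contexts["cat"], contexts["dog"]))
```

Real embedding models replace these raw counts with dense vectors learned by a neural network, but the underlying signal – shared contexts – is the same.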

In my thesis, I compare the semantic and syntactic performance of three kinds of embeddings: embeddings trained on a news dataset, embeddings trained on a corpus containing only the dictionary forms of the words, and embeddings trained on a corpus in which the words are partly morphologically segmented. The models are evaluated using a Hungarian test set, which I created by translating the contents of the Google Analogy Test Set.
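The analogy-based evaluation works by answering questions of the form "a is to b as c is to ?" with vector arithmetic. A minimal sketch of this scoring step is shown below; the two-dimensional toy vectors are hand-made assumptions, not trained embeddings from the thesis.

```python
# Minimal sketch of scoring one analogy question against embedding vectors.
from math import sqrt

emb = {
    # two toy dimensions, roughly "royalty" and "gender" (assumed values)
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "apple": [-1.0, 0.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def analogy(a, b, c):
    """Answer 'a : b :: c : ?' via the vector offset b - a + c,
    excluding the three query words from the candidate answers."""
    target = [vb - va + vc for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

print(analogy("man", "woman", "king"))  # → "queen" with these toy vectors
```

A model's score on the test set is simply the fraction of such questions it answers correctly; for a morphologically rich language like Hungarian, the choice of token unit (word form, lemma, or morph) directly affects which questions the model can get right.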

