In Natural Language Processing, there is a need for word representations that capture not only the form but also the semantics of a particular word, and that computers can process. According to the distributional hypothesis, the meaning of a word can be determined by looking at the words that frequently occur in its context. Embedding-based methods, among others, build on this idea. Embeddings are relatively low-dimensional vectors, generated using neural networks. For English text, vectors are usually generated separately for each word form. Hungarian, however, is a morphologically very rich, free-word-order language, so generating an embedding vector for each token in a text may not be as efficient and usable as it is for English. Generating embeddings for smaller units (morphemes or morphs) is possible, but doing so may negatively affect the semantic and syntactic performance of the models.
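The vocabulary problem described above can be illustrated with a small sketch: the same Hungarian lemma surfaces in many inflected forms, so a surface-form vocabulary grows much faster than a morph vocabulary. The segmentations below are hand-written toy examples, not the output of any particular morphological analyzer used in the thesis.

```python
# Toy illustration: inflected forms of 'ház' ('house') and a
# hand-written morphological segmentation for each form.
forms_of_haz = {
    "ház":      ["ház"],               # 'house' (nominative)
    "házak":    ["ház", "ak"],         # 'houses' (plural)
    "házban":   ["ház", "ban"],        # 'in a house' (inessive)
    "házakban": ["ház", "ak", "ban"],  # 'in houses'
    "házamban": ["ház", "am", "ban"],  # 'in my house'
}

# Training on raw tokens needs one vector per surface form ...
surface_vocab = set(forms_of_haz)
# ... while training on morphs shares vectors across forms.
morph_vocab = {m for segs in forms_of_haz.values() for m in segs}

print(len(surface_vocab))  # 5 distinct surface forms
print(len(morph_vocab))    # built from only 4 distinct morphs
```

On a full corpus the gap is far larger, since each Hungarian noun can take hundreds of inflected forms.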
In my thesis, I compare the semantic and syntactic performance of three kinds of embeddings: embeddings trained on a news dataset, embeddings trained on a corpus containing only the dictionary forms (lemmas) of the words, and embeddings trained on a corpus in which the words are partly morphologically segmented. The models are evaluated on Hungarian test questions that I created by translating the contents of the Google Analogy Test Set.
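The analogy-based evaluation works as follows: for a test quadruple (a, b, c, d), the vector b − a + c should be closest, by cosine similarity, to the vector of d. A minimal sketch with hand-picked 3-dimensional toy vectors (for illustration only, not the trained embeddings of the thesis):

```python
import math

# Hand-picked toy vectors; real embeddings have hundreds of dimensions.
emb = {
    "férfi":    [1.0, 0.0, 0.0],   # 'man'
    "nő":       [1.0, 1.0, 0.0],   # 'woman'
    "király":   [1.0, 0.0, 1.0],   # 'king'
    "királynő": [1.0, 1.0, 1.0],   # 'queen'
}

def cosine(u, v):
    """Cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def solve_analogy(a, b, c):
    """Return the word whose vector is most similar to b - a + c."""
    target = [emb[b][i] - emb[a][i] + emb[c][i] for i in range(3)]
    # The three query words are excluded, as is standard in this test.
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

print(solve_analogy("férfi", "nő", "király"))  # → királynő
```

An analogy question counts as correct when the predicted word matches the expected fourth word; accuracy over all questions gives the semantic or syntactic score of a model.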