Hungarian morphological disambiguation using sequence-to-sequence neural networks

Supervisor:
Ács Judit
Department of Automation and Applied Informatics

Hungarian is an agglutinative language: it expresses grammatical meaning by changing the form of words. This is done by attaching morphemes, the smallest meaningful units of a language, to one another (for example, when a word is inflected). As a consequence, Hungarian words have a very rich and complex morphology, which can be represented by morphological analyses. Moreover, a word form is often ambiguous, and its intended analysis can only be determined from its context.

For instance, in a sentence of 20 words where each word has 2 morphological analyses on average, there are 2 raised to the power of 20 (that is, 1 048 576) possible combinations of morphological analyses altogether.
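The count above can be checked with a minimal Python sketch. The candidate lists and the helper `count_combinations` are hypothetical placeholders chosen for illustration; they are not taken from the thesis or from any real morphological analyser.

```python
from itertools import product
from math import prod

# Hypothetical toy sentence: each word maps to its candidate analyses.
# The labels are placeholders, not the output of a real analyser.
candidates = [
    ["A1", "A2"],
    ["B1", "B2"],
    ["C1", "C2"],
]


def count_combinations(lattice):
    """Number of sentence-level readings: product of per-word candidate counts."""
    return prod(len(options) for options in lattice)


# Enumerating every reading explicitly is only feasible for short sentences.
all_readings = list(product(*candidates))
assert len(all_readings) == count_combinations(candidates)  # 2 * 2 * 2 = 8

# The example from the text: 20 words with 2 analyses each.
sentence_of_20 = [["x", "y"]] * 20
print(count_combinations(sentence_of_20))  # 1048576
```

Because the number of readings grows exponentially with sentence length, listing them all quickly becomes impractical.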

Given the above, a morphological disambiguator chooses, for each word in a sentence, the morphological analysis that fits the given context. In the previous example, this means selecting the one correct combination of analyses out of the 1 048 576. In our multicultural world, for instance, it can make a considerable difference whether a computer picks the correct meaning of each word when translating from one language to another or when summarising a longer text. Consequently, such a tool plays a significant role in many Natural Language Processing (NLP) tasks.

Deep learning approaches have been successfully applied to NLP tasks and have achieved state-of-the-art results. Nonetheless, their applicability to morphological disambiguation has not yet been thoroughly investigated. Since natural language consists of variable-length words and sentences, sequence-to-sequence neural networks might be a promising choice.

In my thesis, I present and publish an open-source morphological disambiguator for Hungarian, implemented so that it can easily be extended to other languages. I evaluate it with highly configurable, fully-fledged sequence-to-sequence neural networks operating at the character and morpheme level. I also use sophisticated data visualisation algorithms to make the results as human-readable as possible.
