Developing a language detection library

Ács Judit
Department of Automation and Applied Informatics

Access to online content in many languages is increasingly important: more than 50% of the content is not in English, and more than 75% of users do not speak English. Language detection is very useful in machine translation and improves the efficiency of procedures that support multilingual content. Classical dictionary-based procedures struggle with informal texts and with morphologically rich languages. Statistical procedures are therefore receiving more and more attention, and they play a central role in this thesis as well. To detect a language effectively, we first need to model it.

Language modeling seeks to answer the following question: which word will come n-th, following the (n-1)-th word? This is of great importance in many modern applications. Guessing the next word matters in speech and handwriting recognition and in augmentative communication. To detect spelling errors, we need to predict the next letter. Augmentative communication systems help people with speech difficulties: those who can hardly speak or cannot speak at all, and who cannot use sign language either. A well-known example is the physicist Stephen Hawking. The communication system he uses predicts the needed characters and words, so he has to type less than 20% of the actual text.
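
The next-word question above can be illustrated with a very small statistical model. The following Python sketch is only an illustration under simple assumptions, not the method developed in the thesis: it counts which word follows which in a toy corpus (a bigram model) and predicts the most frequent follower. The function names train_bigram_model and predict_next and the sample corpus are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigram_model(text):
    """Count, for every word, which words follow it in the training text."""
    words = text.lower().split()
    followers = defaultdict(Counter)
    for previous, current in zip(words, words[1:]):
        followers[previous][current] += 1
    return followers


def predict_next(followers, previous_word):
    """Return the most frequent follower of previous_word, or None if unseen."""
    candidates = followers.get(previous_word.lower())
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]


# Toy corpus, invented for illustration only.
corpus = "the cat sat on the mat and the cat slept"
model = train_bigram_model(corpus)
print(predict_next(model, "the"))
```

The same counting idea carries over to letters instead of words, which is the setting mentioned above for spelling errors. A realistic model would also need a much larger corpus and some form of smoothing for word pairs that never occur in the training text.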

This thesis deals with both of the subjects above: language detection and language modeling.

