Text preparation and classification in dialogs

Dr. Szűcs Gábor
Department of Telecommunications and Media Informatics

On any mailing list devoted to assistance or support, the same issues recur regularly, though with different wording. The purpose of the presented software is to find the most likely answer to a new question using text processing and classification methods.

The document presents the alternatives investigated for each task. I detail the steps of text preparation: decomposing the text into tokens (in this case, words), stop word filtering, and stemming. Tokenization is followed by stop word filtering, which deletes the common words that occur in every document. The next step is stemming the words relevant to the analysis, for which I used the Light2 version of the Hungarian Tordai stemming algorithm and the hunmorph database.
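
As an illustration only, the three preparation steps could be sketched in Python as below. This is not the thesis implementation: the stop word list and the suffix list are tiny hypothetical examples, and `toy_stem` is a naive suffix stripper that stands in for a real stemmer. It does not reproduce the rules of Tordai Light2 or use hunmorph.

```python
import re

# Hypothetical, very short stop word list; a real list would be much longer.
STOP_WORDS = {"a", "az", "és", "hogy", "nem", "is", "egy"}

# Hypothetical suffixes, ordered longest first. This is NOT the Tordai
# Light2 rule set, only a placeholder to show where stemming happens.
SUFFIXES = ("okat", "eket", "ban", "ben", "nak", "nek", "ok", "ek", "t")


def tokenize(text):
    """Lowercase the text and split it into runs of letters (words)."""
    # [^\W\d_] matches any Unicode letter, so accented characters are kept.
    return re.findall(r"[^\W\d_]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def toy_stem(token):
    """Strip the first matching suffix if at least 3 characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, filter stop words, then stem each remaining token."""
    return [toy_stem(t) for t in remove_stop_words(tokenize(text))]


print(preprocess("A levelek nem érkeznek meg a listára."))
```

The order of the steps matters in this sketch: stop words are removed before stemming, so the stop word list can contain surface forms rather than stems.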

The developed system is suitable for classifying text and for providing the appropriate response. For classification I use the Naive Bayes classifier, which is based on the Bayesian decision rule, with Laplace smoothing.
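
To make the classification step concrete, a minimal sketch of a multinomial Naive Bayes classifier with Laplace (add-one) smoothing follows. It is an assumption-laden illustration, not the system described in the thesis: the class name `NaiveBayes`, the `alpha` parameter, the training tokens, and the labels `"password"` and `"install"` are all made up for the example.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over token lists with additive smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # alpha = 1 is Laplace smoothing
        self.class_docs = Counter()              # documents seen per class
        self.word_counts = defaultdict(Counter)  # token counts per class
        self.vocab = set()                       # all tokens seen in training

    def fit(self, docs, labels):
        for tokens, label in zip(docs, labels):
            self.class_docs[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def log_scores(self, tokens):
        """Return log P(class) + sum of log P(token | class) for each class."""
        total_docs = sum(self.class_docs.values())
        vocab_size = len(self.vocab)
        scores = {}
        for label, n_docs in self.class_docs.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n_docs / total_docs)
            for token in tokens:
                if token not in self.vocab:
                    continue  # tokens never seen in training are ignored
                count = self.word_counts[label][token]
                # Smoothing keeps the probability non-zero even if count == 0.
                score += math.log(
                    (count + self.alpha) / (total_words + self.alpha * vocab_size)
                )
            scores[label] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)


# Made-up, already preprocessed training questions and their answer classes.
docs = [
    ["jelszó", "elfelejt"],
    ["jelszó", "csere"],
    ["telepítés", "hiba"],
    ["telepítés", "linux"],
]
labels = ["password", "password", "install", "install"]

model = NaiveBayes().fit(docs, labels)
print(model.predict(["jelszó", "hiba"]))
```

Without smoothing, the token "hiba" would have zero probability in the "password" class and the whole product would collapse to zero; with add-one smoothing both classes receive a finite score and can still be compared.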

The classification and text processing problem fits well with the creation of a text-based emotion classifier component for the VIRCA system, in which spoken text, after speech-to-text recognition, is to be classified into the six basic emotions: anger, sadness, joy, fear, surprise, and disgust. This paper presents the framework for VIRCA and the finished components in operation.

In the rest of the thesis I introduce the technologies of the implemented software’s environment, present the operation of the system, and describe how the dictionaries used in the classification were created. Using the created dictionaries and the classifier, I show that a substantial number of issues can be classified into the corresponding classes, and thus the accuracy of the automated question answering is satisfactory.

