The amount of unstructured textual data grew exponentially in the twentieth century, yet efficient processing of this data became feasible only in the last few years. The science of ‘text mining’ provides the practical tools and the theoretical background for information extraction and retrieval from textual data. Mining recorded speech first requires converting the spoken form into a written representation by machine speech recognition, which is itself a recent technology.
My thesis lies in the interdisciplinary field of speech recognition and text mining: text mining has been performed on different types of recognized speech records.
First, the document introduces the basics of the applied technologies in order to familiarize the reader with the key concepts. Chapter 3.1 introduces the tools used, such as the SPSS Statistics System by IBM and Clemtext 2.0 developed by Clementine Consulting. Chapter 4 introduces the process of text mining, with emphasis on the data extraction and reduction tasks. After preprocessing and dictionary building, the lexical classification of the words is performed; pattern recognition and category building are then applied. In Chapter 5, a comparative analysis is performed on the text mining results obtained from various input data, such as parliamentary speech transcriptions produced by human and by machine effort.
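
The preprocessing and dictionary-building steps mentioned above can be illustrated with a minimal Python sketch. It is not the SPSS or Clemtext workflow used in the thesis; the stop-word list, the function names `preprocess` and `build_dictionary`, and the two sample sentences are all hypothetical and serve only to show the idea of reducing a transcript to a term dictionary and comparing a human and a machine transcript.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real study would use a fuller,
# language-specific list.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}


def preprocess(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def build_dictionary(documents):
    """Count how often each remaining term occurs across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(preprocess(doc))
    return counts


# Made-up sample sentences standing in for a human transcript and a
# machine-recognized transcript of the same utterance.
human = ["The minister presented the budget to the parliament."]
machine = ["The minister prevented the budget to the parliament."]

human_dict = build_dictionary(human)
machine_dict = build_dictionary(machine)

# Terms found in both dictionaries, and terms the recognizer missed.
shared_terms = set(human_dict) & set(machine_dict)
missed_terms = set(human_dict) - set(machine_dict)
```

In this toy case a single recognition error ("prevented" for "presented") removes one term from the shared dictionary, which is the kind of difference a comparison of human and machine transcripts has to account for.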