Automatic Recognition of Spontaneous Speech

Supervisor:
Dr. Szaszák György József
Department of Telecommunications and Media Informatics

Speech recognition is one of the most rapidly advancing fields of information technology. Many applications already include speech recognition features, and the most mature ones rely on recognition entirely. General and domain-specific (e.g. medical, legal) dictation systems and voice-controlled GPS units in cars illustrate how speech technology is becoming more and more widespread in everyday life.

This semester I carried out speech recognition experiments using a medium-vocabulary Hungarian speech recognizer (the so-called MKBF), which was developed at the Laboratory of Speech Acoustics at the Budapest University of Technology and Economics. First, speech and text databases had to be prepared; the resources for these were provided by the Department of General and Applied Linguistics at the University of Debrecen. Within the HuComTech project, which the department initiated, (simulated) interviews with some 120 people, as well as informal discussions, were recorded, and the subjects also read phonetically rich sentences. The "spontaneous speech" in the title refers to the interview recordings; my aim was to have these varied spoken utterances recognized by the MKBF software.

My work this semester can be divided into two parts: a training phase and a testing phase. First, acoustic and language models had to be trained from the databases. For this, I used the Praat acoustic analysis software, laboratory applications, and my own scripts for processing large amounts of data in batch mode.

Acoustic model training produced 36 phoneme models (one Markov model per phoneme), which contain statistical parameters calculated from the recordings. To train the language model, I created a dictionary of nearly 5000 words and collected the interviewees' utterances into a text file. The resulting bigram model contains the probabilities of word pairs, which makes recognition more precise because it takes into account which words follow each other more or less often in the language.
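The idea behind the bigram model can be illustrated with a minimal Python sketch. It is not the tool used to train the MKBF language model: the function name `train_bigram`, the sentence boundary markers `<s>` and `</s>`, and the unsmoothed relative-frequency estimate are all assumptions made for illustration.

```python
from collections import Counter


def train_bigram(sentences):
    """Estimate P(next_word | word) by relative frequency (no smoothing).

    Hypothetical helper for illustration; a real recognizer would
    normally apply smoothing so that unseen word pairs keep some
    probability mass.
    """
    history_counts = Counter()  # how often each word starts a pair
    pair_counts = Counter()     # how often each (word, next_word) pair occurs
    for sentence in sentences:
        # Assumed boundary markers so sentence starts and ends are modelled too.
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        history_counts.update(words[:-1])
        pair_counts.update(zip(words[:-1], words[1:]))
    return {
        pair: count / history_counts[pair[0]]
        for pair, count in pair_counts.items()
    }


# Two toy utterances standing in for the collected interview text.
model = train_bigram(["jó napot kívánok", "jó estét kívánok"])
print(model[("jó", "napot")])
```

In this toy corpus "jó" is followed by "napot" in one of its two occurrences, so the estimated probability of that pair is one half; the recognizer can use such values to prefer word sequences that are common in the training text.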

The training was followed by the testing phase, during which I examined word accuracy values and also gathered results on phoneme recognition. Because the initial phoneme recognition rate was only 20-30% and the resulting word accuracy only 0-10%, I made new recordings (one interview sentence per recording, 21 recordings per person) and also trained a new acoustic model on the MRBA Hungarian database, which contains better-quality utterances. With the original language model, I obtained word accuracy values of 50-60%; after training a new language model built from only a limited number of sentences, I obtained values of 85-95%.
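The abstract does not state how word accuracy was computed. A common definition, assumed here, is (N - S - D - I) / N, where N is the number of words in the reference transcript and S, D and I are the substitutions, deletions and insertions in the recognizer output. The sketch below uses the word-level edit distance as the total error count S + D + I; the function name `word_accuracy` is hypothetical.

```python
def word_accuracy(reference, hypothesis):
    """Word accuracy under the assumed definition (N - S - D - I) / N.

    The minimum number of substitutions, deletions and insertions is
    obtained as the word-level Levenshtein distance.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr

    errors = prev[-1]
    return (len(ref) - errors) / len(ref)


# One substituted word out of four reference words.
print(word_accuracy("a b c d", "a x c d"))
```

Because insertions count as errors, this measure can drop to zero or below when the recognizer outputs many spurious words, which is one way very low word accuracy values can arise even when some words are recognized correctly.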
