Automatic Speech Recognition (ASR) is a field with more than half a century of continual innovation behind it. Recently, recurrent neural networks have joined the set of cutting-edge techniques used by ASR systems.
The aim of my thesis was to apply recent breakthroughs in ASR to a Hungarian corpus. Starting from an existing feed-forward neural network that represented the currently used architecture, I set out to improve the system's performance by introducing recurrent neural networks.
During my work on this thesis, I determined that by introducing a TDNN-LSTM architecture on top of the classical system, the Word Error Rate can be improved by 3.32% compared to the feed-forward network. Furthermore, this does not increase training times inordinately, and the new system even outperforms the original in terms of decoding speed.
Because the classical system affects the neural network's final performance, hyperparameter searches were carried out to determine the classical system's optimal configuration; these confirmed that the default values used by the Kaldi recipes were already optimal. Hyperparameter searches over the recurrent neural network show that the optimal size is a smaller network with 192 neurons per layer, as opposed to the much larger layer sizes reported in the international literature. The reason for this is the smaller size of the training corpora available for Hungarian. Thus, it is shown that even on a small corpus, the introduction of recurrent networks yields a significant improvement.
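To make the TDNN part of the architecture concrete, the sketch below implements a single time-delay layer in plain NumPy: each output frame is computed from a spliced window of neighboring input frames followed by an affine transform and a nonlinearity. The 192-unit layer width matches the optimum reported above; the context offsets (-1, 0, +1), the 40-dimensional input features, and the ReLU nonlinearity are illustrative assumptions, not details taken from the thesis or from the actual Kaldi recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

def tdnn_layer(frames, weights, bias, context=(-1, 0, 1)):
    """One TDNN (time-delay) layer: splice context frames, then affine + ReLU.

    frames:  (T, D) matrix of T feature frames of dimension D
    weights: (len(context) * D, H) weight matrix
    bias:    (H,) bias vector
    Returns a (T - span + 1, H) matrix of hidden activations, one row per
    frame position where the full temporal context is available.
    """
    T, D = frames.shape
    lo, hi = min(context), max(context)
    # Splice: for each valid center frame t, concatenate frames[t + c]
    # for every context offset c into one long vector.
    spliced = np.stack(
        [np.concatenate([frames[t + c] for c in context])
         for t in range(-lo, T - hi)]
    )
    return np.maximum(spliced @ weights + bias, 0.0)  # ReLU

# Toy example: 10 frames of 40-dimensional features, 192 hidden units
T, D, H = 10, 40, 192
feats = rng.standard_normal((T, D))
W = rng.standard_normal((3 * D, H)) * 0.05
b = np.zeros(H)

hidden = tdnn_layer(feats, W, b)
print(hidden.shape)  # (8, 192): two edge frames lack full context
```

Stacking several such layers with widening context offsets, and interleaving LSTM layers between them, yields the TDNN-LSTM structure referred to above.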