In recent years, research in text-to-speech synthesis has shifted steadily from unit selection systems to statistical parametric speech synthesis. At the same time, deep neural networks (DNNs) have gained unprecedented popularity and have revolutionized several other fields of science. In this thesis, drawing on Hungarian and international scientific literature, a deep neural network based text-to-speech (TTS) system is presented and described in detail.
In this approach, the relationship between the phonetic transcription of the input text and the speech parameters is modelled by a DNN. The first step towards an efficient DNN model was to create a training database. The database was built on previous solutions and speech corpora of the Speech Technology and Smart Interactions Laboratory, Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, and it was used to train several state-of-the-art deep neural network architectures. Finding the best network architecture and hyperparameters for our purposes, and transforming the training database into a representation well suited to training, raised several theoretical questions, some of them outside the field of TTS. Both the theoretical and the implementation aspects are discussed in detail in this thesis. The primary speech parameters under consideration are the fundamental frequency of voiced speech (f0) and the spectral parameters of speech. To make the search for better solutions more efficient, a hyperparameter optimization method based on a stochastic process was also implemented; this method is described in the thesis as well.
During my research, considerable effort went into reducing the performance-degrading effect of inaccurate prosodic stress estimation. This part of the research was also presented as a lecture and published as a conference paper, authored by my thesis advisors and me, at the SPECOM 2016 conference.
The goal of this thesis is to show that, given a near-optimal setup of hyperparameters, improvements can be achieved over today’s competing TTS technologies.