Fundamental frequency, also known as pitch, is among the most important features of human speech. Modern, state-of-the-art Text-to-Speech solutions still struggle to model fundamental frequency properly when trying to make synthesized speech sound more natural, more diverse, and thus more acceptable to human listeners.
I reviewed several available pitch extraction algorithms and speech modeling techniques. I manually classified the content of a speech corpus by perceived speech dynamism, then performed statistical analysis and evaluation of the results.
I modeled the pitch of declarative sentences in human speech using modern machine learning algorithms, choosing recurrent Long Short-Term Memory-based deep neural networks for this purpose. Before modeling speech parameters, I studied neural networks and evaluated these structures and methods by testing them in practice.
After successfully modeling pitch sequences, I combined the model with my previous research on perceived speech dynamism. The proposed method makes it possible to choose a desired level of perceived speech dynamism for pitch sequence generation. Besides the Long Short-Term Memory-based deep neural network, this method also uses Random Forest classification.
Finally, I performed objective and subjective evaluations of the model.