The goal of this thesis is to propose a neural-network-based vocoder that can compete with today's state-of-the-art vocoders. Low-frequency components of speech are produced in the glottis, while high-frequency components are shaped by the vocal tract. This observation lets us split speech synthesis into two parts: generate the low-frequency speech with a current state-of-the-art speech synthesizer, then upsample this audio with a simulated vocal tract. Current state-of-the-art speech synthesizers are parallel, neural-network-based models, and running them requires a costly GPU server. The simulated vocal tract, by contrast, can easily run on low-performance CPUs. Because only the low-frequency speech must be predicted, a smaller neural network suffices, which lowers the load on the GPU server. The low-frequency speech can then be upsampled on the client side, for example on embedded systems such as mobile phones or home assistant devices. Our proposed model uses less computational power while maintaining acceptable speech quality.
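The client-side stage of this pipeline can be sketched as follows. This is a minimal illustration, not the proposed model: it uses plain linear interpolation in place of the simulated vocal tract, and the sample rates (8 kHz low band, 16 kHz output) are hypothetical choices for the example. A real vocal-tract model would additionally synthesize the missing high-frequency content rather than merely raising the sample rate.

```python
import numpy as np

def upsample_lowband(audio: np.ndarray, factor: int) -> np.ndarray:
    """Raise the sample rate of a low-rate signal by an integer factor.

    Placeholder for the simulated vocal tract: here we only linearly
    interpolate, whereas the actual model would also generate the
    high-frequency components shaped by the vocal tract.
    """
    n = len(audio)
    src = np.arange(n)
    dst = np.linspace(0, n - 1, n * factor)
    return np.interp(dst, src, audio)

# Hypothetical rates: an 8 kHz low-frequency band, upsampled to 16 kHz.
low_rate = 8000
t = np.arange(low_rate) / low_rate           # one second of audio
low_band = np.sin(2 * np.pi * 200 * t)       # stand-in for synthesizer output
full_band = upsample_lowband(low_band, 2)
print(len(low_band), len(full_band))         # 8000 16000
```

The point of the split is visible in the cost profile: the interpolation step is a cheap, memory-light loop over samples that an embedded CPU handles easily, while the expensive neural synthesis stays on the server and only has to produce the shorter low-rate signal.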