Speech researchers are interested in the connection between articulation (the movement of the speech organs) and the acoustic output (the resulting speech signal). The movement of the tongue contour during speech can be recorded with several technologies, e.g. ultrasound, electromagnetic articulography (EMA), magnetic resonance imaging (MRI) and X-ray.
For tracking fast articulatory movements, ultrasound is often the most suitable solution: it is easy to use, relatively inexpensive, and can record at high resolution (800x600 pixels) and high frame rates (up to 100 frames/second). The drawback of ultrasound in this context is that before any further measurements can be made, the tongue contours must be extracted from the recorded image sequence. Traditionally, tongue contour tracking was done manually or semi-automatically, but automatic solutions (e.g. AutoTrace) have recently appeared.
In this research we investigate Deep Neural Network (DNN) based techniques for automatic tongue contour tracking, which have recently gained attention. We use ultrasound images of two speakers (one Hungarian and one American English) recorded at the Speech Production Laboratory of Indiana University. We analyze several DNN layouts of AutoTrace to determine which architecture and which abstraction of the data are most suitable for this task.
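At a high level, such networks map an (often downscaled) ultrasound frame to a fixed number of contour points. The following is a minimal sketch of that input/output abstraction only, with hypothetical toy sizes; it is not AutoTrace's actual architecture or training procedure.

```python
import math
import random

def forward(image, w1, b1, w2, b2):
    """One forward pass of a toy MLP: flattened ultrasound frame -> contour points.
    All layer sizes here are illustrative assumptions, not AutoTrace's design."""
    x = [p for row in image for p in row]                # flatten the frame to a vector
    h = [math.tanh(sum(xi * wij for xi, wij in zip(x, col)) + b)
         for col, b in zip(w1, b1)]                      # hidden layer with tanh units
    y = [sum(hi * wij for hi, wij in zip(h, col)) + b
         for col, b in zip(w2, b2)]                      # one output per contour point
    return y

random.seed(0)
H, W, HID, PTS = 8, 8, 16, 10                            # toy frame and layer sizes (assumed)
img = [[random.random() for _ in range(W)] for _ in range(H)]
w1 = [[random.uniform(-0.1, 0.1) for _ in range(H * W)] for _ in range(HID)]
b1 = [0.0] * HID
w2 = [[random.uniform(-0.1, 0.1) for _ in range(HID)] for _ in range(PTS)]
b2 = [0.0] * PTS

contour = forward(img, w1, b1, w2, b2)
print(len(contour))                                      # one predicted coordinate per tracked point
```

In practice the weights would be learned from manually traced frames; the sketch only illustrates how a frame is abstracted into a fixed-length contour representation.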
In addition, we measure how closely automatic tongue contour tracking can approach manual tracing, depending on the size of the training data. We compare several error measures for quantifying the typical errors (e.g. deviation from the original contour; missing tongue contour sections).
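One common family of contour-comparison measures averages, for each point on the automatic trace, the distance to the nearest point on the manual trace. The sketch below shows this nearest-point distance; the specific measures compared in this research are not detailed here, and the example contours are hypothetical.

```python
import math

def mean_nearest_distance(pred, ref):
    """Average distance (in pixels) from each predicted contour point
    to its nearest point on the reference (manually traced) contour."""
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])
    return sum(min(dist(p, r) for r in ref) for p in pred) / len(pred)

ref  = [(x, 50.0) for x in range(0, 100, 10)]   # hypothetical manual trace
pred = [(x, 52.0) for x in range(0, 100, 10)]   # automatic trace, offset by 2 px
print(mean_nearest_distance(pred, ref))          # → 2.0
```

Note that this one-directional measure does not penalize missing sections of the automatic contour; a symmetrized variant (also averaging distances from the reference to the prediction) is needed to capture that error type.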