Testing human speaker verification capabilities over the Internet

Dr. Szaszák György József
Department of Telecommunications and Media Informatics

The Laboratory of Speech Acoustics at BME-TMIT received phone recordings from an insurance company. The company suspected that a man had disguised his voice and committed, or tried to commit, fraud under the names of several clients. It asked the laboratory to analyze the recordings and prove the fraud. I therefore worked on speaker recognition, the field to which this task belongs.

I started by collecting studies on subjective tests, then I specified test environments that can be accessed over the Internet. After that, I designed, implemented and tested them iteratively. I used Silverlight technology to create the web interfaces of the test environments. For both test environments, a representative population (50 and 40 men and women, respectively) filled out the tests, and I then evaluated the results with Python scripts and Excel spreadsheets.

I stated three hypotheses, and the following can be said about them. The facts partly reject the first hypothesis, namely that listeners can identify a person's mimicry or recognize another speaker's voice at a good rate on the basis of a reference recording. The rate of correct decisions ranged from 17 to 95 percent across the audio sample pairs, which shows that people did not choose randomly, but that the impostor mimicked the clients with varying success. The facts also reject the second hypothesis, namely that people achieve better results in speaker recognition if they listen to slightly more than one word (e.g. 2 or 3 words) from the speaker. Considering all recordings, the rate of correct decisions hardly grew, from 50 to 55 percent. Nevertheless, the facts partly support my third hypothesis: people achieve better results in recognizing mimicry if the original speaker's reference sample is available. In summary, testing human speaker recognition capabilities over the Internet showed that people are not good at speaker recognition in every case.
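Whether a rate of correct decisions differs from random choice can be checked with an exact binomial test. The sketch below is only an illustration and not the evaluation script used in the thesis. It assumes that each sample pair was a two-choice decision with a 50 percent chance level and that listeners answered independently. The function name and the listener counts in the example are hypothetical.

```python
from math import comb


def binomial_two_sided_p(successes: int, trials: int, chance: float = 0.5) -> float:
    """Exact two-sided binomial p-value against a given chance level.

    Sums the probabilities of all outcomes that are at most as likely
    as the observed one under the null hypothesis of random choice.
    """
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    # Probability of every possible number of correct answers under chance.
    pmf = [
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(trials + 1)
    ]
    observed = pmf[successes]
    # Small relative tolerance so ties are not lost to rounding.
    p_value = sum(p for p in pmf if p <= observed * (1 + 1e-9))
    return min(1.0, p_value)


# Hypothetical counts: 50 listeners judging one sample pair.
for correct in (9, 25, 47):
    print(correct, binomial_two_sided_p(correct, 50))
```

A small p-value for a pair means that the listeners' answers are unlikely under pure guessing, both when they are mostly right and when they are mostly wrong.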

After that, I turned to the study of acoustic parameters. Based on scientific papers, I chose some parameters to analyze, then I designed the necessary scripts and implemented them in the Praat software. I segmented and annotated the phone recordings, first automatically and then with manual correction. I ran the scripts, then I processed and evaluated the results with Matlab measurements and Excel spreadsheets. I compared my subjective evaluation with statistical tests.

I found that in the subjective tests, the pitch parameter helped people the most in choosing the right answer. Furthermore, considering all recordings, the jitter parameters played an important role in identifying the mimicry, and also in discriminating and identifying the speakers. These parameters express the frequency fluctuation of the vocal cord vibration. So after the subjective tests, which approach speaker recognition from the human point of view, I also obtained tangible results on the objective side of this area by analyzing some acoustic parameters. I compared the results of both methods and drew conclusions from them. All in all, I achieved my goals.
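To make the jitter parameter more concrete, the sketch below computes local jitter, defined as the mean absolute difference between consecutive glottal periods divided by the mean period. This is a minimal illustration in Python, not the Praat script used in the thesis. Which jitter variants were actually measured is not stated here, so the choice of the local variant is an assumption, and the function name and the period values are hypothetical.

```python
def local_jitter(periods: list[float]) -> float:
    """Local jitter of a sequence of glottal period lengths (in seconds).

    Returns the mean absolute difference between neighbouring periods,
    normalised by the mean period (a dimensionless ratio).
    """
    if len(periods) < 2:
        raise ValueError("at least two periods are needed")
    # Absolute change from each period to the next one.
    diffs = [abs(b - a) for a, b in zip(periods[:-1], periods[1:])]
    mean_abs_diff = sum(diffs) / len(diffs)
    mean_period = sum(periods) / len(periods)
    return mean_abs_diff / mean_period


# Hypothetical period lengths around 10 ms (about 100 Hz pitch).
example_periods = [0.0100, 0.0102, 0.0099, 0.0101, 0.0100]
print(f"local jitter: {local_jitter(example_periods) * 100:.2f} %")
```

A perfectly regular vibration gives a jitter of zero, and larger values indicate stronger cycle-to-cycle fluctuation of the period.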

I think that speaker recognition is one of the developing areas of speech technology: it offers further opportunities for researchers, and its relationship with industry is becoming more and more pronounced.

