Artificial intelligence models struggle to capture the full range and nuance of human speech expressivity
In a fascinating summer research opportunity, three undergraduate students at the University of Pennsylvania are delving into the world of speech production and perception. The Penn Undergraduate Research Mentoring Program (PURM) is sponsoring this 10-week project, which is being led by Jianjing Kuang, an associate professor of linguistics in the School of Arts & Sciences and director of the Penn Phonetics Laboratory.
The team consists of Kevin Li and Henry Huang, both second-year computer science students, and Ethan Yang, a third-year mechanical engineering major from Diamond Bar, California. Their research compares human speech with AI-generated speech, a topic that has drawn significant attention in the field of artificial intelligence.
The students designed a perception experiment that asks human listeners to rate the naturalness of an audio clip and to identify whether the speaker is human or AI. They generated the sentence "Molly mailed a melon" with 15 AI text-to-speech (TTS) platforms and are recording audio from human volunteers in Kuang's recording studio for comparison.
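As a rough illustration of how such a two-part perception trial could be organized (this is not the team's actual code; the clip filenames, rating scale, and CSV layout are all assumptions), a minimal Python sketch:

```python
import csv
import random

# Hypothetical stimulus lists: one clip per TTS platform plus a few human
# recordings. Filenames are placeholders, not the team's actual stimuli.
ai_clips = [f"tts_platform_{i:02d}_molly.wav" for i in range(1, 16)]
human_clips = [f"human_speaker_{i:02d}_molly.wav" for i in range(1, 6)]

trials = [(clip, "ai") for clip in ai_clips] + [(clip, "human") for clip in human_clips]
random.shuffle(trials)  # present stimuli in a random order for each listener

with open("responses.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["clip", "true_source", "naturalness_1to5", "judged_source"])
    for clip, true_source in trials:
        # In a real experiment the clip would be played back here; this
        # sketch just prompts for the two judgments per trial.
        rating = input(f"{clip}: naturalness 1-5? ")
        judgment = input(f"{clip}: human or ai? ")
        writer.writerow([clip, true_source, rating, judgment])
```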
The team's findings reveal "huge variability among the models." Some models struggled to emphasize particular words, while others, such as those from OpenAI and Google Gemini, handled emphasis more capably. Interestingly, the TTS models had an easier time emphasizing "Molly" than words later in the sentence.
Kevin Li found that the average duration of the word "mailed" was significantly longer in the human recordings than in the output of any of the TTS models. The students also found that most TTS models failed to place emphasis on the correct word in the sentence.
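One way such a duration comparison could be computed, assuming each recording has already been word-aligned (for example, with a forced aligner) into (word, start, end) tuples; the alignment values below are invented for illustration:

```python
from statistics import mean

# Hypothetical word alignments per utterance: (word, start_sec, end_sec),
# as might be exported from a forced aligner. All values are invented.
human_alignments = [
    [("molly", 0.10, 0.55), ("mailed", 0.55, 1.10), ("a", 1.10, 1.18), ("melon", 1.18, 1.75)],
    [("molly", 0.12, 0.60), ("mailed", 0.60, 1.20), ("a", 1.20, 1.30), ("melon", 1.30, 1.90)],
]
tts_alignments = [
    [("molly", 0.05, 0.40), ("mailed", 0.40, 0.72), ("a", 0.72, 0.78), ("melon", 0.78, 1.20)],
]

def word_durations(alignments, target):
    """Collect the duration of `target` across all aligned utterances."""
    return [end - start
            for utterance in alignments
            for word, start, end in utterance
            if word == target]

print("human 'mailed' mean duration:", mean(word_durations(human_alignments, "mailed")))
print("tts   'mailed' mean duration:", mean(word_durations(tts_alignments, "mailed")))
```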
Jianjing Kuang believes that working with AI has implications for better understanding human speech and what makes it unique. She aims to build bridges between science and industry, noting that the AI field needs linguists' knowledge to determine how good a model is and to help move the technology closer to truly natural and expressive AI speech.
Listeners identified human versus AI speech with very high accuracy, suggesting that AI speech is still far from human-like. The project not only sheds light on the current state of AI speech but also points the way toward future improvements in the field.
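Scoring such an identification task is straightforward arithmetic; a minimal sketch, assuming responses are stored as (true_source, judged_source) pairs like those logged in the earlier sketch:

```python
# Hypothetical listener responses: (true_source, judged_source) pairs.
responses = [
    ("ai", "ai"), ("ai", "ai"), ("human", "human"),
    ("ai", "human"), ("human", "human"), ("ai", "ai"),
]

# Accuracy = proportion of trials where the judgment matched the true source.
correct = sum(1 for true, judged in responses if true == judged)
accuracy = correct / len(responses)
print(f"identification accuracy: {accuracy:.1%}")  # e.g. 83.3%
```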
As the project continues, the students are building a clearer picture of where AI speech falls short of human speech. Their work also underscores the value of undergraduate research opportunities and the impact they can have on the future of AI and speech synthesis.
Major players in AI and speech synthesis, such as NVIDIA and OpenAI, continue to drive advances in generative models and speech technologies. As the students' research progresses, it will be worth watching how their findings align with, and inform, the work of these industry leaders.