
Tackling Bias in Automatic Speech Recognition - Two Examples From Our Ongoing Work

Rosy Southwell is a postdoctoral research scientist at the University of Colorado Boulder who holds a PhD in Cognitive Neuroscience from University College London, UK, and an MS in Natural Sciences from the University of Cambridge, UK. As part of iSAT, Rosy works on automatic speech recognition and processing to extract as much information as possible from noisy audio recorded in the classroom.

Dr. Wayne Ward is a Research Professor at the University of Colorado Boulder whose research involves applying supervised machine learning to automatic speech recognition, dialog modeling, and extracting semantic representations from speech and text. His recent focus has been on applying these technologies to question answering and virtual tutoring systems.

AI systems that are designed to offer real-time classroom support need to be able to understand what students are saying, and to do so with high accuracy. This requires Automatic Speech Recognition (ASR), the process by which spoken language is automatically converted into text. The text can then be used by an AI to understand how students are working together.

A key consideration when developing an AI system is how it is trained and the data it learns from. In the context of speech recognition, the AI is trained on a large collection of audio recordings from many different speakers. These systems have become much more accurate in recent years, especially for adults from particular demographics (white, native English speakers with US accents), but that training data does not reflect the diversity of speakers in the world.

The question is: how will an AI perform in a classroom setting where it is mostly children and teenagers who are talking? They may come from diverse backgrounds, speak in a variety of accents, and use Gen Z slang. In our work, we have found this domain to be especially challenging for existing speech recognition systems, in part because it is still very uncommon for children's speech to be used to train an AI system. Let's discuss two variables in our data where ASR shows its weaknesses: age and race.

First, let's look at how we can adapt ASR to work better for students of all ages. We have a lot of training data from adults and elementary school students, and a small amount of test data from 9th graders. The word error rate (WER), the percentage of words the model transcribes incorrectly, can help us figure out what's going on. For models trained on adult speech, WER is 8% for adult speakers, but on our evaluation set of 9th-grade speech it reaches 56%. In other words, the ASR gets it wrong more than half of the time! Models trained on elementary school children's speech achieve a WER of about 9% when tested on children of the same age, but for 9th graders it jumps to a whopping 46%. This shows that models trained on one age group do not generalize well to a different age group.
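To make the metric concrete, here is a minimal sketch of how a WER can be computed, using a standard word-level edit distance. This is a generic illustration, not the exact scoring tooling used in our experiments:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Classic dynamic-programming edit distance, counting word
    # substitutions, deletions, and insertions.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("cat" -> "hat") in a four-word reference: WER = 25%.
print(wer("the cat sat down", "the hat sat down"))  # 0.25
```

A WER of 56% means that, by this count, more than half the words in the reference transcript were substituted, deleted, or inserted incorrectly.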

We can make improvements by starting with adult models and using a process called "fine-tuning," where a model goes through additional training on different data to adapt it to a new domain. Fine-tuning the adult models on elementary school children's speech improved the WER on 9th-grade speech slightly, to 41%. To address the scarcity of training data for specific age groups, we are working on new techniques to adapt models to different age groups using very small amounts of age-specific data.
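As a rough intuition for what fine-tuning does, the sketch below pre-trains a tiny linear model on plentiful data from one "domain" and then continues training it on a small sample from a shifted domain. Everything here is a synthetic stand-in, not our actual ASR models or data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "domains": the source domain maps features to targets via w_source;
# the target domain uses a shifted weight vector (hypothetical setup).
w_source = np.array([1.0, -2.0, 0.5])
w_target = w_source + np.array([0.8, 0.6, -0.4])

def make_data(w, n):
    X = rng.normal(size=(n, 3))
    y = X @ w + rng.normal(scale=0.1, size=n)
    return X, y

def train(X, y, w_init, lr=0.05, steps=200):
    """Plain gradient descent on mean squared error."""
    w = w_init.copy()
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

# 1. "Pre-train" on plentiful source-domain data (like adult speech).
X_src, y_src = make_data(w_source, 2000)
w0 = train(X_src, y_src, np.zeros(3))

# 2. The pre-trained model performs poorly on the new domain.
X_tgt, y_tgt = make_data(w_target, 50)  # only a little target-domain data
err_before = mse(w0, X_tgt, y_tgt)

# 3. Fine-tune: continue training from w0 on the small target-domain set.
w_ft = train(X_tgt, y_tgt, w0, steps=100)
err_after = mse(w_ft, X_tgt, y_tgt)

print(err_after < err_before)  # fine-tuning reduces in-domain error
```

The same principle applies to neural ASR models: starting from weights learned on a large adult corpus and continuing training on a small amount of child speech adapts the model without needing a large child-speech corpus.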

Second, there is concern within the AI community about "accuracy biases" in AI systems that can disadvantage certain demographics, such as non-white speakers. As a team we have often discussed bias in AI models, identifying places where AI could be affected by bias and how we can mitigate it. In some of our recent work, we found that a popular ASR tool, on which we base automatic feedback for tutors, is 24% less accurate for Black speakers than for white speakers because the acoustics of their voices are not as well modeled by the AI. We do not have access to the data used to train this tool, but one likely reason for the accuracy bias is that the model was not shown enough speech from Black speakers during training. If an AI can't "hear" individuals accurately, then this has consequences for its ability to provide helpful feedback! We used fine-tuning to reduce the accuracy gap between Black and white tutors by around a third, and also improved the ASR accuracy for both groups of tutors. But from just these two examples, it is clear that there is still a lot more work to do to overcome these bias issues!
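For readers who want to see how such a gap can be quantified, here is a small illustration using hypothetical WER values (not the study's actual numbers): one common approach expresses the gap as the difference between group WERs, relative to the better-performing group.

```python
def relative_gap(wer_worse: float, wer_better: float) -> float:
    """How much higher one group's WER is, as a fraction of the other group's."""
    return (wer_worse - wer_better) / wer_better

# Hypothetical WERs before fine-tuning: 31% vs. 25% -> a 24% relative gap.
gap_before = relative_gap(0.31, 0.25)

# Hypothetical WERs after fine-tuning: both groups improved, and the gap narrowed.
gap_after = relative_gap(0.22, 0.19)

# Fraction of the original gap that fine-tuning closed (about a third here).
reduction = 1 - gap_after / gap_before
print(round(gap_before, 2), round(reduction, 2))
```

Note that both properties matter: fine-tuning should narrow the gap between groups while also lowering (or at least not raising) the error rate for every group.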