POV: You’re collecting recordings of human speech to create a dataset that helps improve how voice tech understands people. And by people we mean all peoples, not just ones who speak languages belonging to developed nations or dialects associated with privileged groups. How would you create that dataset? How would you include voice data that goes beyond the voice you put on when you say, “Alexa, set a timer for 20 minutes?” How do you collect speech data that authentically sounds like you’re talking with another person?
A starting point is the spontaneous speech project, Mozilla Common Voice’s latest initiative. Common Voice collects voice data from people all over the world and offers it to software developers so that voice systems can better recognize everybody. Here’s how data donation with Common Voice usually goes: a prompt appears on screen, the user reads it aloud, rinse and repeat. “This creates a dataset we call ‘prepared speech,’” says EM Lewis-Jong, Common Voice’s product director. “The upside is that all these audio clips are essentially already transcribed, which is crucial because developers need transcribed audio to build out their speech recognition systems. The downside is that people speak very differently when they’re reading versus when they’re speaking.”
Unless you’re a Vulcan, the way you speak to computers is markedly different from the way you speak with people. With computers, we tend to enunciate very clearly, whether we’re telling Siri to “set a 20-minute timer” or, more frustratingly, talking to the “customer service” line at our bank that insists on keeping us in an automated-response doom loop. “When we talk with others, we use slang, there are pauses, there are disfluencies,” says EM. “You tend not to do any of those things when just reading sentences aloud. What that means is that data about how people organically speak, how they colloquially engage with others, is really useful data for developers as well.”
How Do You Collect Speech Data Spontaneously & Privately?
Those concerned about privacy need not worry: all the audio that Common Voice collects is anonymized, with personal information stripped out before it is added to the open source database. But how exactly do you collect speech data spontaneously?
According to EM, a spontaneous speech request is much more open-ended than a prepared speech request. “We’ll serve people a prompt instead of a sentence and they’ll respond.” For example, a prepared speech prompt could say “I’m going to the beach this weekend,” whereas a spontaneous speech prompt could ask “What are your plans this weekend?”
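For developers, the practical difference between the two kinds of request can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Clip` record, its field names, and the helper functions are hypothetical and do not reflect Common Voice’s actual schema or tooling. The sketch shows that a prepared clip arrives with its transcript (the sentence that was read), while a spontaneous clip still needs one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Clip:
    """Hypothetical record for one donated recording (not Common Voice's real schema)."""
    prompt: str                       # text shown to the speaker
    kind: str                         # "prepared" or "spontaneous"
    audio_path: str                   # where the recording is stored
    transcript: Optional[str] = None  # what was actually said, if known


def record_prepared(sentence: str, audio_path: str) -> Clip:
    # The speaker reads the sentence aloud, so the prompt doubles as the transcript.
    return Clip(prompt=sentence, kind="prepared",
                audio_path=audio_path, transcript=sentence)


def record_spontaneous(question: str, audio_path: str) -> Clip:
    # The speaker answers freely, so the transcript is unknown until someone writes it.
    return Clip(prompt=question, kind="spontaneous",
                audio_path=audio_path, transcript=None)


def needs_transcription(clip: Clip) -> bool:
    return clip.transcript is None


prepared = record_prepared("I'm going to the beach this weekend", "clip_001.mp3")
spontaneous = record_spontaneous("What are your plans this weekend?", "clip_002.mp3")
```

Here `needs_transcription(prepared)` is false and `needs_transcription(spontaneous)` is true, which is the trade-off EM describes: spontaneous clips capture natural speech but lose the built-in transcript.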
Why Collect Spontaneous Speech?
There’s an obvious advantage to collecting spontaneous speech data: nearly everyone sounds different when they talk to a computer compared to when they talk to a human. For some communities, there are further benefits. “Surprisingly, few datasets accommodate code-switching,” says EM. “There’s an added benefit if you belong to a community that uses multiple languages over a single sentence or might use multiple languages in a single conversation. While this is a lived reality for many communities, datasets rarely include this.”
Serving communities that share this experience matters when this is the sort of inclusion they are eager to see, and not just for the sake of Alexa and Siri. “Health care translation is one space where we hope to see this sort of dataset used,” says EM. “Imagine you’re a doctor working in a hospital and you’re serving a community that speaks a language you don’t speak. Imagine media contexts where you’re building captioning services in unsupported languages. Imagine educational uses where you’re helping people learn a second language in a context where they can’t find a human conversation partner. It’s these sorts of scenarios where we see the spontaneous speech dataset being most useful.”
Speech Data, But Make It Open Source
One challenge for the Common Voice team in collecting this and other types of speech data is that the data is always changing. Languages are alive and change every day. Keeping this sort of data open source allows researchers to keep up. “On the one hand, we worry about AI training datasets that involve scraping tons of web content without engaging with people to ensure they’re getting what they need,” says EM. “On the other hand, it’s important to us that we keep this data open source. We also include speech from 100-year-old literature in our dataset since the copyright has expired. The downside is, whose voices were most prominent 100 years ago? Not those belonging to Black and Indigenous peoples.” Needless to say, if those texts don’t include underrepresented voices, they surely don’t include the communities mentioned earlier, where multiple languages coexist in the same sentence. Keeping the data open source allows researchers to learn from Common Voice’s speech collection in the present, 100 years from now and everything in between.
Mozilla Common Voice’s Latest Ambition: Getting Voice Tools To Understand Natural Conversation & Casual Speech
Written By: Xavier Harding
Edited By: Audrey Hingle, Kevin Zawacki, Tracy Kariuki
Art By: Shannon Zepeda