AI system for headphones translates multiple voices simultaneously while preserving speakers' intonation
Anyone who has encountered simultaneous interpretation knows that, in its classic form, it is not only resource-intensive but also inherently slow: the interpreter must first hear what is said, formulate a translation, and then speak it. When several people are talking at once, the task becomes harder still, and listeners' attention and comprehension suffer as a result. That is before accounting for possible errors or inaccuracies in the translation, and the emotional coloring of speech, often no less important than its content, is largely lost in this format. There is hope, however, that AI will soon help solve these problems.
A team of researchers at the University of Washington has unveiled an innovative AI system for headphones that can simultaneously translate the speech of multiple people in real time, while preserving the intonation and direction of each person's voice. The solution, called Spatial Speech Translation, aims to overcome one of the biggest obstacles to automatic translation: situations where multiple people are speaking at the same time.
Imagine having lunch with friends who speak different languages. Even if you don't understand any of those languages, you can still follow the conversation: this is exactly the experience the authors of the new system were trying to create.
Spatial Speech Translation determines the direction each sound comes from and distinguishes the voice of every speaker, so the headphone user can tell who is saying what, even in noisy environments. That would mean erasing language barriers regardless of conditions and circumstances, and isn't that what many people aspire to?
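To give a sense of what the spatial part involves, here is a minimal sketch of the classic approach to locating a voice with two ear-mounted microphones: estimate the time difference of arrival (TDOA) between the channels via cross-correlation, then convert it to an angle. The researchers' actual method is not described in this article, and all names and parameter values below (sample rate, microphone spacing) are illustrative assumptions.

```python
import numpy as np

SAMPLE_RATE = 16_000     # Hz; a common rate for speech processing (assumed)
MIC_DISTANCE = 0.18      # meters between left and right earcup mics (assumed)
SPEED_OF_SOUND = 343.0   # m/s in air at room temperature

def estimate_azimuth(left: np.ndarray, right: np.ndarray) -> float:
    """Estimate where a sound came from, in degrees.

    0 = straight ahead; positive = toward the right microphone.
    """
    # Cross-correlate the channels; the lag of the peak is the TDOA in samples.
    corr = np.correlate(left, right, mode="full")
    lag = int(np.argmax(corr)) - (len(right) - 1)
    tdoa = lag / SAMPLE_RATE  # seconds; positive when the left channel lags

    # Far-field approximation: tdoa = (d / c) * sin(theta).
    sin_theta = np.clip(tdoa * SPEED_OF_SOUND / MIC_DISTANCE, -1.0, 1.0)
    return float(np.degrees(np.arcsin(sin_theta)))
```

A real system would run something like this per speaker on separated streams and smooth the estimates over time; the point here is only that two microphones plus a time delay are enough to recover direction.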
Unlike existing solutions such as Meta's Ray-Ban smart glasses, which focus on translating a single speaker, the new system handles multiple voices at once and produces much more natural-sounding translations. It is compatible with popular noise-canceling headphones with microphones, connected to a laptop built around the Apple M2 chip, the same chip used in the Apple Vision Pro, which supports real-time neural network inference.
The work was presented this month at the ACM CHI Conference on Human Factors in Computing Systems in Yokohama, Japan.
The system uses two AI models. The first, a spatial model, divides the surrounding space into sectors, detects speakers, and determines the direction each voice is coming from. The second, a linguistic model, translates from French, German, or Spanish into English and was trained on open datasets. Yes, the set of languages is still quite limited. But what makes the approach unique is that the system also captures intonation, volume, and pitch, and reproduces them in the translation. As a result, the output sounds almost like a "clone" of the original voice and arrives from the corresponding direction, rather than like a synthetic voice coming from the headphones.
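Conceptually, this is a two-stage pipeline: separate and localize, then translate and re-voice. The sketch below captures that structure under loud assumptions: every class, function, and field name is hypothetical, and the two `NotImplementedError` placeholders stand in for the actual neural networks, which are not described in this article.

```python
from dataclasses import dataclass

@dataclass
class LocalizedVoice:
    """One separated speaker, as the spatial stage might hand it off."""
    audio: bytes         # single-speaker audio, isolated from the mixture
    azimuth_deg: float   # direction the voice came from
    pitch_hz: float      # vocal traits the linguistic stage should preserve
    loudness_db: float

def spatial_stage(mixed_audio: bytes) -> list[LocalizedVoice]:
    """Scan the sound field sector by sector, detect speakers, and return
    one separated, localized stream per speaker."""
    raise NotImplementedError  # placeholder for the spatial neural network

def linguistic_stage(voice: LocalizedVoice, target_lang: str = "en") -> bytes:
    """Translate one voice (French/German/Spanish -> English) and synthesize
    speech that keeps the speaker's pitch, loudness, and intonation."""
    raise NotImplementedError  # placeholder for translation + voice cloning

def translate_scene(mixed_audio: bytes) -> list[tuple[float, bytes]]:
    """Full pipeline: (direction, translated audio) pairs, so playback can
    render each translation from the direction its speaker occupies."""
    return [(v.azimuth_deg, linguistic_stage(v)) for v in spatial_stage(mixed_audio)]
```

Keeping the direction attached to each translated stream is what would let the headphones spatialize the output, so the "clone" of each voice appears to come from the person who actually spoke.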
According to Samuel Cornell, a researcher at Carnegie Mellon University, voice recognition on its own is an extremely difficult task for AI, and here it has been combined with spatial positioning, real-time translation, and low latency, all on a real device.
However, the path from prototype to finished product is still long: significantly more training data and time will be needed, including "noisy" real-world recordings rather than synthetic ones.
The team is now focused on reducing the delay between spoken words and their translation, aiming for under a second in order to preserve the natural dynamics of conversation. This is difficult because the structure of a language affects how quickly it can be translated: French translates fastest, followed by Spanish, while German is harder because the verb is often placed at the end of the sentence, explains researcher Claudio Fantinuoli of the University of Mainz. The longer the system waits before translating, the better the result can be, because it has more time to understand the context; but it is always a compromise between accuracy and speed, he notes.
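That compromise can be made concrete with a simple streaming policy: buffer incoming speech and translate only once enough context has accumulated, with the threshold depending on the source language (German needs more lookahead because the verb often comes last). The numbers and names below are illustrative assumptions, not the team's actual scheduling code.

```python
from collections import deque
from typing import Callable, Iterable, Iterator

# Minimum buffered context, in seconds, before attempting a translation.
# The values are invented for illustration; German gets the longest window.
MIN_CONTEXT = {"fr": 0.5, "es": 0.6, "de": 1.2}

def stream_translate(
    chunks: Iterable[bytes],
    lang: str,
    translate: Callable[[list[bytes]], str],
    chunk_seconds: float = 0.2,
) -> Iterator[str]:
    """Consume fixed-size audio chunks and emit a translation whenever the
    buffered context reaches the language-specific threshold."""
    buffer: deque[bytes] = deque()
    buffered = 0.0
    for chunk in chunks:
        buffer.append(chunk)
        buffered += chunk_seconds
        if buffered >= MIN_CONTEXT[lang]:
            yield translate(list(buffer))  # latency is roughly MIN_CONTEXT[lang]
            buffer.clear()
            buffered = 0.0
```

Raising a threshold gives the translator more context and usually better output; lowering it moves toward the sub-second goal at the cost of accuracy, which is exactly the trade-off Fantinuoli describes.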
Based on an article by Rhiannon Williams for MIT Technology Review