Inspired by the mechanics of the human vocal tract, a new artificial intelligence (AI) model can produce and understand vocal imitations of everyday sounds.
This approach could aid in the development of new audio interfaces for the entertainment and education sectors.
Imitating sounds with your voice is like drawing a quick picture to convey something you see. Instead of using a pencil to illustrate the image, you use your vocal tract to represent the sound. While this may seem difficult, it is something people do naturally. Try imitating an ambulance siren, a crow's cry, or a bell to experience this.
Inspired by cognitive science research on how we communicate, researchers at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) have developed an AI system capable of producing human-like vocal imitations of sounds without any training and without ever having "heard" a human vocal imitation before.
To achieve this, the team designed their system to produce and interpret sounds in the same way humans do. They started by building a model of the human vocal tract, simulating how vibrations from the larynx are shaped by the throat, tongue, and lips. They then used a cognitively inspired AI algorithm to drive the model, generating vocal imitations while accounting for the context-specific ways people communicate sound.
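The article does not describe the team's vocal tract model in detail, but the underlying idea, a sound source from the larynx shaped by resonances of the throat, tongue, and lips, is the classic source-filter view of speech. The Python sketch below illustrates that view only; the sample rate, filter design, function names, and formant values are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.signal import lfilter

SR = 16_000  # sample rate (Hz), chosen for illustration

def glottal_source(f0, duration):
    """Crude larynx excitation: an impulse train at pitch f0."""
    n = int(SR * duration)
    src = np.zeros(n)
    src[::int(SR / f0)] = 1.0
    return src

def formant_filter(signal, freq, bandwidth):
    """Shape the source with one vocal-tract resonance (a two-pole filter)."""
    r = np.exp(-np.pi * bandwidth / SR)
    theta = 2 * np.pi * freq / SR
    a = [1.0, -2 * r * np.cos(theta), r ** 2]  # poles at the resonance
    return lfilter([1.0 - r], a, signal)

def imitate(f0, formants, duration=0.5):
    """Source-filter synthesis: a larynx source shaped by throat/tongue/lip resonances."""
    out = glottal_source(f0, duration)
    for freq, bw in formants:
        out = formant_filter(out, freq, bw)
    return out / (np.abs(out).max() + 1e-9)

# Example: a rough vowel-like tone; sweeping f0 and formants over time would
# approximate something like a siren.
wave = imitate(f0=180, formants=[(700, 110), (1200, 120), (2600, 160)])
```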
The model can imitate a wide range of environmental sounds, such as rustling leaves, the hiss of a snake, or an ambulance siren. Furthermore, the model can work in reverse to infer real-world sounds from human vocal imitations, much like how some computer vision systems can recover high-quality images from sketches. For example, the model can accurately distinguish between a cat's "meow" and "purr" when imitated by a human.
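The article does not spell out how this reverse inference works; as a rough illustration of the idea, the sketch below matches a human imitation against candidate recordings by comparing coarse spectral features. The feature extractor, distance measure, and all names are assumptions made for the example, not the system's actual method.

```python
import numpy as np

def coarse_spectrum(wave):
    """Crude feature extractor: a pooled log power spectrum.
    (The real system reasons over a richer auditory representation.)"""
    spec = np.abs(np.fft.rfft(wave, n=2048)) ** 2
    bands = spec[:1024].reshape(64, 16).mean(axis=1)  # pool into 64 coarse bands
    return np.log(bands + 1e-9)

def infer_source(imitation, candidates):
    """Given a human imitation, pick the real sound it most likely refers to
    by comparing its features against each candidate recording.
    `candidates` maps a label (e.g. "meow", "purr") to a waveform."""
    query = coarse_spectrum(imitation)
    scores = {label: -np.linalg.norm(query - coarse_spectrum(wav))
              for label, wav in candidates.items()}
    return max(scores, key=scores.get)

# Toy usage with synthetic waveforms standing in for recordings:
rng = np.random.default_rng(1)
meow, purr = rng.normal(size=8000), rng.normal(size=8000)
guess = infer_source(meow + 0.1 * rng.normal(size=8000),
                     {"meow": meow, "purr": purr})
```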
In the future, this model could lead to more intuitive "imitation-based" interfaces for sound designers, more human-like AI characters in virtual reality, and even methods to help students learn foreign languages.
The study's lead authors, MIT CSAIL graduate students Kartik Chandra and Karima Ma together with student researcher Matthew Caren, note that computer graphics researchers have long recognized that realism is not the ultimate goal of visual expression. For example, an abstract painting or a child's doodle can be just as expressive as a photograph.
The art of vocal imitation, in three stages
The team developed three increasingly sophisticated versions of the model and compared them with human vocal imitations. First, they created a basic model that focused only on generating imitations as acoustically close to real sounds as possible, but this model did not match human behavior well.
The team then designed a second model, the "communication" model. According to Caren, this model takes into account the elements of a sound that are distinctive to a listener. For example, you might imitate the sound of a boat by mimicking the roar of its engine, because that is the sound's most recognizable feature even though it is not its loudest element; the lapping of the water may well be louder. This model improved significantly over the first version.
Finally, the team added a layer of reasoning to the model. "The simulated sounds can vary depending on how much effort you put into them," Chandra explains. "It takes time and energy to produce accurate sounds." The team's final model takes this into account by avoiding imitations that are too fast, too loud, or too high- or low-pitched, features that are unlikely to occur in ordinary speech. The result is more human-like imitations that reflect many of the decisions humans make when imitating the same sounds.
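To make the three stages concrete, here is a toy Python sketch of how such a combined objective might look: a fit term (the baseline model), a distinctiveness term relative to confusable sounds (the communication model), and an effort penalty (the reasoning layer). The scoring function, feature representation, and weights are all illustrative assumptions rather than the paper's actual objective.

```python
import numpy as np

def distance(a, b):
    """Acoustic mismatch between two sounds, here just L2 over feature vectors."""
    return float(np.linalg.norm(a - b))

def effort(features):
    """Stand-in articulatory effort: penalize loud, fast, or extreme-pitched
    imitations (summarized here as overall feature magnitude)."""
    return float(np.linalg.norm(features))

def score(imitation, target, distractors, w_fit=1.0, w_comm=1.0, w_effort=0.3):
    """Combine the three stages described above (weights are illustrative):
    1) baseline fit to the target sound,
    2) communicative distinctiveness vs. confusable sounds,
    3) a penalty for effortful productions."""
    fit = -distance(imitation, target)
    nearest_confusion = max(-distance(imitation, d) for d in distractors)
    communicative = fit - nearest_confusion
    return w_fit * fit + w_comm * communicative - w_effort * effort(imitation)

# Toy usage: pick the candidate imitation that best trades off all three terms.
rng = np.random.default_rng(0)
target = rng.normal(size=8)                            # features of the real sound
distractors = [rng.normal(size=8) for _ in range(3)]   # confusable sounds
candidates = [rng.normal(size=8) for _ in range(20)]   # candidate vocal-tract outputs
best = max(candidates, key=lambda c: score(c, target, distractors))
```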
Towards more expressive sound technology
This model could help artists better communicate sounds to computational systems, helping filmmakers and content creators produce more contextually relevant AI-generated sounds. It could also allow musicians to quickly search sound databases by imitating a noise that is difficult to describe in text.
Meanwhile, the team is looking at applications of the model in other areas, including language development, how babies learn to speak, and the mimicry behavior of birds such as parrots and songbirds.
However, the current model still has some limitations: it struggles with consonants like "z," leading to inaccurate imitations of sounds like a bee buzzing. It also can't yet replicate how humans imitate speech, music, or sounds that are imitated differently across languages, such as a heartbeat.
“The transition from a real cat sound to the word ‘meow’ shows the complex interplay between physiology, social reasoning, and communication in the evolution of language,” said Robert Hawkins, a professor of linguistics at Stanford University. “This model is an exciting step forward in formalizing and testing theories about these processes.”
(Source: MIT News)
Source: https://vietnamnet.vn/day-ai-bieu-dat-am-thanh-2362906.html