Inspired by the mechanics of the human larynx, a new artificial intelligence (AI) model can produce and understand vocal imitations of everyday sounds.
This approach could support the development of new audio interfaces for the entertainment and education sectors.

Mimicking sounds with your voice is like sketching a quick picture to convey something you've seen. Instead of using a pencil to illustrate the image, you use your vocalizations to express the sound. While this may seem difficult, it's something everyone does naturally. Try mimicking an ambulance siren, a crow's caw, or a bell to experience this.
Inspired by cognitive science research on how we communicate, researchers at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) have developed an AI system that can produce human-like vocal imitations without any training, and without ever having "heard" a human vocal imitation before.
To achieve this, the team designed their system to produce and interpret sound much as humans do. They began by building a model of the human vocal tract that simulates how vibrations from the larynx are shaped by the throat, tongue, and lips. They then used a cognitively inspired AI algorithm to drive this model, generating imitations while accounting for the context-specific ways people choose to communicate sound.
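To make that pipeline concrete, here is a minimal source-filter sketch in the same spirit: an impulse train stands in for the larynx, and two simple resonators stand in for the filtering of the throat, tongue, and lips. This is purely illustrative, not the team's implementation; the sample rate, formant frequencies, bandwidths, and pitch sweep are all assumed values.

```python
import numpy as np
from scipy.signal import lfilter

SR = 16000  # sample rate in Hz (assumed)

def glottal_source(f0, duration):
    """Impulse train at fundamental frequency f0, standing in for the larynx."""
    n = int(SR * duration)
    source = np.zeros(n)
    period = max(1, int(SR / f0))
    source[::period] = 1.0
    return source

def formant_filter(signal, freq, bandwidth):
    """Second-order resonator approximating a single vocal-tract formant."""
    r = np.exp(-np.pi * bandwidth / SR)          # pole radius from bandwidth
    theta = 2 * np.pi * freq / SR                # pole angle from center frequency
    return lfilter([1.0 - r], [1.0, -2.0 * r * np.cos(theta), r * r], signal)

# A crude siren-like imitation: sweep the pitch while the "mouth shape"
# (two fixed formants, values assumed) stays constant.
segments = []
for f0 in np.linspace(300.0, 600.0, 20):
    seg = glottal_source(f0, 0.05)
    seg = formant_filter(seg, 700.0, 100.0)      # first formant (assumed)
    seg = formant_filter(seg, 1200.0, 150.0)     # second formant (assumed)
    segments.append(seg)

audio = np.concatenate(segments)
audio /= np.abs(audio).max()                     # normalize to [-1, 1]
```

Sweeping the fundamental while holding the formants fixed yields a rough siren-like wail; the actual model searches over articulatory controls like these far more deliberately.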
This model can reproduce a wide variety of environmental sounds, such as rustling leaves, a snake's hiss, or an ambulance siren. It can also run in reverse, inferring real-world sounds from human vocal imitations, much as some computer vision systems reconstruct high-quality images from sketches. For example, the model can reliably tell whether a human is imitating a cat's meow or its purr.
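The reverse direction can be pictured as a simple inference problem: given the features of a human imitation, ask which real sound the forward model would itself imitate most similarly. The sketch below is a hedged stand-in for that idea; the `imitate` function, the two-dimensional features, and the candidate sounds are all made up for illustration.

```python
import numpy as np

def imitate(sound_features):
    """Stand-in for the forward model: map a real sound's features to the
    imitation a speaker would likely produce (assumed to be simple scaling)."""
    return 0.8 * sound_features

def infer_source(observed, candidates):
    """Return the candidate sound whose model-produced imitation is closest
    to the observed human imitation."""
    return min(candidates,
               key=lambda name: np.linalg.norm(imitate(candidates[name]) - observed))

# Made-up two-dimensional features for two cat sounds.
candidates = {
    "meow": np.array([0.9, 0.1]),
    "purr": np.array([0.1, 0.8]),
}
heard = np.array([0.7, 0.1])            # features of the imitation we heard
print(infer_source(heard, candidates))  # -> "meow"
```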
In the future, this model could lead to more intuitive “imitation-based” interfaces for sound designers, more human-like AI characters in virtual reality, and even methods to help students learn foreign languages.
The study's lead authors, MIT CSAIL graduate students Kartik Chandra and Karima Ma, together with student researcher Matthew Caren, note that computer graphics researchers have long recognized that realism is not the ultimate goal of visual expression: an abstract painting or a child's doodle can be just as expressive as a photograph.
The art of sound imitation in three stages
The team developed three increasingly sophisticated versions of the model and compared them with human vocal imitations. First, they built a baseline model that simply aimed to produce imitations as acoustically close to the real sounds as possible, but its output did not match human behavior well.
Next, the team designed a second, “communicative” model. According to Caren, this model considers what is distinctive about a sound to a listener. For example, you would likely imitate a boat by mimicking the rumble of its engine, because that is its most recognizable feature even though it is not its loudest component (the splash of the water, for instance, is louder). This model was a significant improvement over the first version.
Finally, the research team added one more layer of reasoning to the model. “Imitations can sound different depending on the amount of effort you put into them. It costs time and energy to produce sounds that are perfectly accurate,” Chandra explained. The team's complete model accounts for this by avoiding utterances that are too rapid, too loud, or too high- or low-pitched, patterns people are less likely to use in everyday communication. The result is more human-like imitations that reflect many of the decisions humans make when imitating the same sounds.
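One way to picture the complete model's trade-off is as a score that rewards acoustic similarity to the target sound but subtracts a cost for effortful articulation. The toy scoring function below is an illustration of that idea only; the feature vectors, weights, and effort terms are assumptions, not the paper's actual formulation.

```python
import numpy as np

def similarity(imitation, target):
    """Higher when the imitation's features are closer to the target's."""
    return -np.linalg.norm(imitation - target)

def effort(loudness, rate, pitch_ratio):
    """Hypothetical articulatory cost: louder, faster, and more extreme
    pitches (far from a comfortable ratio of 1.0) cost more."""
    return 0.5 * loudness**2 + 0.3 * rate**2 + 0.2 * abs(pitch_ratio - 1.0)

def score(features, loudness, rate, pitch_ratio, target, effort_weight=1.0):
    return (similarity(features, target)
            - effort_weight * effort(loudness, rate, pitch_ratio))

target = np.array([1.0, 0.2, 0.5])  # made-up features of the real sound

# Two candidate imitations: a relaxed near-match and a strained exact match.
relaxed = (np.array([0.9, 0.25, 0.5]), 0.4, 0.3, 1.1)
strained = (np.array([1.0, 0.20, 0.5]), 1.5, 1.2, 2.0)

best = max([relaxed, strained], key=lambda c: score(*c, target=target))
```

With these made-up numbers the relaxed near-match wins, mirroring the human preference for a comfortable "good enough" imitation over a strained perfect one.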
Toward more expressive audio technology
This model could help artists communicate sounds to computing systems more effectively, and could assist filmmakers and other content creators in generating AI sound effects better tailored to a specific context. It could also let musicians rapidly search sound databases by imitating a noise that is hard to describe in words.
Meanwhile, the research team is exploring applications of this model in other areas, including language development, how infants learn to speak, and the mimicry behavior of birds such as parrots or songbirds.
However, the current model still has some limitations: it struggles with consonants such as “z,” which leads to inaccurate imitations of sounds like buzzing. It also cannot yet replicate how humans imitate speech, music, or sounds that are imitated differently across languages, such as a heartbeat.
Robert Hawkins, professor of linguistics at Stanford University, commented: “The transition from the sound of a real cat to the word 'meow' demonstrates the complex interplay between physiology, social reasoning, and communication in the evolution of language. This model is an exciting step forward in formalizing and testing theories about these processes.”
(Source: MIT News)
Source: https://vietnamnet.vn/day-ai-bieu-dat-am-thanh-2362906.html





