Українська правда

Sesame's hyper-realistic voice AI is both awe-inspiring and alarming

- 5 March, 03:11 PM

A new AI voice from startup Sesame has caused a stir online, with users both admiring its realism and feeling uneasy about its very human-like manner of speaking. The company released a demo of its "conversational speech model" (CSM) in February. It blurs the line between artificial and human speech by adding expression, laughter, pauses, and even real-time self-correction, Ars Technica reports.

"I tried the demo, and it was genuinely startling how human it felt," one Hacker News user wrote. "I'm almost a bit worried I will start feeling emotionally attached to a voice assistant with this level of human-like sound."

Sesame offers two voices, a male one ("Miles") and a female one ("Maya"), and some users have already reported feeling an emotional connection with them. One parent said his 4-year-old daughter burst into tears when she was not allowed to continue talking to the AI.

Founded by Brendan Iribe, Ankit Kumar, and Ryan Brown, Sesame has already attracted significant venture capital attention, with backing from Andreessen Horowitz, Spark Capital, Matrix Partners, and others.

"At Sesame, our goal is to achieve 'voice presence'—the magical quality that makes spoken interactions feel real, understood, and valued," the company said. "We are creating conversational partners that do not just process requests; they engage in genuine dialogue that builds confidence and trust over time."

Early users report conversations lasting up to 30 minutes, with the AI sustaining discussions about philosophy, ethics, and personal emotions. The voice model sounds impressively natural, reproducing breathing, laughter, interruptions, and pauses.

But not everyone likes it. Mark Hachman, a senior editor at PCWorld, said he felt real discomfort interacting with the system, as its tone and style reminded him of a former girlfriend.

Sesame has also been compared to OpenAI's Advanced Voice Mode for ChatGPT. Some users find that Sesame sounds even more natural. It can also perform role-playing scenarios, including angry conversations, which OpenAI currently doesn't allow.

One video on Reddit shows the AI arguing with a user who is playing the role of an embezzler confronting his boss. The exchange is so dynamic that it's hard to tell which speaker is the human and which is the AI.

Sesame takes a different approach to speech generation: text and audio are processed together in a single integrated model rather than in separate stages. The voice AI is based on Meta's Llama architecture and is built from two cooperating neural networks, a backbone and a decoder. The largest model has 8.3 billion parameters and was trained on 1 million hours of English audio.

In blind tests, listeners could not reliably distinguish the AI voice from real human recordings when they heard only short phrases. In longer conversations, however, people still preferred the real voice, which suggests that the AI lacks contextual awareness.

Brendan Iribe, co-founder of Sesame, acknowledged that the model still has flaws.

"It’s still too eager and often inappropriate in its tone, prosody and pacing," he said. "Today, we're firmly in the valley, but we're optimistic we can climb out."

Despite the technological breakthrough, experts warn that realistic voice AI could increase the threat of fraud. Voice phishing (vishing) has already become a powerful tool for scammers who imitate the voices of family members, colleagues, or government officials.

Unlike current robocalls, which sound unnatural, the new generation of AI voices could remove the telltale signs that arouse suspicion, making the deception even more convincing.

Some people have already started agreeing on code words with their relatives so they can check whether they are really talking to the person they think they are.

Sesame doesn't currently support voice cloning, but in the future, open access to such technology could allow attackers to mount even more sophisticated attacks. OpenAI even delayed the launch of its own voice system, fearing it could be misused.
