Synthesia’s Express-2 Avatars Blur Reality — Next Stop: Talking Back
A studio session and a digital double
Earlier this summer I visited Synthesia’s London studio to be filmed for an AI avatar. The room was bright and professional: umbrella lights, a tripod camera and a laptop autocue. I read a scripted, upbeat monologue while trying to keep gestures natural but restrained. The footage was used to train two versions of my avatar: one built with the older Express-1 pipeline and one with the new Express-2.
From awkward motion to convincing mannerisms
Synthesia first gained attention by mapping famous faces and dubbing them into other languages. Early avatars could look impressive at a glance but often betrayed themselves with jerky gestures, mismatched voice and facial expression, or awkward lip sync. Express-2 aims to fix those gaps. The new model produces more natural hand movements, finer facial microexpressions and voice cloning that preserves the speaker’s accent and intonation.
Watching my Express-2 clone was technically impressive and strangely unnerving. It can deliver a polished, high-definition presentation that many people would take for a real recording. The avatar’s facial features and voice closely mirror mine, yet small inconsistencies remain: overly smooth skin on the palms, stiff strands of hair, glassy eyes and occasional odd intonations. Those tiny flaws sit alongside a generally convincing performance.
The creation process, simplified
Creating avatars used to require long calibration sessions: reading scripts in different emotions, mouthing phonemes and repeating gestures. The Synthesia team has streamlined that workflow. During my shoot the technical supervisor encouraged natural gestures while warning against excessive motion. The footage captured in an hour was enough to build both avatars, and the Express-2 version was a noticeably more accurate likeness.
Express-2 no longer needs to see every emotion performed because it’s trained on much larger, more diverse datasets. That allows the rendering model to infer appropriate gestures and microexpressions more reliably, reducing the time and footage required to create a believable avatar.
How the models work together
Synthesia combined a set of audio and video models for Express-2. A voice cloning model preserves the speaker’s accent, intonation and expressiveness rather than flattening them into the generic voice that earlier systems sometimes produced. The Express-Voice model interprets the script’s tone, a gesture-generation model proposes accompanying motions, a separate evaluator compares the generated motions against the audio to choose the best alignment, and a powerful rendering model synthesizes the final avatar.
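As a rough mental model of that division of labour, here is a minimal sketch of how such a staged pipeline could be wired together. None of the class or method names below come from Synthesia; they are hypothetical placeholders standing in for the voice, gesture, evaluator and rendering models described above.

```python
# Hypothetical sketch of a multi-model avatar pipeline like the one described
# above. These objects do not correspond to any real Synthesia API; they only
# illustrate how the stages could fit together.

from dataclasses import dataclass


@dataclass
class AvatarClip:
    audio: bytes  # cloned-voice waveform
    video: bytes  # rendered avatar frames


def generate_avatar_clip(script: str, voice_model, gesture_model,
                         evaluator, renderer, n_candidates: int = 4) -> AvatarClip:
    # 1. Clone the speaker's voice, keeping accent and intonation.
    audio = voice_model.synthesize(script)

    # 2. Propose several candidate gesture/motion tracks for the same audio.
    candidates = [gesture_model.propose(script, audio) for _ in range(n_candidates)]

    # 3. Score each candidate against the audio and keep the best-aligned one.
    best_motion = max(candidates, key=lambda m: evaluator.alignment_score(audio, m))

    # 4. Render the final avatar video from the chosen motion and audio.
    video = renderer.render(best_motion, audio)
    return AvatarClip(audio=audio, video=video)
```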
Express-2’s rendering model has billions of parameters, a substantial leap from Express-1’s few hundred million. That extra capacity both speeds up production and improves fidelity, enabling avatars to perform convincing microgestures and more synchronized speech and motion.
Narrowing the uncanny valley and the psychological effect
Experts note that the remaining giveaway often isn’t a single error but a sense of emptiness: the absence of genuine emotion. A psychology researcher who reviewed my avatar said she might not notice it was synthetic at first, but she felt an uncanny emptiness under the performance. That subtle lack of lived experience is what differentiates a polished synthetic speaker from a real person.
Part of my discomfort came from how the avatar exaggerated a bright, corporate cadence that doesn’t match my natural demeanor. Seeing a persistent, hyperenthusiastic version of myself felt alien. There are personal anxieties too: if avatars become common, they open new avenues for mischief and misuse, such as making someone else’s avatar utter embarrassing confessions or political statements.
Real-world applications and the next frontier
Today Synthesia focuses on corporate uses: internal communications, training, financial presentations and marketing. Other companies are also building avatar toolkits that let businesses generate videos quickly and affordably. Synthesia has also integrated Google’s Veo 3 model so that generated video clips can be embedded in its platform, which could broaden applications into education and entertainment.
The next major advance Synthesia highlights is interactivity. Imagine an avatar that understands conversational prompts and responds in real time — effectively pairing a ChatGPT-like dialogue system with a lifelike human face. That would enable personalized, adaptive learning videos and on-demand virtual presenters. Synthesia already offers clickable on-screen interactions for quizzes and is exploring fully conversational avatars that can pause, expand on a point or answer spontaneous questions.
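To make the idea concrete, an interactive avatar would essentially run a loop like the hypothetical sketch below: a dialogue model drafts each reply, a voice model speaks it in the cloned voice, and the renderer animates the avatar. Every object and method name here is an illustrative placeholder, not a real Synthesia or Google interface.

```python
# Hypothetical interaction loop for a conversational avatar. The dialogue,
# voice and rendering models are placeholders for whatever systems would
# actually be paired; nothing here reflects a real product API.

def converse(dialogue_model, voice_model, renderer) -> None:
    history = []  # running conversation context passed to the dialogue model
    while True:
        question = input("You: ")
        if not question.strip():
            break  # an empty line ends the session
        history.append({"role": "user", "content": question})

        # The dialogue model (a ChatGPT-like system) drafts the spoken reply.
        reply = dialogue_model.respond(history)
        history.append({"role": "assistant", "content": reply})

        # Voice cloning and rendering turn the text into a talking-head clip;
        # to feel conversational, both steps would need to stream with low
        # latency rather than render a full video offline.
        audio = voice_model.synthesize(reply)
        renderer.stream(audio)
```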
Ethical trade-offs and social impact
Researchers warn that coupling agentic AI with realistic faces could deepen emotional attachments and increase persuasive power. An always-available, charismatic avatar might be harder to resist than text-based chatbots. Avatars optimized to maintain engagement could alter how humans connect, rival human-to-human charisma and encourage unhealthy dependencies.
As I watched my own Express-2 avatar, I imagined friendly but uncanny conversations with a version of myself that will never have my lived experiences. It might be an excellent presenter, a tireless trainer or an endlessly patient tutor, and only my closest circle would reliably be able to tell that it isn’t really me.