NeuTTS Air: 748M On-Device TTS With Instant 3‑Second Voice Cloning

Overview

Neuphonic has open-sourced NeuTTS Air, a speech language model for text-to-speech (TTS) engineered to run locally in real time on CPUs. The model card on Hugging Face reports a 748M-parameter artifact under the qwen2 architecture. NeuTTS Air ships in GGUF quantizations (Q4/Q8), enabling inference through llama.cpp / llama-cpp-python without cloud dependencies. The release is licensed under Apache-2.0 and includes runnable demos, examples, and a hosted Space.

Model architecture and runtime

NeuTTS Air couples a lightweight Qwen backbone (reported as a 0.5B-class Qwen model serving as the LM) with Neuphonic’s NeuCodec audio codec. The hosted artifact is listed as 748M parameters under the qwen2 architecture on Hugging Face. NeuCodec provides low-bitrate acoustic tokenization and decoding, targeting roughly 0.8 kbps with 24 kHz output to keep representations compact for efficient on-device use.
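To put the 0.8 kbps figure in context, raw 24 kHz 16-bit mono PCM runs at 384 kbps, so NeuCodec's tokenization amounts to roughly a 480x reduction. The numbers below are simple arithmetic on the reported figures, not measurements of the codec itself:

```python
# Raw PCM bitrate for 24 kHz, 16-bit, mono audio (the reported NeuCodec output rate).
sample_rate_hz = 24_000
bits_per_sample = 16
raw_kbps = sample_rate_hz * bits_per_sample / 1000  # 384.0 kbps

# Reported NeuCodec token bitrate.
codec_kbps = 0.8

compression_ratio = raw_kbps / codec_kbps  # 480x
print(f"raw: {raw_kbps} kbps, codec: {codec_kbps} kbps, ratio: {compression_ratio:.0f}x")

# At that rate, a 10-second style reference costs about 1 KB of codec tokens.
ten_sec_bytes = codec_kbps * 1000 * 10 / 8  # 1000 bytes
```

Representations this compact are what make it practical to keep both the reference-style tokens and the generated audio tokens inside a small LM's context on a CPU.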

The distribution uses GGUF format with Q4/Q8 quantizations and includes instructions for running via llama.cpp / llama-cpp-python plus an optional ONNX decoder path. Dependencies include espeak for phonemization, and the repository ships a Jupyter notebook demonstrating end-to-end synthesis.

Key features

On-device performance focus

Neuphonic positions NeuTTS Air for real-time generation on mid-range devices and emphasizes CPU-first defaults. While the model card does not publish explicit real-time-factor (RTF) numbers, the GGUF quantizations and provided examples demonstrate a working local inference flow without requiring a GPU. The artifact and instructions target minimal dependencies to lower friction for edge deployment.
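Since no RTF figures are published, a simple way to measure one on your own hardware is to time synthesis against the duration of the audio produced. The `synthesize` callable below is a hypothetical stand-in for whatever inference entry point your local pipeline exposes:

```python
import time

def real_time_factor(synthesize, text: str, sample_rate_hz: int = 24_000) -> float:
    """Return RTF = wall-clock synthesis time / duration of generated audio.

    RTF < 1.0 means faster than real time. `synthesize` is a hypothetical
    callable returning a sequence of PCM samples at `sample_rate_hz`.
    """
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    audio_seconds = len(samples) / sample_rate_hz
    return elapsed / audio_seconds

# Demo with a dummy synthesizer that "generates" one second of silence.
dummy = lambda text: [0] * 24_000
rtf = real_time_factor(dummy, "hello world")
print(f"RTF: {rtf:.4f}")
```

Reporting RTF alongside CPU model and quantization level (Q4 vs Q8) is the measurement that would let readers compare NeuTTS Air against other local pipelines.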

Voice cloning workflow

NeuTTS Air requires two inputs for cloning: (1) a reference WAV and (2) the transcript text for that reference. The system encodes the reference into style tokens and then synthesizes arbitrary text in the reference speaker’s timbre. Neuphonic recommends 3–15 seconds of clean, mono audio and supplies pre-encoded samples to simplify experimentation.
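A quick pre-flight check on the reference clip can catch the most common cloning mistakes (stereo input, clips outside the recommended window) before encoding. The sketch below uses only the stdlib `wave` module; the thresholds mirror Neuphonic's 3–15 second guidance, and the function name is ours:

```python
import wave

def check_reference(path: str, min_s: float = 3.0, max_s: float = 15.0) -> list[str]:
    """Return a list of problems with a cloning reference WAV (empty list = OK)."""
    problems = []
    with wave.open(path, "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()
        if wav.getnchannels() != 1:
            problems.append(f"expected mono, got {wav.getnchannels()} channels")
        if not (min_s <= duration <= max_s):
            problems.append(f"duration {duration:.1f}s outside {min_s}-{max_s}s window")
    return problems
```

Cleanliness (background noise, reverb) still has to be judged by ear or with a separate tool, but format and length checks like these are cheap to automate.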

Privacy, watermarking, and licensing

The project is framed for privacy-focused, on-device use — no audio or text needs to leave the machine unless the user opts in. Generated audio includes a Perth (Perceptual Threshold) watermarker to help with provenance and responsible deployment. The code and models are distributed under an Apache-2.0 license, which is permissive for many deployment scenarios.

How NeuTTS Air compares

Open local TTS systems and GGUF-based pipelines already exist, but NeuTTS Air stands out for packaging a small LM plus a neural codec with instant cloning, CPU-first quantizations, watermarking, and permissive licensing. The vendor calls it the ‘world’s first super-realistic, on-device speech LM’; the verifiable facts are its size, formats, cloning procedure, runtime paths, and included tooling.

Practical notes and next steps

From a systems perspective, a sub-1B Qwen-class backbone with GGUF quantization paired with NeuCodec at 0.8 kbps/24 kHz is a pragmatic recipe for real-time CPU-only TTS that preserves timbre from short style references. Publishing concrete latency and cloning-quality benchmarks across commodity CPUs and varying reference lengths would help objectively compare NeuTTS Air to other local TTS pipelines. For now, the repository, model card, and examples make it straightforward to try local synthesis and cloning with minimal dependencies.

Try it

See the model card on Hugging Face and the GitHub repository for demos, notebooks, and usage examples. The release provides runnable examples and a hosted Space to experiment with voice cloning and on-device inference.