Build a Real-Time Voice AI Agent with Hugging Face Pipelines (Whisper + FLAN-T5 + Bark)

Overview

This tutorial demonstrates how to assemble a compact end-to-end voice AI agent using freely available Hugging Face models. The pipeline is simple enough to run in Google Colab and avoids external API keys or heavy dependencies. The agent listens, transcribes audio, reasons with an LLM, and returns natural-sounding speech.

Setup and pipelines

Install the required libraries and initialize three Hugging Face pipelines: Whisper for automatic speech recognition (ASR), FLAN-T5 as a lightweight reasoning LLM, and Bark for text-to-speech (TTS). The Colab cells below install the dependencies and load all three models.

!pip -q install "transformers>=4.42.0" accelerate torchaudio sentencepiece gradio soundfile


import os, torch, tempfile, numpy as np
import gradio as gr
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM


DEVICE = 0 if torch.cuda.is_available() else -1


asr = pipeline(
   "automatic-speech-recognition",
   model="openai/whisper-small.en",
   device=DEVICE,
   chunk_length_s=30,
   return_timestamps=False
)


LLM_MODEL = "google/flan-t5-base"
tok = AutoTokenizer.from_pretrained(LLM_MODEL)
llm = AutoModelForSeq2SeqLM.from_pretrained(LLM_MODEL, device_map="auto")


tts = pipeline("text-to-speech", model="suno/bark-small")

The cell sets DEVICE automatically, so Whisper runs on the GPU when one is available and FLAN-T5 is placed by accelerate via device_map="auto"; Bark stays on the CPU unless you also pass device=DEVICE to its pipeline. No API keys are required; everything runs locally (or in Colab) using models hosted on the Hugging Face Hub.
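If you want to confirm the placement, a quick check like the following (an optional addition, not part of the original cell) prints where the ASR pipeline and the LLM ended up.

# Optional sanity check: report the device for each stage loaded above.
print("ASR device:", "cuda:0" if DEVICE == 0 else "cpu")
print("LLM device:", llm.device)   # chosen by accelerate through device_map="auto"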

Prompting and dialog formatting

A short system prompt guides the agent to give concise, structured replies. The helper function format_dialog composes the prompt by merging past turns and the current user message into a single instruction for the LLM.

SYSTEM_PROMPT = (
   "You are a helpful, concise voice assistant. "
   "Prefer direct, structured answers. "
   "If the user asks for steps or code, use short bullet points."
)


def format_dialog(history, user_text):
   turns = []
   for u, a in history:
       if u: turns.append(f"User: {u}")
       if a: turns.append(f"Assistant: {a}")
   turns.append(f"User: {user_text}")
   prompt = (
       "Instruction:\n"
       f"{SYSTEM_PROMPT}\n\n"
       "Dialog so far:\n" + "\n".join(turns) + "\n\n"
       "Assistant:"
   )
   return prompt
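
To see what the LLM actually receives, here is a made-up call with one prior turn; the history pair and the question are purely illustrative.

# Illustration only: one earlier (user, assistant) pair plus a new question.
example_prompt = format_dialog(
    [("What is Hugging Face?", "A hub for open-source models and datasets.")],
    "How do I load a pipeline?",
)
print(example_prompt)   # system prompt, the dialog turns, then a trailing "Assistant:" cue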

Core processing functions

Three functions implement the pipeline stages: transcribe uses Whisper for ASR, generate_reply uses FLAN-T5 to produce a context-aware response, and synthesize_speech uses Bark to render the reply as audio.

def transcribe(filepath):
   out = asr(filepath)
   text = out["text"].strip()
   return text


def generate_reply(history, user_text, max_new_tokens=256):
   prompt = format_dialog(history, user_text)
   inputs = tok(prompt, return_tensors="pt", truncation=True).to(llm.device)
   with torch.no_grad():
       ids = llm.generate(
           **inputs,
           max_new_tokens=max_new_tokens,
           temperature=0.7,
           do_sample=True,
           top_p=0.9,
           repetition_penalty=1.05,
       )
   reply = tok.decode(ids[0], skip_special_tokens=True).strip()
   return reply


def synthesize_speech(text):
   out = tts(text)
   audio = np.asarray(out["audio"], dtype=np.float32)
   # Bark returns a (1, n_samples) array; flatten it so Gradio plays it as mono audio.
   if audio.ndim > 1:
       audio = audio.squeeze()
   sr = out["sampling_rate"]
   return (sr, audio)
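
Before wiring up the interface, you can smoke-test the reply and speech stages with a text-only call; the snippet below is an optional check added here, and the question is arbitrary.

# Text-only smoke test: skip ASR, generate a reply, then synthesize it.
test_reply = generate_reply([], "Give me three tips for recording clean voice notes.")
print(test_reply)
sr, wav = synthesize_speech(test_reply)
print(f"Synthesized {len(wav) / sr:.1f} s of audio at {sr} Hz")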

Interaction helpers and state management

The code includes helpers to clear history, handle voice-to-voice and text-to-voice flows, and export conversations to a text file.

def clear_history():
   return [], []


def voice_to_voice(mic_file, history):
   history = history or []
   if not mic_file:
       return history, None, "Please record something!"
   try:
       user_text = transcribe(mic_file)
   except Exception as e:
       return history, None, f"ASR error: {e}"


   if not user_text:
       return history, None, "Didn't catch that. Try again?"


   try:
       reply = generate_reply(history, user_text)
   except Exception as e:
       return history, None, f"LLM error: {e}"


   try:
       sr, wav = synthesize_speech(reply)
   except Exception as e:
       return history + [(user_text, reply)], None, f"TTS error: {e}"


   return history + [(user_text, reply)], (sr, wav), f"User: {user_text}\nAssistant: {reply}"


def text_to_voice(user_text, history):
   history = history or []
   user_text = (user_text or "").strip()
   if not user_text:
       return history, None, "Type a message first."
   try:
       reply = generate_reply(history, user_text)
       sr, wav = synthesize_speech(reply)
   except Exception as e:
       return history, None, f"Error: {e}"
   return history + [(user_text, reply)], (sr, wav), f"User: {user_text}\nAssistant: {reply}"


def export_chat(history):
   lines = []
   for u, a in history or []:
       lines += [f"User: {u}", f"Assistant: {a}", ""]
   text = "\n".join(lines).strip() or "No conversation yet."
   with tempfile.NamedTemporaryFile(delete=False, suffix=".txt", mode="w") as f:
       f.write(text)
       path = f.name
   return path
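
The same helpers work outside the UI on plain Python data; the single turn below is a made-up example of driving the text path and exporting the result.

# Illustration: run one text turn, then write the history to a temp .txt file.
history, audio, transcript_text = text_to_voice("What can you do?", [])
print(transcript_text)
print("Saved to:", export_chat(history))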

Gradio interface and wiring

A compact Gradio Blocks app provides a microphone recorder, text input, chat display, transcript box, audio output, and buttons for actions. The UI wires callbacks to the functions above so users can speak or type and hear the agent respond.

with gr.Blocks(title="Advanced Voice AI Agent (HF Pipelines)") as demo:
   gr.Markdown(
       "##  Advanced Voice AI Agent (Hugging Face Pipelines Only)\n"
       "- **ASR**: openai/whisper-small.en\n"
       "- **LLM**: google/flan-t5-base\n"
       "- **TTS**: suno/bark-small\n"
       "Speak or type; the agent replies with voice + text."
   )


   with gr.Row():
       with gr.Column(scale=1):
           mic = gr.Audio(sources=["microphone"], type="filepath", label="Record")
           say_btn = gr.Button("Speak")
           text_in = gr.Textbox(label="Or type instead", placeholder="Ask me anything…")
           text_btn = gr.Button("Send")
           export_btn = gr.Button("Export Chat (.txt)")
           reset_btn = gr.Button("Reset")
       with gr.Column(scale=1):
           audio_out = gr.Audio(label="Assistant Voice", autoplay=True)
           transcript = gr.Textbox(label="Transcript", lines=6)
           chat = gr.Chatbot(height=360)
   state = gr.State([])


   def update_chat(history):
       return [(u, a) for u, a in (history or [])]


   say_btn.click(voice_to_voice, [mic, state], [state, audio_out, transcript]).then(
       update_chat, inputs=state, outputs=chat
   )
   text_btn.click(text_to_voice, [text_in, state], [state, audio_out, transcript]).then(
       update_chat, inputs=state, outputs=chat
   )
   reset_btn.click(clear_history, None, [chat, state])
   export_btn.click(export_chat, state, gr.File(label="Download chat.txt"))


demo.launch(debug=False)
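
demo.launch() renders the app inline in Colab; if you want a link you can open from another device, Gradio's standard share flag (shown below as an alternative to the line above) creates a temporary public URL.

# Alternative launch: also expose a temporary public *.gradio.live URL.
demo.launch(share=True)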

How it works in practice

When you press Speak, the recorded clip is transcribed by Whisper, folded into the running dialog by format_dialog, and answered by FLAN-T5; Bark then renders the reply as audio, which autoplays while the transcript box and chat window show both turns as text. Pressing Send follows the same path but skips the ASR step, and Export Chat writes the accumulated conversation to a downloadable .txt file.

Extending the demo

This foundation lets you replace models with larger alternatives, add multilingual support, or integrate custom business logic. The modular pipeline design keeps components interchangeable and easy to experiment with in Colab.
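
As a concrete starting point, switching to the multilingual Whisper checkpoint and a larger FLAN-T5 variant only changes the model names; the snippet below is a sketch using real Hub checkpoints, but whether they fit your Colab GPU is something to verify yourself.

# Sketch: drop-in larger / multilingual variants, loaded exactly like the originals.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",      # multilingual Whisper instead of the English-only .en checkpoint
    device=DEVICE,
    chunk_length_s=30,
)
LLM_MODEL = "google/flan-t5-large"     # larger seq2seq model, same tokenizer/generate API
tok = AutoTokenizer.from_pretrained(LLM_MODEL)
llm = AutoModelForSeq2SeqLM.from_pretrained(LLM_MODEL, device_map="auto")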