Run a Self-Hosted LLM in Colab with Ollama, REST Streaming and a Gradio Chat
A practical Colab walkthrough to install Ollama, pull lightweight models, stream token outputs via /api/chat, and build a Gradio chat interface for interactive multi-turn testing.
Setup and dependencies
This tutorial shows how to run a self-hosted LLM pipeline in Google Colab using Ollama, the Ollama REST API, and a Gradio-based chat UI. The flow covers installing Ollama on the VM, launching the Ollama server, pulling a lightweight model suitable for CPU-only Colab, streaming token-level outputs via the /api/chat endpoint, and wrapping everything with a Gradio interface for interactive multi-turn chat.
Install Ollama and prepare the environment
Begin by installing Ollama if it isn't present and ensure Gradio is available. The snippet below demonstrates a helper to run shell commands, install Ollama using the official script, and install Gradio when missing.
import os, sys, subprocess, time, json, requests, textwrap
from pathlib import Path

def sh(cmd, check=True):
    """Run a shell command, stream output."""
    p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
    for line in p.stdout:
        print(line, end="")
    p.wait()
    if check and p.returncode != 0:
        raise RuntimeError(f"Command failed: {cmd}")

if not Path("/usr/local/bin/ollama").exists() and not Path("/usr/bin/ollama").exists():
    print("Installing Ollama ...")
    sh("curl -fsSL https://ollama.com/install.sh | sh")
else:
    print("Ollama already installed.")

try:
    import gradio
except Exception:
    print("Installing Gradio ...")
    sh("pip -q install gradio==4.44.0")

This prepares Colab for running the chat UI without relying on Docker. Gradio is installed only when missing to keep startup idempotent.
Launch the Ollama server and wait for readiness
Start the Ollama server as a background process and poll its tags endpoint until it responds. This ensures the server is up before you issue API requests.
def start_ollama():
    try:
        requests.get("http://127.0.0.1:11434/api/tags", timeout=1)
        print("Ollama server already running.")
        return None
    except Exception:
        pass
    print("Starting Ollama server ...")
    proc = subprocess.Popen(["ollama", "serve"], stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
    for _ in range(60):
        time.sleep(1)
        try:
            r = requests.get("http://127.0.0.1:11434/api/tags", timeout=1)
            if r.ok:
                print("Ollama server is up.")
                break
        except Exception:
            pass
    else:
        raise RuntimeError("Ollama did not start in time.")
    return proc

server_proc = start_ollama()

The function returns the server process object if it launched, or None if a server was already running.
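As an extra readiness check, you can also ask the server for its version. The snippet below is a small optional sketch; it assumes the default port 11434 and Ollama's GET /api/version route.

# Optional: report the running server version once it is reachable.
try:
    v = requests.get("http://127.0.0.1:11434/api/version", timeout=2).json()
    print("Ollama version:", v.get("version"))
except Exception as e:
    print("Version check skipped:", e)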
Pick and pull a model
Use a lightweight model appropriate for CPU-only Colab sessions. The script checks whether the chosen model is present on the server and pulls it if missing.
MODEL = os.environ.get("OLLAMA_MODEL", "qwen2.5:0.5b-instruct")
print(f"Using model: {MODEL}")

try:
    tags = requests.get("http://127.0.0.1:11434/api/tags", timeout=5).json()
    have = any(m.get("name") == MODEL for m in tags.get("models", []))
except Exception:
    have = False

if not have:
    print(f"Pulling model {MODEL} (first time only) ...")
    sh(f"ollama pull {MODEL}")

This automates model management so the rest of the workflow can assume the model is available locally to Ollama.
Build a streaming client for /api/chat
To capture incremental token output, use the REST streaming behavior of /api/chat. The function below posts a JSON payload with stream=True and yields text chunks as they arrive.
OLLAMA_URL = "http://127.0.0.1:11434/api/chat"

def ollama_chat_stream(messages, model=MODEL, temperature=0.2, num_ctx=None):
    """Yield streaming text chunks from Ollama /api/chat."""
    payload = {
        "model": model,
        "messages": messages,
        "stream": True,
        "options": {"temperature": float(temperature)}
    }
    if num_ctx:
        payload["options"]["num_ctx"] = int(num_ctx)
    with requests.post(OLLAMA_URL, json=payload, stream=True) as r:
        r.raise_for_status()
        for line in r.iter_lines():
            if not line:
                continue
            data = json.loads(line.decode("utf-8"))
            if "message" in data and "content" in data["message"]:
                yield data["message"]["content"]
            if data.get("done"):
                break

This streaming generator is the core for returning partial model outputs to a UI in real time.
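Outside a UI, the same generator can simply be drained into one string. The helper below is a minimal convenience wrapper around the function above; the name ollama_chat_text is my own and not part of Ollama.

def ollama_chat_text(messages, **kwargs):
    """Collect the streamed chunks into a single complete reply string."""
    return "".join(ollama_chat_stream(messages, **kwargs))

# Usage:
# reply = ollama_chat_text([{"role": "user", "content": "Say hello in five words."}])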
Run a quick smoke test
Send a tiny prompt to verify the streaming client and model are functioning:
def smoke_test():
    print("\nSmoke test:")
    sys_msg = {"role": "system", "content": "You are concise. Use short bullets."}
    user_msg = {"role": "user", "content": "Give 3 quick tips to sleep better."}
    out = []
    for chunk in ollama_chat_stream([sys_msg, user_msg], temperature=0.3):
        print(chunk, end="")
        out.append(chunk)
    print("\nDone.\n")

try:
    smoke_test()
except Exception as e:
    print("Smoke test skipped:", e)

A successful smoke test confirms installation, server availability, and model readiness.
Build a Gradio chat UI with streaming responses
Layer a Gradio interface on top of the streaming client. The UI keeps multi-turn history, converts history into the Ollama message format, streams model output into the chat element, and exposes sliders for temperature and context size.
import gradio as gr

SYSTEM_PROMPT = "You are a helpful, crisp assistant. Prefer bullets when helpful."

def chat_fn(message, history, temperature, num_ctx):
    # Rebuild the Ollama message list from the Gradio chat history.
    msgs = [{"role": "system", "content": SYSTEM_PROMPT}]
    for u, a in history:
        if u: msgs.append({"role": "user", "content": u})
        if a: msgs.append({"role": "assistant", "content": a})
    msgs.append({"role": "user", "content": message})
    acc = ""
    try:
        for part in ollama_chat_stream(msgs, model=MODEL, temperature=temperature, num_ctx=num_ctx or None):
            acc += part
            yield acc
    except Exception as e:
        yield f"Error: {e}"

with gr.Blocks(title="Ollama Chat (Colab)", fill_height=True) as demo:
    gr.Markdown("# Ollama Chat (Colab)\nSmall local-ish LLM via Ollama + Gradio.\n")
    with gr.Row():
        temp = gr.Slider(0.0, 1.0, value=0.3, step=0.1, label="Temperature")
        num_ctx = gr.Slider(512, 8192, value=2048, step=256, label="Context Tokens (num_ctx)")
    chat = gr.Chatbot(height=460)
    msg = gr.Textbox(label="Your message", placeholder="Ask anything…", lines=3)
    clear = gr.Button("Clear")

    def user_send(m, h):
        m = (m or "").strip()
        if not m:
            return "", h
        return "", h + [[m, None]]

    def bot_reply(h, temperature, num_ctx):
        u = h[-1][0]
        stream = chat_fn(u, h[:-1], temperature, int(num_ctx))
        acc = ""
        for partial in stream:
            acc = partial
            h[-1][1] = acc
            yield h

    msg.submit(user_send, [msg, chat], [msg, chat]).then(
        bot_reply, [chat, temp, num_ctx], [chat]
    )
    clear.click(lambda: None, None, chat)

print("Launching Gradio ...")
demo.launch(share=True)

The Gradio UI enables iterative testing of prompts, dynamic parameter tuning, and real-time display of streaming replies.
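When a session is finished, it helps to release the port and the share link. A minimal teardown sketch, assuming server_proc came from start_ollama() above (it is None when a server was already running):

# Stop the Gradio app and the background Ollama server when done.
demo.close()
if server_proc is not None:
    server_proc.terminate()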
How the pieces fit
- Installation prepares the notebook runtime and ensures dependencies.
- The Ollama server provides the REST API and model hosting locally in the VM.
- The streaming /api/chat client yields incremental tokens so the UI can update live.
- Gradio wraps this client into an interactive multi-turn chat with parameter controls.
This approach reproduces a self-hosted LLM workflow in Colab where Docker and GPU-backed images may be impractical, enabling fast experimentation with multiple lightweight LLMs.