
Run a Self-Hosted LLM in Colab with Ollama, REST Streaming and a Gradio Chat

A practical Colab walkthrough to install Ollama, pull lightweight models, stream token outputs via /api/chat, and build a Gradio chat interface for interactive multi-turn testing.

Setup and dependencies

This tutorial shows how to run a self-hosted LLM pipeline in Google Colab using Ollama, the Ollama REST API, and a Gradio-based chat UI. The flow covers installing Ollama on the VM, launching the Ollama server, pulling a lightweight model suitable for CPU-only Colab, streaming token-level outputs via the /api/chat endpoint, and wrapping everything with a Gradio interface for interactive multi-turn chat.

Install Ollama and prepare the environment

Begin by installing Ollama if it isn't present and ensure Gradio is available. The snippet below demonstrates a helper to run shell commands, install Ollama using the official script, and install Gradio when missing.

import os, sys, subprocess, time, json, requests, textwrap
from pathlib import Path
 
 
def sh(cmd, check=True):
   """Run a shell command, stream output."""
   p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
   for line in p.stdout:
       print(line, end="")
   p.wait()
   if check and p.returncode != 0:
       raise RuntimeError(f"Command failed: {cmd}")
 
 
if not Path("/usr/local/bin/ollama").exists() and not Path("/usr/bin/ollama").exists():
   print(" Installing Ollama ...")
   sh("curl -fsSL https://ollama.com/install.sh | sh")
else:
   print(" Ollama already installed.")
 
 
try:
   import gradio 
except Exception:
   print(" Installing Gradio ...")
   sh("pip -q install gradio==4.44.0")

This prepares Colab for running the chat UI without relying on Docker. Gradio is installed only when missing to keep startup idempotent.
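
As a quick sanity check, you can confirm the binary is actually on the PATH before moving on. A minimal sketch that reuses the sh helper defined above, with check=False so a missing binary prints the error instead of stopping the notebook:

# Optional: verify the Ollama CLI responds before starting the server.
sh("ollama --version", check=False)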

Launch the Ollama server and wait for readiness

Start the Ollama server as a background process and poll its tags endpoint until it responds. This ensures the server is up before you issue API requests.

def start_ollama():
   try:
       requests.get("http://127.0.0.1:11434/api/tags", timeout=1)
       print(" Ollama server already running.")
       return None
   except Exception:
       pass
   print(" Starting Ollama server ...")
   proc = subprocess.Popen(["ollama", "serve"], stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
   for _ in range(60):
       time.sleep(1)
       try:
           r = requests.get("http://127.0.0.1:11434/api/tags", timeout=1)
           if r.ok:
               print(" Ollama server is up.")
               break
       except Exception:
           pass
   else:
       raise RuntimeError("Ollama did not start in time.")
   return proc
 
 
server_proc = start_ollama()

The function returns the server process object if it launched, or None if a server was already running.
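
When the session is finished, or you want to restart the server with different settings, you can stop the process you started. A minimal cleanup sketch, assuming server_proc is the handle returned above; stop_ollama is a hypothetical helper, not part of Ollama itself:

def stop_ollama(proc):
    """Terminate the Ollama server, but only if this notebook started it."""
    if proc is None:
        print("Server was already running; leaving it alone.")
        return
    proc.terminate()
    try:
        proc.wait(timeout=10)
    except subprocess.TimeoutExpired:
        proc.kill()
    print("Ollama server stopped.")

# Call at the end of the notebook, e.g.:
# stop_ollama(server_proc)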

Pick and pull a model

Use a lightweight model appropriate for CPU-only Colab sessions. The script checks whether the chosen model is present on the server and pulls it if missing.

MODEL = os.environ.get("OLLAMA_MODEL", "qwen2.5:0.5b-instruct")
print(f" Using model: {MODEL}")
try:
   tags = requests.get("http://127.0.0.1:11434/api/tags", timeout=5).json()
   have = any(m.get("name")==MODEL for m in tags.get("models", []))
except Exception:
   have = False
 
 
if not have:
   print(f"  Pulling model {MODEL} (first time only) ...")
   sh(f"ollama pull {MODEL}")

This automates model management so the rest of the workflow can assume the model is available locally to Ollama.
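
To see what is already cached locally, you can query the same /api/tags endpoint and print the results. A small sketch, assuming the server reports a size field (in bytes) for each model, which it normally does:

def list_local_models():
    """Print the models known to the local Ollama server with approximate sizes."""
    tags = requests.get("http://127.0.0.1:11434/api/tags", timeout=5).json()
    for m in tags.get("models", []):
        size_gb = m.get("size", 0) / 1e9
        print(f"{m.get('name', '?'):40s} {size_gb:5.2f} GB")

list_local_models()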

Build a streaming client for /api/chat

To capture incremental token output, use the REST streaming behavior of /api/chat. The function below posts a JSON payload with stream=True and yields text chunks as they arrive.

OLLAMA_URL = "http://127.0.0.1:11434/api/chat"
 
 
def ollama_chat_stream(messages, model=MODEL, temperature=0.2, num_ctx=None):
   """Yield streaming text chunks from Ollama /api/chat."""
   payload = {
       "model": model,
       "messages": messages,
       "stream": True,
       "options": {"temperature": float(temperature)}
   }
   if num_ctx:
       payload["options"]["num_ctx"] = int(num_ctx)
   with requests.post(OLLAMA_URL, json=payload, stream=True) as r:
       r.raise_for_status()
       for line in r.iter_lines():
           if not line:
               continue
           data = json.loads(line.decode("utf-8"))
           if "message" in data and "content" in data["message"]:
               yield data["message"]["content"]
           if data.get("done"):
               break

This streaming generator is the core for returning partial model outputs to a UI in real time.
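
For scripted, non-interactive calls where only the final text matters, it is handy to join the chunks into one string. A minimal convenience wrapper built on the generator above (ollama_chat_once is a hypothetical name, not an Ollama API):

def ollama_chat_once(prompt, system=None, **kwargs):
    """Collect a full streamed response into a single string."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    return "".join(ollama_chat_stream(messages, **kwargs))

# Example:
# print(ollama_chat_once("Summarize what Ollama does in one sentence."))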

Run a quick smoke test

Send a tiny prompt to verify the streaming client and model are functioning:

def smoke_test():
   print("n Smoke test:")
   sys_msg = {"role":"system","content":"You are concise. Use short bullets."}
   user_msg = {"role":"user","content":"Give 3 quick tips to sleep better."}
   out = []
   for chunk in ollama_chat_stream([sys_msg, user_msg], temperature=0.3):
       print(chunk, end="")
       out.append(chunk)
   print("n Done.n")
try:
   smoke_test()
except Exception as e:
   print(" Smoke test skipped:", e)

A successful smoke test confirms installation, server availability, and model readiness.
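
If you want to see exactly what the server sends over the wire, you can print the raw response lines instead of parsing them; each line is a standalone JSON object, and the final one carries "done": true along with generation statistics. A short inspection sketch under those assumptions:

def dump_raw_stream(prompt, max_lines=5):
    """Print the first few raw NDJSON lines returned by /api/chat."""
    payload = {"model": MODEL, "messages": [{"role": "user", "content": prompt}], "stream": True}
    with requests.post(OLLAMA_URL, json=payload, stream=True) as r:
        r.raise_for_status()
        shown = 0
        for line in r.iter_lines():
            if not line:
                continue
            print(line.decode("utf-8"))
            shown += 1
            if shown >= max_lines:
                break

# dump_raw_stream("Say hi.")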

Build a Gradio chat UI with streaming responses

Layer a Gradio interface on top of the streaming client. The UI keeps multi-turn history, converts history into the Ollama message format, streams model output into the chat element, and exposes sliders for temperature and context size.

import gradio as gr
 
 
SYSTEM_PROMPT = "You are a helpful, crisp assistant. Prefer bullets when helpful."
 
 
def chat_fn(message, history, temperature, num_ctx):
   msgs = [{"role":"system","content":SYSTEM_PROMPT}]
   for u, a in history:
       if u: msgs.append({"role":"user","content":u})
       if a: msgs.append({"role":"assistant","content":a})
   msgs.append({"role":"user","content": message})
   acc = ""
   try:
       for part in ollama_chat_stream(msgs, model=MODEL, temperature=temperature, num_ctx=num_ctx or None):
           acc += part
           yield acc
   except Exception as e:
       yield f" Error: {e}"
 
 
with gr.Blocks(title="Ollama Chat (Colab)", fill_height=True) as demo:
   gr.Markdown("#  Ollama Chat (Colab)nSmall local-ish LLM via Ollama + Gradio.n")
   with gr.Row():
       temp = gr.Slider(0.0, 1.0, value=0.3, step=0.1, label="Temperature")
       num_ctx = gr.Slider(512, 8192, value=2048, step=256, label="Context Tokens (num_ctx)")
   chat = gr.Chatbot(height=460)
   msg = gr.Textbox(label="Your message", placeholder="Ask anything…", lines=3)
   clear = gr.Button("Clear")
 
 
   def user_send(m, h):
       m = (m or "").strip()
       if not m: return "", h
       return "", h + [[m, None]]
 
 
    def bot_reply(h, temperature, num_ctx):
        # Skip generation when there is no pending user turn (e.g. a blank submission).
        if not h or h[-1][1] is not None:
            yield h
            return
        u = h[-1][0]
        stream = chat_fn(u, h[:-1], temperature, int(num_ctx))
        for partial in stream:
            h[-1][1] = partial
            yield h
 
 
    msg.submit(user_send, [msg, chat], [msg, chat]).then(
        bot_reply, [chat, temp, num_ctx], [chat]
    )
   clear.click(lambda: None, None, chat)
 
 
print(" Launching Gradio ...")
demo.launch(share=True)

The Gradio UI enables iterative testing of prompts, dynamic parameter tuning, and real-time display of streaming replies.
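
If the public share tunnel is blocked or undesirable, Colab can usually proxy the local port instead. A hedged alternative sketch to use in place of the share=True launch above, assuming Gradio binds to port 7860 and that the google.colab.output helpers are available in your runtime:

# Alternative to share=True: bind Gradio locally and let Colab proxy the port.
demo.launch(share=False, server_port=7860)

from google.colab import output
output.serve_kernel_port_as_window(7860)  # opens the proxied UI in a new browser tab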

How the pieces fit

  • Installation prepares the notebook runtime and ensures dependencies.
  • The Ollama server provides the REST API and model hosting locally in the VM.
  • The streaming /api/chat client yields incremental tokens so the UI can update live.
  • Gradio wraps this client into an interactive multi-turn chat with parameter controls.

This approach reproduces a self-hosted LLM workflow in Colab where Docker and GPU-backed images may be impractical, enabling fast experimentation with multiple lightweight LLMs.
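
Because the model name is read from the OLLAMA_MODEL environment variable, trying another small model is a one-line change before re-running the model-selection cell. The tags below are illustrative and should be checked against the Ollama model library rather than taken as guaranteed names:

import os

# Pick a different lightweight model, then re-run the cell that sets MODEL and pulls it.
os.environ["OLLAMA_MODEL"] = "llama3.2:1b"  # e.g. "gemma2:2b" or "phi3:mini" are other small options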
