Build and Tune AI Agents in Colab with Microsoft Agent-Lightning

What you’ll run in Colab

This walkthrough shows how to set up an advanced AI agent pipeline using Microsoft Agent-Lightning entirely inside Google Colab. You can run server and client components in a single runtime, define a small QA agent, push tasks to the server, update shared system prompts, and evaluate which prompt performs best.

Install and import required packages

Run the following in Colab to install dependencies and configure the OpenAI key and model:

!pip -q install agentlightning openai nest_asyncio python-dotenv > /dev/null
import os, threading, time, asyncio, nest_asyncio, random
from getpass import getpass
from agentlightning.litagent import LitAgent
from agentlightning.trainer import Trainer
from agentlightning.server import AgentLightningServer
from agentlightning.types import PromptTemplate
import openai
if not os.getenv("OPENAI_API_KEY"):
    try:
        os.environ["OPENAI_API_KEY"] = getpass("Enter OPENAI_API_KEY (leave blank if using a local/proxy base): ") or ""
    except Exception:
        pass
MODEL = os.getenv("MODEL", "gpt-4o-mini")
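Before defining the agent, it can help to confirm that the key and model actually respond. The quick check below is optional and assumes the standard OpenAI Chat Completions endpoint (adjust if you point openai at a local or proxy base URL):

# Optional sanity check: one short completion to confirm the key and model work.
try:
    r = openai.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": "Reply with the single word: ready"}],
        temperature=0.0,
    )
    print("Model reply:", r.choices[0].message.content.strip())
except Exception as e:
    print("API check failed:", e)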

Define a simple QA agent

Create a LitAgent subclass whose training rollout sends the user prompt to the configured LLM under the server-provided system prompt and scores the answer. The reward combines an exact-match check, token overlap, and a brevity bonus:

class QAAgent(LitAgent):
    def training_rollout(self, task, rollout_id, resources):
        """Given a task {'prompt':..., 'answer':...}, ask the LLM using the server-provided system prompt and return a reward in [0, 1]."""
        sys_prompt = resources["system_prompt"].template
        user = task["prompt"]
        gold = task.get("answer", "").strip().lower()
        try:
            r = openai.chat.completions.create(
                model=MODEL,
                messages=[{"role": "system", "content": sys_prompt},
                          {"role": "user", "content": user}],
                temperature=0.2,
            )
            pred = r.choices[0].message.content.strip()
        except Exception as e:
            pred = f"[error]{e}"

        def score(pred, gold):
            P = pred.lower()
            # Exact-match component: 1.0 if the gold answer appears in the prediction.
            base = 1.0 if gold and gold in P else 0.0
            # Token-overlap component: Dice coefficient between gold and predicted token sets.
            gt, pr = set(gold.split()), set(P.split())
            inter = len(gt & pr)
            denom = (len(gt) + len(pr)) or 1
            overlap = 2 * inter / denom
            # Brevity bonus: reward short answers, but only when the match is exact.
            brevity = 0.2 if base == 1.0 and len(P.split()) <= 8 else 0.0
            return max(0.0, min(1.0, 0.7 * base + 0.25 * overlap + brevity))

        return float(score(pred, gold))
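Before involving the server, you can smoke-test the rollout by handing it a resources dict shaped like the one the server will later provide. This is a minimal sketch, assuming QAAgent needs no constructor arguments (it is instantiated the same way further down) and that only the system_prompt entry is read:

# Optional local smoke test: call the rollout directly with a mock resources dict.
# Only resources["system_prompt"] is read; the rollout_id value is arbitrary here.
mock_resources = {
    "system_prompt": PromptTemplate(template="Answer with only the final fact.", engine="f-string"),
}
sample_task = {"prompt": "Capital of France?", "answer": "Paris"}
print("sample reward:", QAAgent().training_rollout(sample_task, "local-test", mock_resources))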

Define tasks and candidate prompts

Prepare a small benchmark and several candidate system prompts to test. Apply nest_asyncio so the async server loop and the synchronous client thread can coexist inside Colab's already-running event loop:

TASKS = [
    {"prompt": "Capital of France?", "answer": "Paris"},
    {"prompt": "Who wrote Pride and Prejudice?", "answer": "Jane Austen"},
    {"prompt": "2+2 = ?", "answer": "4"},
]
PROMPTS = [
    "You are a terse expert. Answer with only the final fact, no sentences.",
    "You are a helpful, knowledgeable AI. Prefer concise, correct answers.",
    "Answer as a rigorous evaluator; return only the canonical fact.",
    "Be a friendly tutor. Give the one-word answer if obvious.",
]
nest_asyncio.apply()
HOST, PORT = "127.0.0.1", 9997
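nest_asyncio is needed because Colab's kernel already runs an event loop, so a bare asyncio.run() would normally fail with "RuntimeError: asyncio.run() cannot be called from a running event loop". A quick optional check that the patch took effect:

# With nest_asyncio applied, asyncio.run() can be nested inside Colab's running loop.
async def _loop_check():
    await asyncio.sleep(0)
    return "event loop OK"

print(asyncio.run(_loop_check()))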

Run the server and evaluate prompts

Start the Agent-Lightning server, iterate over the candidate prompts by updating the shared system_prompt resource, queue the training tasks, and poll for completed rollouts. Average the rewards per prompt to pick the best one:

async def run_server_and_search():
    server = AgentLightningServer(host=HOST, port=PORT)
    await server.start()
    print("Server started")
    await asyncio.sleep(1.5)  # give the client workers a moment to connect
    results = []
    for sp in PROMPTS:
        # Publish the candidate system prompt as a shared resource for all workers.
        await server.update_resources({"system_prompt": PromptTemplate(template=sp, engine="f-string")})
        scores = []
        for t in TASKS:
            tid = await server.queue_task(sample=t, mode="train")
            rollout = await server.poll_completed_rollout(tid, timeout=40)  # waits for a worker
            if rollout is None:
                print("Timeout waiting for rollout; continuing...")
                continue
            scores.append(float(getattr(rollout, "final_reward", 0.0)))
        avg = sum(scores) / len(scores) if scores else 0.0
        print(f"Prompt avg: {avg:.3f}  |  {sp}")
        results.append((sp, avg))
    best = max(results, key=lambda x: x[1]) if results else ("<none>", 0)
    print("\nBEST PROMPT:", best[0], "| score:", f"{best[1]:.3f}")
    await server.stop()

Run the client and orchestrate training

Launch the agent client in a background thread with multiple workers so it can poll the server and process tasks in parallel. Then run the server search loop:

def run_client_in_thread():
    agent = QAAgent()
    trainer = Trainer(n_workers=2)  # two parallel workers polling the server
    trainer.fit(agent, backend=f"http://{HOST}:{PORT}")

client_thr = threading.Thread(target=run_client_in_thread, daemon=True)
client_thr.start()
asyncio.run(run_server_and_search())
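Because the client runs in a daemon thread, it will not keep the notebook alive once the search coroutine returns and the server stops. If you want to give the workers a moment to wind down before moving on, a short join is enough:

# Optional: wait briefly for background workers, then report the thread's state.
client_thr.join(timeout=5)
print("client thread still alive:", client_thr.is_alive())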

How it works and what to try

Agent-Lightning separates orchestration (the server) from workers (the clients). The server stores shared resources such as the system prompt and queues tasks; clients run rollouts and return rewards. In this pattern you can swap the shared system prompt, queue new training tasks, scale the number of client workers, and adjust the reward function without restarting the runtime.

This setup fits well for prompt search, automated evaluation loops, and iterative agent development in a single Colab environment.
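A natural next experiment is to widen the search: add benchmark tasks and candidate prompts, then repeat the run. The sketch below assumes the previous server has released port 9997 after server.stop() and simply restarts the same pattern with a fresh worker thread; the added task and prompt are illustrative.

# Widen the search with an extra task and candidate prompt, then repeat the run.
TASKS.append({"prompt": "Largest planet in the solar system?", "answer": "Jupiter"})
PROMPTS.append("Respond with the shortest correct answer possible, nothing else.")

new_client = threading.Thread(target=run_client_in_thread, daemon=True)
new_client.start()
asyncio.run(run_server_and_search())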