Creating a Multi-Agent Incident Response System with OpenAI
Learn to build a practical multi-agent system for incident handling using OpenAI Swarm.
Overview
In this tutorial, we build an advanced yet practical multi-agent system with OpenAI Swarm, running entirely in Google Colab. We demonstrate how to orchestrate specialized agents, such as a triage agent, an SRE agent, a communications agent, and a critic, to collaboratively handle a real-world production incident scenario.
Agent Handoffs and Tool Integration
By structuring agent handoffs, integrating lightweight tools for knowledge retrieval and decision ranking, and keeping the implementation clean and modular, we show how Swarm enables us to design controllable, agentic workflows without heavy frameworks or complex infrastructure.
Setting Up the Environment
We set up the environment and securely load the OpenAI API key so the notebook can run safely in Google Colab. We ensure the key is fetched from Colab secrets when available and fall back to a hidden prompt otherwise. This keeps authentication simple and reusable across sessions.
!pip -q install -U openai
!pip -q install -U "git+https://github.com/openai/swarm.git"
import os
def load_openai_key():
    try:
        from google.colab import userdata
        key = userdata.get("OPENAI_API_KEY")
    except Exception:
        key = None
    if not key:
        import getpass
        key = getpass.getpass("Enter OPENAI_API_KEY (hidden): ").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY not provided")
    return key

os.environ["OPENAI_API_KEY"] = load_openai_key()
Initializing the Swarm Client
We import the core Python utilities and initialize the Swarm client that orchestrates all agent interactions. This snippet establishes the runtime backbone that allows agents to communicate, hand off tasks, and execute tool calls.
import json
import re
from typing import List, Dict
from swarm import Swarm, Agent
client = Swarm()
Creating the Knowledge Base
We define a lightweight internal knowledge base and implement a retrieval function to surface relevant context during agent reasoning.
KB_DOCS = [
    {
        "id": "kb-incident-001",
        "title": "API Latency Incident Playbook",
        "text": "If p95 latency spikes, validate deploys, dependencies, and error rates. Rollback, cache, rate-limit, scale. Compare p50 vs p99 and inspect upstream timeouts."
    },
    {
        "id": "kb-risk-001",
        "title": "Risk Communication Guidelines",
        "text": "Updates must include impact, scope, mitigation, owner, and next update. Avoid blame and separate internal vs external messaging."
    },
    {
        "id": "kb-ops-001",
        "title": "On-call Handoff Template",
        "text": "Include summary, timeline, current status, mitigations, open questions, next actions, and owners."
    },
]
def _normalize(s: str) -> List[str]:
    return re.sub(r"[^a-z0-9\s]", " ", s.lower()).split()

def search_kb(query: str, top_k: int = 3) -> str:
    # Score each document by token overlap between the query and its title + text.
    q = set(_normalize(query))
    scored = []
    for d in KB_DOCS:
        score = len(q.intersection(set(_normalize(d["title"] + " " + d["text"]))))
        scored.append((score, d))
    scored.sort(key=lambda x: x[0], reverse=True)
    # Keep the top matches; fall back to the best-scored doc if nothing overlaps.
    docs = [d for s, d in scored[:top_k] if s > 0] or [scored[0][1]]
    return json.dumps(docs, indent=2)
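As a quick sanity check of the retriever, we can run a query against the knowledge base (the query text here is just an illustration):

# Illustrative query; the latency playbook should score highest.
print(search_kb("p95 latency spike after a deploy"))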
Evaluating Mitigation Strategies
We introduce a structured tool that evaluates and ranks mitigation strategies based on confidence and risk.
def estimate_mitigation_impact(options_json: str) -> str:
    try:
        options = json.loads(options_json)
    except Exception as e:
        return json.dumps({"error": str(e)})
    ranking = []
    for o in options:
        conf = float(o.get("confidence", 0.5))
        risk = o.get("risk", "medium")
        # Higher risk levels subtract a larger fixed penalty from the confidence.
        penalty = {"low": 0.1, "medium": 0.25, "high": 0.45}.get(risk, 0.25)
        ranking.append({
            "option": o.get("option"),
            "confidence": conf,
            "risk": risk,
            "score": round(conf - penalty, 3)
        })
    ranking.sort(key=lambda x: x["score"], reverse=True)
    return json.dumps(ranking, indent=2)
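To see the ranking in action, we can pass it a small, made-up set of mitigation options:

# Hypothetical mitigation options for a latency incident.
options = json.dumps([
    {"option": "Rollback last deploy", "confidence": 0.8, "risk": "low"},
    {"option": "Scale out API pods", "confidence": 0.6, "risk": "medium"},
    {"option": "Disable new feature flag", "confidence": 0.7, "risk": "high"},
])
print(estimate_mitigation_impact(options))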
Managing Agent Handoffs
We define explicit handoff functions that let one agent transfer control to another. In Swarm, a tool function that returns an Agent triggers a handoff, so the returned agent takes over the conversation from that point.
def handoff_to_sre():
    return sre_agent

def handoff_to_comms():
    return comms_agent

def handoff_to_handoff_writer():
    return handoff_writer_agent

def handoff_to_critic():
    return critic_agent
Configuring Specialized Agents
We configure multiple specialized agents, each with a clearly scoped responsibility and instruction set.
triage_agent = Agent(
    name="Triage",
    model="gpt-4o-mini",
    instructions="""
    Decide which agent should handle the request.
    Use SRE for incident response.
    Use Comms for customer or executive messaging.
    Use HandoffWriter for on-call notes.
    Use Critic for review or improvement.
    """,
    functions=[search_kb, handoff_to_sre, handoff_to_comms, handoff_to_handoff_writer, handoff_to_critic],
)
# Other agents definition...
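The notebook elides the remaining agent definitions. As a rough sketch, assuming the agent names implied by the handoff functions above, they might be configured along these lines (the instruction text and tool lists are illustrative, not the original's):

# Illustrative sketches only: instructions and tool lists are assumptions.
sre_agent = Agent(
    name="SRE",
    model="gpt-4o-mini",
    instructions="Diagnose the incident, consult the knowledge base, and propose and rank mitigations.",
    functions=[search_kb, estimate_mitigation_impact, handoff_to_comms, handoff_to_handoff_writer],
)

comms_agent = Agent(
    name="Comms",
    model="gpt-4o-mini",
    instructions="Draft customer and executive updates covering impact, scope, mitigation, owner, and next update.",
    functions=[search_kb, handoff_to_handoff_writer],
)

handoff_writer_agent = Agent(
    name="HandoffWriter",
    model="gpt-4o-mini",
    instructions="Write on-call handoff notes using the template from the knowledge base.",
    functions=[search_kb],
)

critic_agent = Agent(
    name="Critic",
    model="gpt-4o-mini",
    instructions="Review the previous answer for gaps, risks, and clarity, then return an improved version.",
    functions=[search_kb],
)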
Running the Orchestration Pipeline
We assemble the full orchestration pipeline that executes triage, specialist reasoning, and critic review in sequence.
def run_pipeline(user_request: str):
    # Stage 1: triage routes the request and the chosen specialist drafts a response.
    messages = [{"role": "user", "content": user_request}]
    r1 = client.run(agent=triage_agent, messages=messages, max_turns=8)
    # Stage 2: the critic reviews and refines the specialist's answer.
    messages2 = r1.messages + [{"role": "user", "content": "Review and improve the last answer"}]
    r2 = client.run(agent=critic_agent, messages=messages2, max_turns=4)
    return r2.messages[-1]["content"]
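With everything wired up, we can exercise the full pipeline on a sample incident (the scenario below is invented for demonstration):

# Example incident request; the wording is made up for illustration.
report = run_pipeline(
    "Our checkout API p95 latency tripled right after the 14:00 deploy. "
    "Error rate is stable. Propose mitigations and draft a customer update."
)
print(report)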
Conclusion
We have established a clear pattern for designing agent-oriented systems with OpenAI Swarm, demonstrating how to route tasks intelligently and improve output quality through a critic loop. This approach lets us scale from experimentation to operational use cases.