Arena-as-a-Judge: How to Compare LLM Outputs Head-to-Head
Learn how to set up an Arena-as-a-Judge workflow to compare LLM outputs head-to-head using GPT-5 as an evaluator. The tutorial includes code, sample prompts, and interpretation of evaluation logs.
What the Arena-as-a-Judge Approach Is
The Arena-as-a-Judge approach evaluates large language model outputs by running head-to-head comparisons between responses rather than assigning isolated numeric scores. You define the evaluation criteria — for example, helpfulness, clarity, tone, or empathy — and use an evaluator LLM to pick the better response for each pair.
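To make the mechanics concrete before introducing any tooling, here is a minimal sketch of a single pairwise judgment, assuming an OpenAI-style chat completion; the prompt wording and the judge_pair helper are illustrative only, and the deepeval setup later in this tutorial handles this orchestration for you:
from openai import OpenAI

# Assumes OPENAI_API_KEY is already set in the environment.
client = OpenAI()

def judge_pair(task: str, response_a: str, response_b: str, criteria: str) -> str:
    # Show the judge model both candidates and ask it to pick one against the criteria.
    judge_prompt = (
        f"Task: {task}\n\n"
        f"Evaluation criteria: {criteria}\n\n"
        f"Response A:\n{response_a}\n\n"
        f"Response B:\n{response_b}\n\n"
        "Which response better satisfies the criteria? "
        "Answer with 'A' or 'B' and a one-sentence justification."
    )
    result = client.chat.completions.create(
        model="gpt-5",  # the judge model used throughout this tutorial
        messages=[{"role": "user", "content": judge_prompt}],
    )
    return result.choices[0].message.content
In practice you would also randomize which response appears as A and which as B, since judge models can show position bias.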
Tools and Models Used
This tutorial demonstrates the approach using OpenAI's GPT-4.1 and Google Gemini 2.5 Pro as the contestant models, with GPT-5 acting as the judge. The example task is a simple customer support email reply, but the method generalizes to other generation tasks.
Example Customer Context
We use a short customer message describing a wrong shipment as the test context:
Dear Support,
I ordered a wireless mouse last week, but I received a keyboard instead.
Can you please resolve this as soon as possible?
Thank you,
John
Installing Dependencies
Install the Python packages required for generation and evaluation:
pip install deepeval google-genai openai
You will need API keys for both OpenAI and Google to run the examples. Generate them from the respective platforms before proceeding.
Setting API Keys in the Environment
Set your API keys securely in the environment (the example uses getpass to avoid storing keys in plain text):
import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass('Enter OpenAI API Key: ')
os.environ["GOOGLE_API_KEY"] = getpass('Enter Google API Key: ')
Defining the Test Context and Prompt
Build a context variable containing the customer's email and a prompt that instructs models to write a reply:
from deepeval.test_case import ArenaTestCase, LLMTestCase, LLMTestCaseParams
from deepeval.metrics import ArenaGEval
context_email = """
Dear Support,
I ordered a wireless mouse last week, but I received a keyboard instead.
Can you please resolve this as soon as possible?
Thank you,
John
"""
prompt = f"""
{context_email}
--------
Q: Write a response to the customer email above.
"""Generating a Response with OpenAI (GPT-4.1)
Call the OpenAI API to generate the model's reply. The example shows a simple chat completion wrapper:
from openai import OpenAI
client = OpenAI()
def get_openai_response(prompt: str, model: str = "gpt-4.1") -> str:
    # Send the prompt as a single user message and return the model's reply text
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "user", "content": prompt}
        ]
    )
    return response.choices[0].message.content
openAI_response = get_openai_response(prompt=prompt)
Generating a Response with Gemini (Gemini 2.5 Pro)
Use Google's GenAI client to produce the Gemini response:
from google import genai
client = genai.Client()
def get_gemini_response(prompt: str, model: str = "gemini-2.5-pro") -> str:
    # Generate a reply with the Gemini model and return its text content
    response = client.models.generate_content(
        model=model,
        contents=prompt
    )
    return response.text
geminiResponse = get_gemini_response(prompt=prompt)
Setting Up the Arena Test Case
Create an ArenaTestCase that includes both model outputs for the same input and context. These will be compared by the evaluator LLM:
a_test_case = ArenaTestCase(
contestants={
"GPT-4": LLMTestCase(
input="Write a response to the customer email above.",
context=[context_email],
actual_output=openAI_response,
),
"Gemini": LLMTestCase(
input="Write a response to the customer email above.",
context=[context_email],
actual_output=geminiResponse,
),
},
)
Defining the Evaluation Metric (ArenaGEval)
Define a metric that instructs the judge (GPT-5) what to look for. In this example the metric emphasizes empathy, professionalism, and clarity. Verbose logging is enabled for insight into the evaluator's reasoning:
metric = ArenaGEval(
name="Support Email Quality",
criteria=(
"Select the response that best balances empathy, professionalism, and clarity. "
"It should sound understanding, polite, and be succinct."
),
evaluation_params=[
LLMTestCaseParams.CONTEXT,
LLMTestCaseParams.INPUT,
LLMTestCaseParams.ACTUAL_OUTPUT,
],
model="gpt-5",
verbose_mode=True
)
Running the Evaluation
Call the metric to evaluate the test case. The evaluator will compare the two responses and return a winner along with verbose logs if enabled:
metric.measure(a_test_case)
Sample verbose output from the evaluator may look like this:
**************************************************
Support Email Quality [Arena GEval] Verbose Logs
**************************************************
Criteria:
Select the response that best balances empathy, professionalism, and clarity. It should sound understanding,
polite, and be succinct.
Evaluation Steps:
[
"From the Context and Input, identify the user's intent, needs, tone, and any constraints or specifics to be
addressed.",
"Verify the Actual Output directly responds to the Input, uses relevant details from the Context, and remains
consistent with any constraints.",
"Evaluate empathy: check whether the Actual Output acknowledges the user's situation/feelings from the
Context/Input in a polite, understanding way.",
"Evaluate professionalism and clarity: ensure respectful, blame-free tone and concise, easy-to-understand
wording; choose the response that best balances empathy, professionalism, and succinct clarity."
]
Winner: GPT-4
Reason: GPT-4 delivers a single, concise, and professional email that directly addresses the context (acknowledges
receiving a keyboard instead of the ordered wireless mouse), apologizes, and clearly outlines next steps (send the
correct mouse and provide return instructions) with a polite verification step (requesting a photo). This best
matches the request to write a response and balances empathy and clarity. In contrast, Gemini includes multiple
options with meta commentary, which dilutes focus and fails to provide one clear reply; while empathetic and
detailed (e.g., acknowledging frustration and offering prepaid labels), the multi-option format and an over-assertive claim of already locating the order reduce professionalism and succinct clarity compared to GPT-4.
======================================================================
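Beyond reading the verbose logs, you can consume the verdict programmatically. The attribute names below (winner and reason) are assumptions based on deepeval's Arena G-Eval documentation and may differ between library versions, so treat this as a sketch and check the docs for your installed release:
# After metric.measure(a_test_case) has run:
# winner/reason are assumed attribute names; verify against your deepeval version.
print(metric.winner)   # e.g. "GPT-4"
print(metric.reason)   # the judge's explanation for its choice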
Interpreting the Results
In the example run, GPT-4 was chosen as the winner because its reply was concise, polite, and action-oriented — it apologized, verified the issue, and outlined clear next steps. The Gemini output, while empathetic and thorough, presented multiple reply options and additional commentary that reduced focus and succinctness.
This Arena-as-a-Judge workflow helps you compare models on criteria you care about and provides human-readable reasoning when verbose mode is enabled. The pattern is straightforward to extend to more models, different criteria, or additional test cases.
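For instance, assuming your deepeval version accepts more than two contestants (the contestants argument shown above is a dictionary, so the extension is natural), adding a third model is just another entry; the claude_response variable below is a hypothetical placeholder for whatever additional output you generate:
# "Claude" and claude_response are illustrative placeholders for any extra model you want to compare.
extended_test_case = ArenaTestCase(
    contestants={
        "GPT-4": LLMTestCase(
            input="Write a response to the customer email above.",
            context=[context_email],
            actual_output=openAI_response,
        ),
        "Gemini": LLMTestCase(
            input="Write a response to the customer email above.",
            context=[context_email],
            actual_output=geminiResponse,
        ),
        "Claude": LLMTestCase(
            input="Write a response to the customer email above.",
            context=[context_email],
            actual_output=claude_response,
        ),
    },
)
metric.measure(extended_test_case)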
For full source code, configuration details, and further examples, see the original project's repository and linked resources.