Arena as a Judge: Comparing LLM Responses Head-to-Head

What Is the Arena-as-a-Judge Approach

The Arena-as-a-Judge approach evaluates large language model responses through head-to-head comparisons instead of assigning each response its own numeric score. You define the criteria (for example, empathy, clarity, tone), and a judge model picks the better response in each pair.
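The idea can be sketched without any LLM at all. In the minimal sketch below, `run_duel`, the `Judge` type, and `shortest_answer_judge` are hypothetical names introduced only for illustration; the toy judge simply prefers the shortest answer and stands in for a real judge model.

```python
from typing import Callable, Dict

# Hypothetical judge signature: (criteria, labeled answers) -> label of the winner.
Judge = Callable[[str, Dict[str, str]], str]

def run_duel(criteria: str, answers: Dict[str, str], judge: Judge) -> str:
    """Ask the judge to pick one winner among the labeled answers."""
    winner = judge(criteria, answers)
    if winner not in answers:
        raise ValueError(f"Judge returned an unknown label: {winner!r}")
    return winner

def shortest_answer_judge(criteria: str, answers: Dict[str, str]) -> str:
    # Toy stand-in for an LLM judge: ignores the criteria and prefers brevity.
    return min(answers, key=lambda label: len(answers[label]))

duel_winner = run_duel(
    "Prefer the most succinct reply.",
    {
        "A": "Sorry about the mix-up. A replacement is on its way.",
        "B": "We sincerely apologize for this mistake and would like to explain our process in detail.",
    },
    shortest_answer_judge,
)
print(duel_winner)
```

The judge returns a label rather than a score, so the result is always relative to the other contestant in the same duel.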

Which Tools Are Used

In this example the contestants are GPT-4.1 (OpenAI) and Gemini 2.5 Pro (Google), with GPT-5 acting as the judge. The task is replying to a customer support email, but the method applies to any generation task.

User Context

The input is a short customer email about an incorrect delivery:

Dear Support,
I ordered a wireless mouse last week, but I received a keyboard instead. 
Can you please resolve this as soon as possible?
Thank you,
John

Installing Dependencies

Install the required Python packages:

pip install deepeval google-genai openai

You will need OpenAI and Google API keys; obtain them from the respective platforms before running the code.

Setting API Keys in the Environment

The keys can be entered through getpass so that they are never stored in plain text in the script or notebook:

import os
from getpass import getpass

# Prompt for each key without echoing it, then expose it to the client libraries.
os.environ["OPENAI_API_KEY"] = getpass("Enter OpenAI API Key: ")
os.environ["GOOGLE_API_KEY"] = getpass("Enter Google API Key: ")
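If the keys may already be exported in the shell, prompting every time is unnecessary. The following is a minimal sketch of a helper that prompts only when a variable is missing; `ensure_api_key` is a hypothetical name, not part of any of the libraries used here.

```python
import os
from getpass import getpass

def ensure_api_key(name: str) -> str:
    """Return the key from the environment, prompting only if it is not set."""
    key = os.environ.get(name)
    if not key:
        # getpass hides the input, so the key is not echoed to the terminal.
        key = getpass(f"Enter {name}: ")
        os.environ[name] = key
    return key
```

Calling `ensure_api_key("OPENAI_API_KEY")` and `ensure_api_key("GOOGLE_API_KEY")` before creating the clients would then cover both the interactive and the pre-configured case.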

Defining the Context and Prompt

Create a variable holding the customer email and build the prompt used to generate a response:

from deepeval.test_case import ArenaTestCase, LLMTestCase, LLMTestCaseParams
from deepeval.metrics import ArenaGEval
 
context_email = """
Dear Support,
I ordered a wireless mouse last week, but I received a keyboard instead. 
Can you please resolve this as soon as possible?
Thank you,
John
"""
 
prompt = f"""
{context_email}
--------
 
Q: Write a response to the customer email above.
"""

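To reuse the same prompt layout for other emails, the template can be wrapped in a small function. This is a sketch only: `build_prompt` and `sample_email` are hypothetical names, and the whitespace differs slightly from the literal prompt string used in this walkthrough.

```python
def build_prompt(
    email: str,
    instruction: str = "Write a response to the customer email above.",
) -> str:
    """Place the email above a separator line, followed by the instruction."""
    return f"{email.strip()}\n--------\n\nQ: {instruction}\n"

# Shortened sample email, used only to demonstrate the helper.
sample_email = (
    "Dear Support,\n"
    "I ordered a wireless mouse last week, but I received a keyboard instead.\n"
    "John"
)
sample_prompt = build_prompt(sample_email)
print(sample_prompt)
```
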
Generating a Response with OpenAI (GPT-4.1)

An example OpenAI API call that returns the model's response:

from openai import OpenAI

# The client reads OPENAI_API_KEY from the environment.
openai_client = OpenAI()

def get_openai_response(prompt: str, model: str = "gpt-4.1") -> str:
    # Send the prompt as a single user message and return the reply text.
    response = openai_client.chat.completions.create(
        model=model,
        messages=[
            {"role": "user", "content": prompt}
        ]
    )
    return response.choices[0].message.content

openAI_response = get_openai_response(prompt=prompt)

Generating a Response with Gemini (Gemini 2.5 Pro)

An example call through the Google GenAI client that returns Gemini's response:

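from google import genai

# A separate name keeps this client from shadowing the OpenAI one.
gemini_client = genai.Client()

def get_gemini_response(prompt: str, model: str = "gemini-2.5-pro") -> str:
    # Send the prompt and return the text of Gemini's reply.
    response = gemini_client.models.generate_content(
        model=model,
        contents=prompt
    )
    return response.text

geminiResponse = get_gemini_response(prompt=prompt)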

Setting Up the Arena Test

Create an ArenaTestCase holding the two responses to the same input and context; these are the responses the judge will compare. The contestant labeled "GPT-4" holds the GPT-4.1 output:

a_test_case = ArenaTestCase(
    contestants={
        "GPT-4": LLMTestCase(
            input="Write a response to the customer email above.",
            context=[context_email],
            actual_output=openAI_response,
        ),
        "Gemini": LLMTestCase(
            input="Write a response to the customer email above.",
            context=[context_email],
            actual_output=geminiResponse,
        ),
    },
)

Defining the Evaluation Metric (ArenaGEval)

Describe what the judge model should look for. This example emphasizes empathy, professionalism, and clarity. Verbose mode is enabled so that the judge's reasoning is printed:

metric = ArenaGEval(
    name="Support Email Quality",
    criteria=(
        "Select the response that best balances empathy, professionalism, and clarity. "
        "It should sound understanding, polite, and be succinct."
    ),
    evaluation_params=[
        LLMTestCaseParams.CONTEXT,
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
    model="gpt-5",  
    verbose_mode=True
)

Running the Evaluation

Call the metric's measure method to compare the responses. The judge picks a winner and, with verbose mode on, prints detailed logs:

metric.measure(a_test_case)

Example verbose output from the judge:

**************************************************
Support Email Quality [Arena GEval] Verbose Logs
**************************************************
Criteria:
Select the response that best balances empathy, professionalism, and clarity. It should sound understanding, 
polite, and be succinct. 
 
Evaluation Steps:
[
    "From the Context and Input, identify the user's intent, needs, tone, and any constraints or specifics to be 
addressed.",
    "Verify the Actual Output directly responds to the Input, uses relevant details from the Context, and remains 
consistent with any constraints.",
    "Evaluate empathy: check whether the Actual Output acknowledges the user's situation/feelings from the 
Context/Input in a polite, understanding way.",
    "Evaluate professionalism and clarity: ensure respectful, blame-free tone and concise, easy-to-understand 
wording; choose the response that best balances empathy, professionalism, and succinct clarity."
] 
 
Winner: GPT-4
 
Reason: GPT-4 delivers a single, concise, and professional email that directly addresses the context (acknowledges 
receiving a keyboard instead of the ordered wireless mouse), apologizes, and clearly outlines next steps (send the 
correct mouse and provide return instructions) with a polite verification step (requesting a photo). This best 
matches the request to write a response and balances empathy and clarity. In contrast, Gemini includes multiple 
options with meta commentary, which dilutes focus and fails to provide one clear reply; while empathetic and 
detailed (e.g., acknowledging frustration and offering prepaid labels), the multi-option format and an over-assertive claim of already locating the order reduce professionalism and succinct clarity compared to GPT-4.
======================================================================

What the Results Mean

In this example the contestant labeled "GPT-4" (the GPT-4.1 output) was declared the winner for a concise, polite, action-oriented reply: an apology, an acknowledgment of the problem, and clear resolution steps. Gemini showed empathy and detail, but its multi-option format with added meta commentary reduced focus and brevity.

An Arena-style process like this lets you compare models on the criteria that matter to you and get a reasoned explanation of the judge's choice, which makes it easier to decide which model's output suits a particular task.
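A single duel is one sample, so a decision between models would usually rest on several prompts. The sketch below assumes the winner labels from repeated duels have been collected into a list; `win_rates` is a hypothetical helper, and the labels and values are illustrative placeholders, not measured results.

```python
from collections import Counter
from typing import Dict, Iterable

def win_rates(winners: Iterable[str]) -> Dict[str, float]:
    """Convert a sequence of duel winners into each label's share of wins."""
    counts = Counter(winners)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: wins / total for label, wins in counts.items()}

# Placeholder winner labels from four hypothetical duels.
example_winners = ["model_a", "model_b", "model_a", "model_a"]
rates = win_rates(example_winners)
print(rates)
```

A label that never wins does not appear in the result, so compare against the full list of contestants when reading the output.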

For the full code and additional examples, see the original repository and the attached resources.