Revolutionizing AI Evaluation: The Power of Contextualized Queries
New research shows that adding context to ambiguous user queries significantly improves AI model evaluation, revealing biases and even reversing model rankings.
The Challenge of Ambiguous Queries in Language Models
Users often submit questions to language models without providing enough detail, which makes it hard to infer their true intent. For example, a question like “What book should I read next?” depends heavily on personal taste, while “How do antibiotics work?” calls for an explanation tailored to the user's background knowledge. Traditional evaluation methods frequently ignore this missing context, leading to inconsistent or unfair assessments of model responses. A response that seems acceptable on the surface, such as one that simply praises coffee when a user asks whether coffee is good for them, might be unhelpful or even harmful depending on the user's specific circumstances.
Prior Approaches to Handling Context and Personalization
Previous research has concentrated on generating clarification questions to resolve ambiguities in tasks like Q&A, dialogue systems, and information retrieval. These approaches aim to better capture user intent. Similarly, studies on instruction-following and personalization stress the importance of customizing responses according to user attributes such as expertise, age, or stylistic preferences. Some research has also explored how models adapt to diverse contexts and proposed training techniques to enhance this adaptability. Moreover, language model-based evaluators, while efficient, can exhibit bias, prompting efforts to improve fairness through clearer evaluation criteria.
Introducing Contextualized Evaluations
A team of researchers from the University of Pennsylvania, the Allen Institute for AI, and the University of Maryland, College Park, introduced a method called contextualized evaluations. This technique enriches underspecified queries with synthetic context in the form of follow-up question-answer pairs that clarify ambiguous inputs during language model evaluation. Their findings indicate that including context can profoundly affect evaluation results, sometimes even reversing model rankings, and can improve agreement among evaluators. This approach reduces reliance on superficial features like style and uncovers biases toward WEIRD (Western, Educated, Industrialized, Rich, Democratic) contexts in default model responses. It also highlights varying model sensitivities to different user contexts.
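The article does not reproduce the paper's prompt templates, but the core move, pairing a vague query with synthetic follow-up question-answer pairs before evaluation, can be sketched roughly as follows. The ContextQA structure, the contextualize helper, and the prompt wording are illustrative assumptions, not the authors' code.

```python
from dataclasses import dataclass

@dataclass
class ContextQA:
    """One synthetic follow-up question and the simulated user's answer."""
    question: str
    answer: str

def contextualize(query: str, context: list[ContextQA]) -> str:
    """Fold follow-up Q&A pairs into the query so evaluators see the
    clarified intent rather than only the ambiguous question."""
    lines = [f"Query: {query}", "Additional context from the user:"]
    lines += [f"- Q: {c.question}\n  A: {c.answer}" for c in context]
    return "\n".join(lines)

# Example: the article's book-recommendation query, clarified with
# hypothetical user attributes.
print(contextualize(
    "What book should I read next?",
    [ContextQA("What genres do you enjoy?", "Hard science fiction"),
     ContextQA("What did you read recently?", "Project Hail Mary")],
))
```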
Methodology and Impact
The researchers developed a straightforward framework to assess how language models perform with clearer, context-rich queries. They selected underspecified queries from popular benchmark datasets and enriched them with follow-up Q&A pairs simulating user-specific contexts. Responses from various language models were gathered, and both human and model-based evaluators compared answers under two conditions: with only the original query and with the added context. This setup made it possible to measure how added context changes model rankings, evaluator agreement, and judgment criteria, offering a practical way to evaluate models on the kind of ambiguity found in real-world queries.
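A minimal sketch of that two-condition comparison, assuming a generic judge callable that returns "A", "B", or "tie"; the function names and prompt wording below are assumptions for illustration, not the study's actual evaluation harness.

```python
from collections import Counter
from typing import Callable, Optional

# Any human or LLM-based evaluator; takes a prompt, returns "A", "B", or "tie".
Judge = Callable[[str], str]

def compare_pair(query: str, resp_a: str, resp_b: str,
                 judge: Judge, context: Optional[str] = None) -> str:
    """Ask the evaluator to pick the better response, optionally showing
    the follow-up Q&A context alongside the original query."""
    parts = [f"Query: {query}"]
    if context:
        parts.append(f"User context:\n{context}")
    parts += [f"Response A: {resp_a}",
              f"Response B: {resp_b}",
              "Which response better serves this user? Answer A, B, or tie."]
    return judge("\n\n".join(parts))

def win_rates(examples: list[dict], judge: Judge, with_context: bool) -> Counter:
    """Tally preferences over a benchmark under one condition
    (query-only vs. query plus added context)."""
    tally = Counter()
    for ex in examples:
        ctx = ex["context"] if with_context else None
        tally[compare_pair(ex["query"], ex["resp_a"], ex["resp_b"], judge, ctx)] += 1
    return tally
```

Running win_rates twice, once with with_context=False and once with True, and comparing the resulting tallies across multiple evaluators mirrors the kind of with/without-context analysis of rankings and agreement described above.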
Key Findings
Adding context such as user intent or audience significantly improves model evaluation. Inter-rater agreement increased by 3–10%, and in some cases model rankings were reversed: for instance, GPT-4 outperformed Gemini-1.5-Flash only when context was provided. Without context, evaluators tended to focus on tone or fluency; with context, the emphasis shifted toward accuracy and helpfulness. Default model outputs often reflect Western, formal, general-audience biases, limiting their usefulness for diverse users. Ignoring context in benchmarks risks producing unreliable results. To ensure fairness and relevance, evaluations should pair context-rich prompts with scoring rubrics that reflect real user needs.
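To make the last point concrete, a context-aware rubric could look something like the following; the criteria here are hypothetical examples derived from follow-up Q&A context, not the rubric used in the study.

```python
# Hypothetical per-query rubric derived from the follow-up Q&A context;
# each criterion is scored separately instead of issuing one holistic,
# style-driven preference.
rubric = {
    "matches_intent":   "Does the response address the purpose the user stated?",
    "fits_expertise":   "Is the explanation pitched at the user's stated background?",
    "audience_fit":     "Are tone and framing appropriate for the stated audience?",
    "factual_accuracy": "Are the claims correct and relevant in this context?",
}
```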
Implications for Future AI Evaluation
Many user queries lack crucial context like intent or expertise, making evaluation subjective and unreliable. Contextualized evaluations enrich queries with relevant follow-up questions and answers, shifting focus towards meaningful criteria such as helpfulness and revealing hidden biases. Although this study uses a limited set of context types and partly relies on automated scoring, it strongly supports the adoption of more context-aware evaluation methods in future research.