
Salesforce Unveils UAEval4RAG: Benchmarking RAG Systems on Rejecting Unanswerable Queries

Salesforce Research introduces UAEval4RAG, a new benchmark framework that evaluates RAG systems' ability to reject unanswerable queries across diverse categories, enhancing the reliability of AI responses.

Addressing the Challenge of Unanswerable Queries in RAG Systems

Retrieval-Augmented Generation (RAG) systems let language models answer questions from external knowledge without extensive retraining. However, existing evaluation methods primarily measure accuracy and relevance on answerable questions, overlooking a vital capability: rejecting queries the system cannot answer. This shortfall poses significant risks in practical applications, where incorrect or misleading responses can cause harm.

Current benchmarks for unanswerable queries fall short for RAG systems because they typically rely on static, generic requests that are not grounded in the specific knowledge base a system actually queries. As a result, rejections of such queries often stem from retrieval errors rather than a genuine recognition that the knowledge base cannot answer them, exposing a critical gap in existing evaluation frameworks.

Existing Research and Limitations

Research into unanswerability has shed light on model noncompliance with ambiguous or underspecified questions. Evaluations of RAG systems have progressed with techniques like RAGAS and ARES assessing document relevance, and RGB and MultiHop-RAG focusing on output accuracy. Some recent benchmarks attempt to evaluate the rejection of unanswerable queries but generally rely on LLM-generated contexts and only test a narrow range of unanswerable types. This leaves the diversity and adaptability of rejections across different knowledge bases untested.

Introducing UAEval4RAG

Salesforce Research has introduced UAEval4RAG, a novel framework that generates datasets of unanswerable queries tailored to any external knowledge base and evaluates RAG systems automatically. It tests not only the quality of responses to answerable queries but also the ability to reject six categories of unanswerable requests: Underspecified, False-presuppositions, Nonsensical, Modality-limited, Safety Concerns, and Out-of-Database.
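
As a rough sketch of how such a dataset might be represented in code (the class and field names below are illustrative assumptions, not the framework's released API), each generated query can carry the category it is meant to probe:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class UnanswerableCategory(Enum):
    """The six unanswerable-request categories covered by UAEval4RAG."""
    UNDERSPECIFIED = "underspecified"              # missing the details needed to answer
    FALSE_PRESUPPOSITION = "false_presupposition"  # rests on an assumption the knowledge base contradicts
    NONSENSICAL = "nonsensical"                    # incoherent or meaningless request
    MODALITY_LIMITED = "modality_limited"          # needs a modality (image, audio, ...) the knowledge base lacks
    SAFETY_CONCERN = "safety_concern"              # should be refused on safety grounds
    OUT_OF_DATABASE = "out_of_database"            # the answer simply is not in the knowledge base


@dataclass
class EvalQuery:
    """One benchmark item: answerable, or tagged with the category it should be rejected under."""
    text: str
    category: Optional[UnanswerableCategory] = None  # None means the query is answerable
```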

An automated pipeline creates diverse, challenging requests for each knowledge base. Evaluations then use two LLM-based metrics, the Unanswered Ratio and the Acceptable Ratio, to measure how well a system handles them.
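
A minimal sketch of how these two ratios could be computed once an LLM judge has labeled each response; the `Verdict` fields and the denominators are assumptions made for illustration rather than the framework's actual implementation:

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Verdict:
    """Hypothetical LLM-judge labels for one response to an unanswerable query."""
    rejected: bool    # did the RAG system decline to answer?
    acceptable: bool  # was the response acceptable (e.g., explained why, or asked for clarification)?


def rejection_metrics(verdicts: Sequence[Verdict]) -> dict[str, float]:
    """Aggregate judge labels over all unanswerable queries in the benchmark."""
    n = len(verdicts)
    return {
        # share of unanswerable queries the system declined to answer
        "unanswered_ratio": sum(v.rejected for v in verdicts) / n,
        # share of unanswerable queries whose responses the judge deemed acceptable
        "acceptable_ratio": sum(v.acceptable for v in verdicts) / n,
    }
```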

Comprehensive Evaluation of RAG Components

UAEval4RAG assesses the impact of various RAG components, including embedding models, retrieval methods, rewriting strategies, rerankers, different LLMs, and prompting techniques. Testing 27 configurations across four benchmarks revealed that no single setup excels universally, owing to differing knowledge distributions across datasets. The choice of LLM matters significantly; for example, Claude 3.5 Sonnet improved correctness by 0.4% and the acceptable rejection ratio by 10.4% compared with GPT-4o.
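
The component sweep can be pictured as a grid over interchangeable parts, roughly as below; the specific component names are illustrative placeholders, not the exact 27 configurations reported in the paper:

```python
from itertools import product

# Illustrative component choices; the paper's actual grid is different.
embedders = ["bge-large", "text-embedding-3-small"]
retrievers = ["dense", "hybrid"]
rerankers = [None, "cross-encoder"]
llms = ["gpt-4o", "claude-3.5-sonnet"]
prompts = ["baseline", "rejection-aware"]

for embedder, retriever, reranker, llm, prompt in product(
    embedders, retrievers, rerankers, llms, prompts
):
    # In a real run: assemble the RAG pipeline from these parts, answer every
    # benchmark query, then score the outputs with the UAEval4RAG metrics.
    ...
```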

Prompt design also influences outcomes strongly, with optimal prompts boosting unanswerable query rejection performance by 80%. The framework uses three metrics — Acceptable Ratio, Unanswered Ratio, and Joint Score — to comprehensively evaluate rejection capabilities.
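
The article does not spell out how the Joint Score combines the other metrics, so the function below is only one plausible (assumed) combination, a harmonic mean that rewards systems that both answer answerable queries correctly and reject unanswerable ones gracefully:

```python
def joint_score(answerable_accuracy: float, acceptable_ratio: float) -> float:
    """Assumed harmonic-mean combination; the paper's exact definition may differ.

    A harmonic mean penalizes trading one capability for the other: a system
    that rejects everything scores near zero on answerable accuracy and is
    punished accordingly, and vice versa.
    """
    if answerable_accuracy + acceptable_ratio == 0:
        return 0.0
    return 2 * answerable_accuracy * acceptable_ratio / (answerable_accuracy + acceptable_ratio)
```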

Validation and Insights

UAEval4RAG demonstrates high effectiveness in generating unanswerable queries, achieving 92% accuracy and strong agreement among evaluators on the TriviaQA and Musique datasets. The LLM-based metrics showed robust accuracy and F1 scores across three different LLMs, indicating that the evaluation is reliable regardless of the backbone model. Analysis also highlights that prompt design affects both hallucination control and query rejection, while dataset characteristics such as modality requirements and safety concerns influence performance.

Future Directions

The framework addresses a critical gap by focusing on unanswerable query rejection in RAG systems. Future improvements could integrate more diverse, human-verified data to enhance generalizability. Tailoring evaluation metrics to specific use cases might improve effectiveness further. Currently centered on single-turn interactions, extending the framework to multi-turn dialogues would better reflect real-world scenarios, where clarifying questions help manage ambiguous or underspecified queries.

For more details, check out the original research paper.


All credit goes to the Salesforce research team.
