
New Dataset Teaches AI to Admit Uncertainty, Reducing Hallucinations in Language Models

Researchers developed the SUM dataset to teach AI language models to say 'I don't know,' significantly reducing hallucinations and improving refusal rates without harming accuracy on answerable problems.

Addressing the Hallucination Problem in Reinforcement Finetuning

Reinforcement finetuning improves large language models (LLMs) by using reward signals to encourage desirable, logical, and structured outputs. However, these models often struggle to indicate uncertainty when faced with incomplete or ambiguous queries. Instead of refusing to answer, they produce confidently incorrect responses—a problem termed the "hallucination tax." This issue is particularly concerning in applications requiring high trust and precision.

Limitations of Current Training Approaches

Most reinforcement finetuning frameworks reward correct answers and penalize wrong ones but do not incentivize refusal behavior. As a result, models become overconfident and rarely decline to answer unclear questions. Studies show refusal rates dropping to near zero across multiple models after standard reinforcement finetuning.
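
To see why this happens, consider a minimal sketch of a typical verifiable-reward scheme (an illustrative assumption, not the exact reward used in the paper): the reward is binary, so a refusal can never earn credit and confident guessing dominates.

```python
def standard_rft_reward(model_answer: str, gold_answer: str) -> float:
    """Typical verifiable-reward scheme for reinforcement finetuning:
    1.0 for a correct final answer, 0.0 for anything else.

    A refusal such as "I don't know" can never earn positive reward here,
    which pushes the policy toward confident guessing even on
    under-specified or unanswerable questions.
    """
    return 1.0 if model_answer.strip() == gold_answer.strip() else 0.0
```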

Introduction of the Synthetic Unanswerable Math (SUM) Dataset

Researchers at the University of Southern California created the SUM dataset to tackle this challenge. SUM consists of implicitly unanswerable math problems produced by altering existing questions, either removing key information or introducing logical inconsistencies, while keeping them plausible. The dataset builds on DeepScaleR problems, with the o3-mini model rewriting them into unanswerable variants; training on SUM teaches models to respond with "I don't know" when a question cannot be answered reliably.
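
The rewriting step can be pictured roughly as follows. This is an illustrative sketch only: the prompt wording, the `make_unanswerable` helper, and the use of the OpenAI chat-completions client are assumptions based on the description above, not the authors' pipeline.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical rewrite prompt; the paper's actual instructions may differ.
REWRITE_PROMPT = """Rewrite the following math problem so that it can no longer be
answered, either by removing a key piece of information or by introducing a subtle
logical inconsistency. Keep the wording natural and plausible.

Original problem:
{problem}

Unanswerable version:"""

def make_unanswerable(problem: str) -> str:
    """Ask o3-mini to turn an answerable DeepScaleR problem into an
    implicitly unanswerable variant."""
    response = client.chat.completions.create(
        model="o3-mini",
        messages=[{"role": "user", "content": REWRITE_PROMPT.format(problem=problem)}],
    )
    return response.choices[0].message.content.strip()
```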

Training Strategy and Results

By mixing 10% SUM data with regular training data, models learn to evaluate uncertainty and refuse to answer when appropriate without compromising accuracy on solvable problems. For example, the Qwen2.5-7B model's refusal rate increased dramatically on benchmarks like SUM (from 0.01 to 0.73) and UMWP (from 0.01 to 0.81), and refusal accuracy rose on the SelfAware dataset (from 0.01 to 0.94). Similar improvements were observed for the Llama-3.1-8B-Instruct model. Meanwhile, performance on answerable datasets such as GSM8K and MATH-500 remained stable, with negligible accuracy drops.
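
The core idea of the training recipe can be summarized in two pieces: blend a small fraction of unanswerable SUM items into the training pool, and give reward for refusing on exactly those items. The sketch below is a minimal illustration under those assumptions; the refusal-string matching and the helper names are hypothetical, not the authors' code.

```python
import random

REFUSAL = "I don't know"

def mix_training_data(answerable, unanswerable, sum_ratio=0.1, seed=0):
    """Blend a small fraction (e.g. 10%) of unanswerable SUM problems
    into the regular reinforcement-finetuning pool."""
    rng = random.Random(seed)
    # Number of SUM items needed so they make up `sum_ratio` of the final mix.
    n_sum = int(len(answerable) * sum_ratio / (1 - sum_ratio))
    mixed = list(answerable) + rng.sample(unanswerable, min(n_sum, len(unanswerable)))
    rng.shuffle(mixed)
    return mixed

def reward_with_refusal(model_answer, gold_answer, is_answerable):
    """Reward scheme that also credits refusal, but only on unanswerable items."""
    refused = REFUSAL.lower() in model_answer.lower()
    if is_answerable:
        return 1.0 if (not refused and model_answer.strip() == gold_answer.strip()) else 0.0
    return 1.0 if refused else 0.0
```

Because refusal earns reward only when the problem is genuinely unanswerable, the model is not pushed toward blanket refusals, which is consistent with the stable accuracy reported on GSM8K and MATH-500.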

Implications for AI Trustworthiness

This approach demonstrates a critical trade-off: while reinforcement finetuning can suppress cautious refusal behavior, integrating a small portion of unanswerable data like SUM helps models recognize their own limitations. Teaching AI to admit uncertainty fosters more trustworthy, honest, and reliable systems, marking a significant advancement in AI development.

Additional Resources

The research paper and dataset are available on Hugging Face. Full credit goes to the University of Southern California research team behind this project.
