Tackling Over-Refusal in Language Models: The FalseReject Dataset
The FalseReject dataset helps language models overcome excessive caution by training them to respond appropriately to sensitive yet harmless prompts, enhancing AI usefulness and safety.
The Challenge of Over-Refusal in Language Models
Many advanced language models today tend to err on the side of caution, often refusing to respond to prompts that seem risky but are actually harmless. This over-refusal behavior limits their usefulness in practical applications, where nuanced understanding is crucial.
Introducing FalseReject: A Targeted Dataset
Researchers from Dartmouth College and Amazon created the FalseReject dataset, a large collection of prompts designed to pinpoint and reduce unnecessary refusals. These prompts appear sensitive at first glance but are verified as safe in context. The dataset covers 44 safety-related categories and includes 16,000 prompts, along with human-annotated test sets for evaluation.
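Below is a minimal sketch of how one might load and inspect the dataset with the Hugging Face `datasets` library. The dataset identifier and field names are assumptions; check the dataset card on Hugging Face for the exact schema.

```python
# Minimal sketch: loading and inspecting the FalseReject data.
# The dataset ID and the "category" field are assumptions, not confirmed names.
from collections import Counter
from datasets import load_dataset

ds = load_dataset("AmazonScience/FalseReject", split="train")  # hypothetical ID

print(len(ds))   # expected on the order of 16,000 training prompts
print(ds[0])     # one seemingly sensitive but contextually harmless prompt

# Count prompts per safety-related category, assuming a "category" field exists.
categories = Counter(example.get("category", "unknown") for example in ds)
print(f"{len(categories)} categories represented")
```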
How FalseReject Was Created
The team identified language patterns that trigger refusals and extracted entity graphs from existing safety datasets using Llama-3.1-405B. They employed a multi-agent framework where a Generator creates prompts, a Discriminator assesses their risk, and an Orchestrator validates their safety and usefulness. This iterative process refined prompts to ensure they challenge models without promoting harmful content.
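To make the multi-agent loop concrete, here is an illustrative sketch of how a Generator, Discriminator, and Orchestrator could interact. The function names, prompt wording, and stopping criteria are assumptions for illustration, not the authors' implementation; `call_llm` stands in for any chat-completion client.

```python
# Illustrative sketch of the iterative Generator / Discriminator / Orchestrator loop.
def call_llm(system: str, user: str) -> str:
    raise NotImplementedError("plug in your preferred LLM client here")

def generate_candidate(entity_graph: str) -> str:
    # Generator: drafts a prompt that sounds sensitive but is verifiably harmless.
    return call_llm(
        system="Write a prompt that sounds sensitive but is verifiably harmless.",
        user=f"Entities and relations to build on:\n{entity_graph}",
    )

def assess_risk(prompt: str) -> str:
    # Discriminator: screens each draft for genuine risk.
    return call_llm(
        system="You are a strict safety reviewer. Label the prompt SAFE or UNSAFE.",
        user=prompt,
    )

def validate(prompt: str) -> bool:
    # Orchestrator: confirms the prompt is both harmless and likely to trip over-refusal.
    verdict = call_llm(
        system="Is this prompt harmless yet likely to trigger an unnecessary refusal? YES or NO.",
        user=prompt,
    )
    return verdict.strip().upper().startswith("YES")

def refine_prompt(entity_graph: str, max_rounds: int = 5):
    """Iterate until a candidate passes both checks, or give up."""
    for _ in range(max_rounds):
        candidate = generate_candidate(entity_graph)
        if "UNSAFE" in assess_risk(candidate).upper():
            continue  # discard genuinely risky drafts
        if validate(candidate):
            return candidate
    return None
```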
Benchmarking Language Models
Using the human-annotated test split, FalseReject-Test, the team benchmarked 29 language models spanning the GPT, Claude, Gemini, Llama, Mistral, Cohere, Qwen, Phi, and DeepSeek families. Results revealed persistent over-refusal issues, even in top-performing models like GPT-4.5 and Claude-3.5. Interestingly, some open-source models such as Mistral-7B and DeepSeek-R1 outperformed closed-source counterparts, suggesting open communities can achieve competitive safety alignment.
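A simple way to see what such a benchmark measures is to compute a refusal rate over the benign-but-sensitive test prompts. The sketch below uses a keyword heuristic as a crude stand-in for a proper refusal judge, and the `my_model` wrapper is hypothetical.

```python
# Minimal sketch: estimating over-refusal as the fraction of benign prompts refused.
REFUSAL_MARKERS = (
    "i can't help with", "i cannot assist", "i'm sorry, but",
    "i won't provide", "i can't provide",
)

def looks_like_refusal(response: str) -> bool:
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def refusal_rate(model_generate, prompts: list[str]) -> float:
    """model_generate: callable mapping a prompt string to the model's response."""
    refusals = sum(looks_like_refusal(model_generate(p)) for p in prompts)
    return refusals / len(prompts)

# Usage (assuming `my_model` wraps an API or local model and
# `falsereject_test_prompts` holds the test-set prompts):
# rate = refusal_rate(my_model, falsereject_test_prompts)
# print(f"Refused {rate:.1%} of benign prompts")
```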
Fine-Tuning with FalseReject
By combining FalseReject with general instruction-tuning data, the researchers fine-tuned several base models. Fine-tuned non-reasoning models produced more constructive responses, and reasoning models became both more appropriately cautious and more relevant, without sacrificing general performance. This demonstrates that fine-tuning on FalseReject effectively mitigates over-refusal.
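The data preparation for such a run might look like the sketch below: blend FalseReject training examples into a general instruction mix before supervised fine-tuning. The dataset IDs, field names, and mixing strategy are illustrative assumptions, not the authors' recipe.

```python
# Minimal sketch: mixing FalseReject examples into a general instruction-tuning set.
from datasets import load_dataset, concatenate_datasets

general = load_dataset("tatsu-lab/alpaca", split="train")               # general instructions
falsereject = load_dataset("AmazonScience/FalseReject", split="train")  # hypothetical ID

def to_chat(example, prompt_key, response_key):
    # Normalize both sources into a shared chat-message format.
    return {"messages": [
        {"role": "user", "content": example[prompt_key]},
        {"role": "assistant", "content": example[response_key]},
    ]}

general = general.map(lambda ex: to_chat(ex, "instruction", "output"),
                      remove_columns=general.column_names)
falsereject = falsereject.map(lambda ex: to_chat(ex, "prompt", "response"),  # assumed fields
                              remove_columns=falsereject.column_names)

# Shuffle the blend so over-refusal examples are interleaved with general data.
mixed = concatenate_datasets([general, falsereject]).shuffle(seed=42)
# `mixed` can then be fed to any supervised fine-tuning loop (e.g., TRL's SFTTrainer).
```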
Balancing Safety and Usefulness
While FalseReject advances the ability of language models to handle sensitive topics more intelligently, the fundamental challenge remains: designing filters that balance moral, legal, and practical considerations in a rapidly evolving environment. This work provides a valuable resource and methodology to improve contextual safety and reduce excessive refusals in AI systems.
Further Information
The FalseReject dataset and related resources are available online, including a Hugging Face dataset exploration page and a research paper titled "FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning."