Google DeepMind Launches QuestBench to Test LLMs on Spotting Missing Info in Reasoning Tasks
Google DeepMind introduces QuestBench, a benchmark designed to evaluate how well large language models identify missing information in complex reasoning tasks and generate necessary clarifying questions.
Challenges of Reasoning with Incomplete Information
Large language models (LLMs) have shown strong performance on reasoning tasks such as mathematics, logic, planning, and coding. These idealized tasks, however, assume all necessary information is provided upfront, whereas real-world scenarios are often incomplete or ambiguous: users omit important details, and autonomous systems such as robots operate under partial observability. Closing this gap requires LLMs to proactively gather missing information by identifying what is unknown and asking clarifying questions.
Existing Approaches and Their Limitations
Previous methods tackling information gathering include active learning techniques like Bayesian optimization, reinforcement learning, and planning in partially observable states. Research in natural language ambiguity explores semantic uncertainties and task-oriented dialogues, while LLM question-asking strategies use prompting, information gain, and multi-stage clarification. However, most benchmarks focus on subjective tasks with multiple valid clarifications, making objective evaluation difficult. They address ambiguity rather than underspecified reasoning problems where a single correct clarifying question exists.
Introducing QuestBench: A New Benchmark
QuestBench rigorously evaluates LLMs' ability to identify and acquire missing information by framing underspecified reasoning tasks as Constraint Satisfaction Problems (CSPs) in which the target variable cannot be determined from the information given. The benchmark focuses on "1-sufficient CSPs," where knowing the value of a single unknown variable is enough to solve for the target.
The benchmark covers three domains:
- Logic-Q: logical reasoning
- Planning-Q: block world planning with partial initial states
- GSM-Q/GSME-Q: grade-school math problems in verbal and equation forms
Problems are classified by difficulty along four axes: the number of variables, the number of constraints, the search depth, and the expected number of guesses a brute-force search would need. This classification helps analyze LLMs' reasoning strategies and limitations.
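To make this concrete, the sketch below shows one plausible way a benchmark instance and its difficulty metadata could be represented. The field names and the example problem are illustrative assumptions, not QuestBench's actual schema.

```python
# Hypothetical representation of a QuestBench-style instance; the field
# names and example values are illustrative, not the benchmark's real schema.
from dataclasses import dataclass

@dataclass
class UnderspecifiedInstance:
    domain: str            # "Logic-Q", "Planning-Q", "GSM-Q", or "GSME-Q"
    problem: str           # the underspecified problem shown to the model
    target: str            # the quantity the task ultimately asks for
    gold_question: str     # the single missing piece the model should ask about
    # Difficulty axes:
    num_variables: int
    num_constraints: int
    search_depth: int      # how deep a solver must search to verify sufficiency
    expected_guesses: int  # expected number of guesses for a brute-force searcher

example = UnderspecifiedInstance(
    domain="GSM-Q",
    problem="A baker sells cookies for $2 each. How much money did she make?",
    target="revenue",
    gold_question="How many cookies did she sell?",
    num_variables=3,
    num_constraints=1,
    search_depth=1,
    expected_guesses=1,
)
```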
How QuestBench Works
A CSP is defined as a tuple ⟨X, D, C, A, y⟩, where X is the set of variables, D their domains, C the constraints, A a partial assignment of known values, and y the target variable. A "Known" predicate indicates whether a variable's value can be derived from the current information. A CSP is underspecified when y is not Known given A and C; QuestBench restricts attention to instances that become solvable once the value of one additional variable is learned.
Models are evaluated on their ability to identify the single variable whose value suffices to solve for the target, a task that combines recognizing the information gap with reasoning strategically over the constraints.
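The sketch below illustrates that evaluation target on a toy algebra-style CSP. It is not DeepMind's code: constraints are modeled only by the set of variables they relate, and a constraint is assumed to determine its last unknown variable once all the others are known.

```python
# Toy sketch of the 1-sufficiency check (not QuestBench's implementation).
# A constraint is modeled as the set of variables it relates, and is assumed
# to determine its last unknown variable once all the others are known.
from itertools import chain

def propagate_known(constraints, known):
    """Fixpoint of the 'Known' predicate: a variable becomes known when it
    is the only unknown variable left in some constraint."""
    known = set(known)
    changed = True
    while changed:
        changed = False
        for scope in constraints:
            unknown = [v for v in scope if v not in known]
            if len(unknown) == 1:
                known.add(unknown[0])
                changed = True
    return known

def sufficient_variables(constraints, assigned, target):
    """Return every unknown variable whose value alone makes the target
    derivable -- in a 1-sufficient CSP these are the variables worth asking about."""
    if target in propagate_known(constraints, assigned):
        return []  # already well-specified; no question needed
    all_vars = set(chain.from_iterable(constraints))
    candidates = all_vars - set(assigned) - {target}
    return sorted(v for v in candidates
                  if target in propagate_known(constraints, set(assigned) | {v}))

# Toy instance:  y = a + b  and  c = a + 1,  with only a given.
constraints = [("y", "a", "b"), ("c", "a")]
print(sufficient_variables(constraints, {"a"}, "y"))  # ['b']
```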
Experimental Results
LLMs including GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro/Flash, Gemini 2.0 Flash Thinking Experimental, and open-source Gemma models were tested in zero-shot, chain-of-thought, and four-shot settings. Results show that problems with greater search depth and more complex constraints pose significant challenges. Chain-of-thought prompting improved performance, highlighting the benefit of explicit reasoning.
Gemini 2.0 Flash Thinking Experimental achieved the highest accuracy, especially in planning tasks. Open-source models performed well in logic tasks but struggled with complex math requiring deeper search.
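To illustrate the evaluation settings mentioned above, here is a hypothetical prompt builder for the zero-shot and chain-of-thought variants of the question-asking task; the exact wording used in the paper's evaluation may differ.

```python
# Hypothetical prompt construction for the question-asking task; the exact
# wording used in QuestBench's evaluation may differ.
def build_prompt(problem: str, chain_of_thought: bool = False) -> str:
    instruction = (
        "The problem above cannot be solved as stated because one piece of "
        "information is missing. State the single clarifying question you "
        "would ask in order to solve it."
    )
    if chain_of_thought:
        instruction = "Think step by step. " + instruction
    return f"{problem}\n\n{instruction}"

print(build_prompt(
    "A baker sells cookies for $2 each. How much money did she make?",
    chain_of_thought=True,
))
```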
Insights and Future Directions
QuestBench reveals that while current LLMs can handle simple algebraic problems, they struggle with complex logic and planning tasks when information is missing, and performance drops as problem complexity increases. Effective reasoning is necessary but not sufficient: better recognition of information gaps and better generation of clarifying questions remain key areas for advancement.