Bridging the Gap Between Knowing and Doing: How Google DeepMind Enhances LLM Decision-Making with Reinforcement Learning
Google DeepMind researchers developed a reinforcement learning fine-tuning method that significantly improves large language models' ability to act on their reasoning, reducing the gap between knowledge and action.
The Challenge of Decision-Making in Large Language Models
Large language models (LLMs) trained on extensive datasets have shown remarkable abilities in language understanding and generation. Beyond language tasks, they also have potential as decision-making agents in interactive environments. However, even when LLMs reason accurately about a problem, they often fail to translate that knowledge into effective actions, a phenomenon known as the knowing-doing gap. They also exhibit greediness, prematurely committing to the highest-reward action seen so far instead of exploring alternatives, and frequency bias, in which smaller models copy whichever action appears most frequently in the context regardless of its reward.
Existing Strategies and Their Limitations
Classical reinforcement learning techniques, such as the Upper Confidence Bound (UCB) bandit algorithm, balance exploration and exploitation explicitly, while approaches like in-context learning and behavior cloning imitate expert behavior but tend to reinforce the model's existing decision biases. Applied to LLM agents, these methods have improved decision-making only marginally and offer no reliable mechanism for converting internal reasoning into optimal actions, especially in complex or stochastic environments.
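For context, the UCB baseline referenced above can be summarized in a few lines. The sketch below is a generic UCB1 implementation with a synthetic bandit for illustration, not code from the paper; the `pull_arm` callback and the exploration constant `c` are assumptions chosen for the example.

```python
import math
import random

def ucb1_bandit(pull_arm, n_arms, n_steps, c=2.0):
    """Minimal UCB1: pick the arm with the highest upper confidence bound."""
    counts = [0] * n_arms      # times each arm has been pulled
    values = [0.0] * n_arms    # running mean reward per arm

    for t in range(1, n_steps + 1):
        if t <= n_arms:        # pull every arm once before using the bound
            arm = t - 1
        else:
            arm = max(
                range(n_arms),
                key=lambda a: values[a] + math.sqrt(c * math.log(t) / counts[a]),
            )
        reward = pull_arm(arm)
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
    return values, counts

# Illustrative usage with a synthetic 10-armed bandit.
true_means = [random.random() for _ in range(10)]
values, counts = ucb1_bandit(lambda a: random.gauss(true_means[a], 0.1), 10, 1000)
```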
Reinforcement Learning Fine-Tuning (RLFT) Approach
Researchers from Google DeepMind and the LIT AI Lab at JKU Linz introduced a method called Reinforcement Learning Fine-Tuning (RLFT) to refine language-model behavior. RLFT uses the model's self-generated Chain-of-Thought (CoT) rationales as training signals: the model is rewarded according to the actions it takes after each reasoning step, and so learns to prefer decisions that are both logically sound and yield high returns in practice. This links the model's reasoning directly to environmental feedback, narrowing the gap between thought and behavior.
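A minimal sketch of what one RLFT interaction step might look like, assuming a policy-gradient setup; `llm`, `env`, `format_history`, `parse_action`, and `policy_gradient_update` are hypothetical stand-ins rather than APIs from the paper or any specific library.

```python
# Hypothetical helpers: `llm`, `env`, `format_history`, `parse_action`, and
# `policy_gradient_update` stand in for a real model, environment, and optimizer.
def rlft_step(llm, env, history, history_window=10):
    # Prompt = task instruction + recent action-reward history.
    prompt = env.instruction() + format_history(history[-history_window:])
    rollout = llm.generate(prompt)   # self-generated CoT rationale + chosen action
    action = parse_action(rollout)   # extract the action token(s) from the rollout
    reward = env.step(action)        # environment feedback for that action
    history.append((action, reward))
    # Reinforce the generated rationale-and-action sequence in proportion to the
    # reward, so reasoning that leads to high-return actions becomes more likely.
    policy_gradient_update(llm, prompt, rollout, reward)
    return reward
```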
Methodology Details
The RLFT process involves token-based fine-tuning through environment interactions. At each step, the model receives an instruction and recent action-reward history, then generates a sequence containing its rationale and chosen action. These outputs are assessed based on environmental rewards and adherence to the desired format, with penalties applied for invalid actions. Reward shaping encourages consistent formatting and exploration.
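As a rough illustration of the reward shaping described above, the helper below combines the environment reward with a penalty for missing or illegal actions. The `<action>` tag format and the -5.0 penalty are assumptions made for the example, not the paper's exact scheme.

```python
import re

def shaped_reward(output: str, env_reward: float, legal_actions: set,
                  format_penalty: float = -5.0) -> float:
    """Return the environment reward if the output contains a valid action,
    otherwise a fixed format penalty (illustrative values only)."""
    match = re.search(r"<action>(.*?)</action>", output, re.DOTALL)
    if match is None:
        return format_penalty            # no parseable action at all
    action = match.group(1).strip()
    if action not in legal_actions:
        return format_penalty            # parseable but illegal action
    return env_reward                    # valid action: pass the reward through

# Illustrative usage.
legal = {"arm_1", "arm_2"}
print(shaped_reward("arm_1 looks best. <action>arm_1</action>", 1.0, legal))  # 1.0
print(shaped_reward("The best arm is arm_1.", 1.0, legal))                    # -5.0
```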
For variable-length episodes such as Tic-tac-toe games, Monte Carlo baseline estimates and generalized advantage estimation (GAE) are used to reduce the variance of the policy-gradient updates, enabling the model to learn from decision sequences of differing lengths.
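Below is a minimal, self-contained sketch of generalized advantage estimation over a single episode; the discount factor and lambda are generic defaults, and the example rewards and baseline values are made up for illustration.

```python
def generalized_advantage_estimates(rewards, values, gamma=0.99, lam=0.95):
    """Compute GAE advantages for one episode.

    `rewards[t]` is the reward received after step t and `values[t]` is the
    baseline value estimate of the state at step t; `values` has one extra
    entry for the terminal state (0 for episodic tasks like Tic-tac-toe).
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

# Illustrative use: a 3-step episode with a simple baseline.
rewards = [0.0, 0.0, 1.0]
values = [0.2, 0.4, 0.6, 0.0]   # last entry is the terminal-state value
print(generalized_advantage_estimates(rewards, values))
```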
Significant Performance Improvements
RLFT significantly enhanced decision-making in the tested models. In a 10-armed bandit environment, action coverage for a 2-billion-parameter model rose from 40% to over 52% after 30,000 updates, and frequency bias fell from 70% to 35%. In Tic-tac-toe, the 2B model's win rate against a random opponent jumped from 15% to 75%, and it reached draws against an optimal Monte Carlo Tree Search agent, improving its average return from -0.95 to 0.0. The larger 27B model generated correct rationales 87% of the time yet, before fine-tuning, chose the optimal action only 21% of the time; RLFT substantially narrowed this discrepancy.
Implications for Future AI Agents
This research highlights the importance of connecting reasoning and action in LLMs to build reliable decision-making agents. By reinforcing successful behaviors and addressing common decision errors, RLFT offers a practical path toward more capable and autonomous AI systems based on large language models.
For more details, see the original paper.