
AbstRaL: Boosting LLM Robustness with Abstract Reasoning and Reinforcement Learning

AbstRaL uses reinforcement learning to teach LLMs abstract reasoning, significantly improving their robustness and accuracy on varied GSM8K math problems compared to traditional methods.

Challenges in LLM Reasoning Robustness

Recent studies reveal that large language models (LLMs), especially smaller ones, often struggle to reason consistently across variations of the same problem. While they perform well on familiar questions, slight alterations such as changing names or numbers, or adding irrelevant details, cause their accuracy to drop significantly. This problem, known as poor out-of-distribution (OOD) generalization, limits the reliability of LLMs even on straightforward math tasks.

Abstract Reasoning as a Solution

A promising approach to improving robustness is teaching LLMs to focus on the underlying logic of problems rather than surface details. One common tactic is to generate synthetic variations of reasoning problems so that models learn abstract reasoning patterns instead of memorizing surface forms. This is critical for building more generalizable and dependable AI systems.
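To make the idea concrete, the short Python sketch below generates surface variants of one toy word problem by swapping names, numbers, and an irrelevant detail. The template, names, and distractors are illustrative placeholders, not the augmentation pipeline or data used in the paper.

```python
# Hypothetical illustration of surface-level variation: the same underlying
# problem rendered with different names, numbers, and an irrelevant detail.
import random

TEMPLATE = (
    "{name} has {a} apples and buys {b} more. {distractor} "
    "How many apples does {name} have now?"
)
NAMES = ["Maya", "Tom", "Priya", "Diego"]
DISTRACTORS = ["", "Her brother prefers oranges.", "It is raining outside."]

def make_variant(rng: random.Random) -> tuple[str, int]:
    """Return one surface variant of the problem and its gold answer."""
    a, b = rng.randint(2, 50), rng.randint(2, 50)
    question = TEMPLATE.format(
        name=rng.choice(NAMES), a=a, b=b, distractor=rng.choice(DISTRACTORS)
    )
    return " ".join(question.split()), a + b  # normalize spacing

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        q, ans = make_variant(rng)
        print(q, "->", ans)
```

The underlying logic (simple addition) never changes; only the surface form does, which is exactly the kind of variation that trips up models relying on memorized patterns.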

The AbstRaL Method

Researchers from Apple and EPFL introduced AbstRaL, a novel method that employs reinforcement learning to teach LLMs abstract reasoning. Unlike traditional data augmentation, which is computationally expensive, AbstRaL trains models to recognize and apply symbolic reasoning patterns. It connects abstract patterns to symbolic tools, enabling more consistent and context-independent problem-solving.

Four Key Steps in AbstRaL

  1. Symbolic Variable Replacement: Key variables in questions are identified and replaced with symbolic placeholders.
  2. Training with GranulAR Data: Models learn step-by-step reasoning using specially designed abstract symbolic data called GranulAR.
  3. Abstract Structure Retrieval: The model extracts the general reasoning structure from the symbolic representation.
  4. Answer Computation: The abstraction is combined with the original values to compute the correct answer (a sketch of the full four-step flow follows below).
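The sketch below mirrors the shape of this four-step flow on a toy problem. It assumes a regex-based abstraction step and a hard-coded stand-in for the trained model's abstract reasoning; it is a minimal illustration under those assumptions, not AbstRaL's actual implementation.

```python
# Minimal sketch of the four-step flow; the regex, placeholder names, and the
# solve_abstractly stub are illustrative, not AbstRaL's real components.
import re

def abstract_question(question: str) -> tuple[str, dict[str, int]]:
    """Step 1: replace concrete numbers with symbolic placeholders x0, x1, ..."""
    bindings: dict[str, int] = {}
    def repl(match: re.Match) -> str:
        sym = f"x{len(bindings)}"
        bindings[sym] = int(match.group())
        return sym
    return re.sub(r"\d+", repl, question), bindings

def solve_abstractly(abstract_q: str) -> str:
    """Steps 2-3: a GranulAR-trained model would emit the abstract reasoning
    structure; here it is hard-coded for the toy 'buys more' problem."""
    return "x0 + x1"

def compute_answer(structure: str, bindings: dict[str, int]) -> int:
    """Step 4: ground the abstraction with the original values."""
    return eval(structure, {"__builtins__": {}}, bindings)

question = "Maya has 12 apples and buys 7 more. How many apples does she have now?"
abstract_q, bindings = abstract_question(question)
answer = compute_answer(solve_abstractly(abstract_q), bindings)
print(abstract_q)  # Maya has x0 apples and buys x1 more. ...
print(bindings)    # {'x0': 12, 'x1': 7}
print(answer)      # 19
```

Because the reasoning is expressed over symbols rather than the original numbers, the same abstract structure applies to any surface variant of the problem.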

Reinforcement learning optimizes two rewards: one for answer correctness and another for symbolic similarity, enhancing the model's ability to generate accurate and generalizable reasoning patterns.
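A hedged sketch of such a two-part reward is shown below; the equal weighting and the token-overlap similarity measure are placeholder choices for illustration, not the paper's exact formulation.

```python
# Sketch of a two-part RL reward: answer correctness plus symbolic similarity.
# The alpha weighting and Jaccard overlap are illustrative assumptions.
def correctness_reward(predicted_answer: float, gold_answer: float) -> float:
    """1.0 if the grounded answer matches the reference, else 0.0."""
    return 1.0 if abs(predicted_answer - gold_answer) < 1e-9 else 0.0

def symbolic_similarity(predicted_structure: str, gold_structure: str) -> float:
    """Crude proxy: token-level Jaccard overlap between the two abstractions."""
    pred, gold = set(predicted_structure.split()), set(gold_structure.split())
    return len(pred & gold) / len(pred | gold) if pred | gold else 0.0

def total_reward(pred_ans, gold_ans, pred_struct, gold_struct, alpha=0.5):
    """Weighted combination used as the scalar reward signal during training."""
    return (alpha * correctness_reward(pred_ans, gold_ans)
            + (1 - alpha) * symbolic_similarity(pred_struct, gold_struct))

print(total_reward(19, 19, "x0 + x1", "x0 + x1"))  # 1.0: correct and faithful
print(total_reward(18, 19, "x0 + x1", "x0 * x1"))  # 0.25: wrong answer, partial overlap
```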

Robustness on GSM Benchmarks

AbstRaL was evaluated on GSM8K math reasoning tasks using models like Llama-3 and Qwen2. By training on GranulAR, models focus on problem structure instead of superficial features. Tests on altered GSM8K problems — with changed numbers, names, and phrasing — showed that AbstRaL outperforms standard Chain-of-Thought prompting methods. It maintains higher accuracy and consistency, particularly benefiting smaller LLMs. These results demonstrate that abstract reasoning training fosters adaptability and reduces reliance on memorized patterns.
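One way to quantify this kind of robustness is the gap between accuracy on original questions and accuracy on perturbed variants. The sketch below illustrates that measurement with a toy model stub and made-up examples; it is not the actual GSM8K evaluation harness.

```python
# Hedged sketch of measuring a robustness gap: accuracy on original questions
# minus accuracy on perturbed variants. ask_model is a placeholder for any
# LLM call; the example data is illustrative, not GSM8K itself.
from typing import Callable

def robustness_gap(
    ask_model: Callable[[str], int],
    originals: list[tuple[str, int]],
    perturbed: list[tuple[str, int]],
) -> float:
    def acc(items):
        return sum(ask_model(q) == gold for q, gold in items) / len(items)
    return acc(originals) - acc(perturbed)

# Toy "model" that only handles the original phrasing (fully brittle):
originals = [("Maya has 12 apples and buys 7 more. How many now?", 19)]
perturbed = [("Tom has 31 pears and buys 4 more. How many now?", 35)]
brittle_model = lambda q: 19  # memorized the original answer
print(robustness_gap(brittle_model, originals, perturbed))  # 1.0: maximally brittle
```

A smaller gap indicates that the model's reasoning transfers across surface changes, which is the behavior abstract reasoning training aims to produce.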

Implications for Future AI Development

AbstRaL highlights the effectiveness of reinforcement learning combined with symbolic abstraction in enhancing LLM reasoning robustness. Its approach surpasses traditional fine-tuning or data augmentation by teaching models to disregard surface distractions and focus on core logic. This method paves the way for more reliable and generalizable AI capable of tackling diverse and variable real-world problems.
