QwenLong-L1: Advancing Long-Context Reasoning in Large Language Models with Reinforcement Learning
QwenLong-L1 introduces a structured reinforcement learning approach enabling large language models to excel at long-context reasoning tasks, achieving state-of-the-art results on multiple benchmarks.
Challenges in Long-Context Reasoning for Large Language Models
Large reasoning models (LRMs) have demonstrated impressive abilities in handling short-context reasoning tasks using reinforcement learning (RL). However, these successes do not easily translate to long-context scenarios where the input sequences exceed 100,000 tokens. Applications such as multi-document question answering, research synthesis, and legal or financial analysis require deep reasoning over extensive text. RL optimization in these long-context settings faces challenges including slower reward convergence, unstable policy updates due to fluctuations in KL divergence, and reduced exploration caused by entropy collapse. These issues highlight a key difficulty in extending LRMs from short to long-context proficiency.
Introducing QwenLong-L1: A Structured Reinforcement Learning Framework
To overcome these limitations, the Qwen Research team presents QwenLong-L1, a novel reinforcement learning framework specifically designed for long-context reasoning adaptation. The framework consists of three main stages:
- Warm-up Supervised Fine-Tuning (SFT): This stage initializes the policy model by training it on curated question-context-answer triplets, establishing foundational skills in understanding context and extracting answers.
- Curriculum-Guided Phased Reinforcement Learning: The model undergoes staged training with progressively longer context lengths, allowing it to gradually develop long-context reasoning capabilities while maintaining stable policy updates.
- Difficulty-Aware Retrospective Sampling: This technique improves exploration by preserving and reusing difficult examples from earlier phases, weighted by their difficulty level, which promotes more robust reasoning across diverse inputs (see the sketch after this list).
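The retrospective sampling step can be pictured as a weighted replay over earlier-phase examples. Below is a minimal Python sketch of that idea; the function name, the `pass_rate` bookkeeping, and the choice of difficulty as one minus the historical pass rate are illustrative assumptions, not the released implementation.

```python
import random

def retrospective_sample(current_phase, prior_phases, k=32):
    """Mix current-phase examples with hard examples replayed from earlier phases.

    Each example is a dict with a 'pass_rate' field recorded during earlier
    training (the fraction of sampled rollouts that earned reward). Difficulty
    is taken as 1 - pass_rate, so examples the policy rarely solves are
    replayed more often. Illustrative sketch, not the paper's exact scheme.
    """
    pool = [ex for phase in prior_phases for ex in phase]
    weights = [1.0 - ex["pass_rate"] for ex in pool]
    if pool and sum(weights) > 0:
        replayed = random.choices(pool, weights=weights, k=k)
    else:
        replayed = []
    return current_phase + replayed
```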
Hybrid reward mechanisms combine rule-based exact match verification with semantic evaluation by a lightweight language model to ensure both precision and recall during policy training.
Technical Innovations and Methodology
QwenLong-L1 leverages recent advancements in group-relative reinforcement learning optimization, such as GRPO and DAPO, to reduce computational costs associated with long-context value estimation:
- GRPO (Group Relative Policy Optimization): Normalizes rewards within sampled groups to estimate advantages without requiring a separate value network, encouraging diverse output patterns (see the sketch after this list).
- DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization): Employs dynamic sampling, overlength penalty shaping, and asymmetric clipping thresholds to prevent entropy collapse and reduce length bias during training.
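For the GRPO item above, the core computation is the group-relative advantage: rewards for a group of rollouts sampled from the same prompt are standardized against the group's mean and standard deviation, removing the need for a learned value function. A minimal sketch, assuming one scalar reward per rollout:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage estimate for one prompt.

    `rewards` holds the scalar reward of each of the G rollouts sampled for
    the same prompt. Each rollout's advantage is its reward standardized
    against the group statistics, so no value network is needed.
    """
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: 4 rollouts for one long-context question with graded rewards.
print(group_relative_advantages([1.0, 0.0, 0.5, 0.0]))
```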
The reward function integrates two signals: a deterministic rule-based exact match and a semantic judgment from a compact evaluator model (e.g., Qwen2.5-1.5B). This hybrid approach prevents overfitting to rigid formats while preserving answer correctness across different notations and phrasings.
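A minimal sketch of such a hybrid reward is shown below, taking the maximum of a rule-based exact-match check and a small LLM judge's verdict. The `llm_judge` callable and the normalization rules are assumptions for illustration; the max-combination is one natural way to balance the precision of exact matching with the recall of semantic judgment, as the article describes.

```python
import re
from typing import Callable

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace for exact-match comparison."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", text.lower())).strip()

def hybrid_reward(prediction: str, reference: str,
                  llm_judge: Callable[[str, str], bool]) -> float:
    """Combine a rule-based exact match with a lightweight LLM judgment.

    `llm_judge` is assumed to wrap a compact evaluator model (e.g. a
    Qwen2.5-1.5B endpoint) that returns True when prediction and reference
    are semantically equivalent. Taking the maximum keeps strict matches
    cheap while letting the judge credit correct answers written in a
    different notation or phrasing.
    """
    rule_reward = 1.0 if normalize(prediction) == normalize(reference) else 0.0
    model_reward = 1.0 if llm_judge(prediction, reference) else 0.0
    return max(rule_reward, model_reward)
```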
The framework uses progressive context scaling, moving from 20,000-token inputs to 60,000-token inputs in phases, which stabilizes training dynamics and supports policy generalization.
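The progressive scaling can be expressed as a simple phase schedule. In the sketch below, only the 20K and 60K input lengths come from the text; the step counts and the two-phase split are illustrative placeholders.

```python
from dataclasses import dataclass

@dataclass
class RLPhase:
    max_input_tokens: int   # context-length cap enforced during this phase
    steps: int              # optimizer steps before advancing to the next phase

# Two-phase curriculum matching the 20K -> 60K progression described above.
# Step counts are placeholders, not values from the paper.
curriculum = [
    RLPhase(max_input_tokens=20_000, steps=500),
    RLPhase(max_input_tokens=60_000, steps=500),
]

def phase_for_step(step: int) -> RLPhase:
    """Return the active phase for a given global training step."""
    budget = 0
    for phase in curriculum:
        budget += phase.steps
        if step < budget:
            return phase
    return curriculum[-1]
```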
Benchmark Performance and Experimental Results
QwenLong-L1 was tested on seven long-context document QA benchmarks, including DocMath, Frames, 2WikiMultihopQA, HotpotQA, Musique, NarrativeQA, and Qasper. The 32-billion parameter model, QwenLong-L1-32B, showed strong results:
- It outperformed baseline models like R1-Distill-Qwen-32B by 5.1 points.
- It exceeded leading systems such as OpenAI-o3-mini and Qwen3-235B-A22B.
- Its performance was comparable to Claude-3.7-Sonnet-Thinking, indicating competitive reasoning under extreme context lengths.
Pass@K analysis demonstrated consistent improvements with increased sampling, achieving a Pass@2 average of 73.7, surpassing DeepSeek-R1 and OpenAI-o1-preview even at low sampling rates.
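For reference, Pass@K is usually reported with the standard unbiased estimator below (generate n samples per problem and count the c correct ones); the article does not state which estimator was used, so this is the conventional formulation rather than a confirmed detail of the evaluation.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K estimator: probability that at least one of k samples
    drawn without replacement from n generations is correct, given that c of
    the n generations were correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 8 rollouts per question, 3 correct, evaluated at k=2.
print(round(pass_at_k(n=8, c=3, k=2), 3))
```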
Ablation Studies and Emergent Reasoning Behaviors
Ablation experiments confirmed the importance of each component: supervised fine-tuning, phased RL, and retrospective sampling. Reinforcement learning was especially crucial for enabling emergent reasoning skills such as grounding, subgoal setting, verification, and backtracking — capabilities not effectively induced by supervised fine-tuning alone.
QwenLong-L1 provides a systematic approach to equip large reasoning models with robust long-context capabilities, bridging short-context expertise and information-dense task demands through supervised initialization, curriculum-driven scaling, and hybrid evaluation strategies.