RA3: Temporal Action Abstractions to Speed Up RL Post-Training in Code LLMs

Overview

A new research paper from Apple formalizes what mid-training should accomplish before reinforcement learning (RL) post-training in code large language models (LLMs). The authors introduce RA3 (Reasoning as Action Abstractions), an EM-style procedure that discovers temporally consistent latent actions from expert traces and then fine-tunes the model on those bootstrapped traces. The core claim is that effective mid-training both prunes the action space to a compact, near-optimal subset and shortens the effective planning horizon, which together improve RL convergence.

What mid-training should do

The paper breaks mid-training effects into two determinants:

- Pruning efficiency: how well mid-training selects a compact, near-optimal subset of actions, which shapes the prior the RL stage starts from.
- Impact on RL convergence: how quickly post-training improves within that pruned subspace, which depends on the effective planning horizon.

The analysis argues mid-training is most effective when the decision space is compact and the effective horizon is short. This favors learning temporal abstractions (higher-level actions spanning multiple tokens) over relying solely on primitive next-token actions.
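
To make these two quantities concrete, here is a hypothetical back-of-the-envelope comparison; all numbers are invented for illustration and do not come from the paper. Grouping tokens into temporally extended actions shrinks both the per-step branching factor and the number of decision steps, so the trajectory space shrinks multiplicatively in both.

```python
# Hypothetical illustration of why pruning plus temporal abstraction helps.
# Every number below is assumed, not taken from the paper.
import math

vocab_size = 32_000        # primitive action space: one token per decision
horizon_tokens = 512       # tokens needed to complete a solution

num_latents = 64           # assumed compact set of latent actions
avg_span = 8               # assumed tokens covered per latent action

# log10 of the number of distinct trajectories in each decision problem
primitive = horizon_tokens * math.log10(vocab_size)
abstract = (horizon_tokens / avg_span) * math.log10(num_latents)

print(f"primitive: ~10^{primitive:.0f} trajectories over {horizon_tokens} steps")
print(f"abstract:  ~10^{abstract:.0f} trajectories over {horizon_tokens // avg_span} steps")
```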

The RA3 algorithm (one pass)

RA3 derives a sequential variational lower bound, described as a temporal ELBO, and optimizes it with an EM-like loop:

- E-step: infer temporally consistent latent actions from the expert traces (the paper does this with RL), segmenting each trace into persistent higher-level steps.
- M-step: fine-tune the model on the resulting bootstrapped, latent-annotated traces with next-token prediction.

This one-pass procedure discovers persistent, temporally extended actions from demonstrations and then fine-tunes the model so those actions become part of the model's predictive behavior.
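
A schematic, runnable toy of the loop follows. The ToyModel class and its methods are stand-ins invented here for illustration: the paper's E-step optimizes the temporal ELBO with RL, which is reduced below to a trivial fixed-width segmentation so the control flow can execute end to end.

```python
from dataclasses import dataclass, field


@dataclass
class ToyModel:
    span: int = 4                      # tokens each latent action covers
    memory: list = field(default_factory=list)

    def segment(self, trace):
        """E-step stand-in: chunk a token trace into (latent_id, tokens)
        pairs. RA3 instead infers temporally consistent latents via RL."""
        return [(i // self.span, trace[i:i + self.span])
                for i in range(0, len(trace), self.span)]

    def finetune_ntp(self, annotated_traces):
        """M-step stand-in: 'fine-tune' by memorizing the annotated traces.
        RA3 instead runs next-token-prediction training on them."""
        self.memory = annotated_traces
        return self


def ra3_mid_training(model, expert_traces, num_em_iterations=3):
    """One pass of RA3 mid-training: alternate latent-action discovery (E)
    with fine-tuning on the bootstrapped, annotated traces (M)."""
    for _ in range(num_em_iterations):
        annotated = [model.segment(t) for t in expert_traces]  # E-step
        model = model.finetune_ntp(annotated)                  # M-step
    return model


traces = [["def", "f", "(", "x", ")", ":", "return", "x"]]
trained = ra3_mid_training(ToyModel(), traces)
print(trained.memory[0])  # [(0, ['def','f','(','x']), (1, [')',':','return','x'])]
```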

Empirical results on code generation and RLVR

On Python code-generation tasks across multiple base models, RA3 improves average pass@k on HumanEval and MBPP by approximately +8 and +4 points, respectively, over both the base model and a next-token-prediction (NTP) mid-training baseline. When used to initialize post-training with RLVR (reinforcement learning with verifiable rewards), RA3 also yields faster convergence and higher final performance on HumanEval+, MBPP+, LiveCodeBench, and Codeforces.

These gains reflect both mid-training benefits (better priors via abstractions) and improved post-training dynamics (faster, more stable RL optimization within a pruned action space).
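
For reference, the pass@k numbers above use the standard functional-correctness metric; the widely used unbiased estimator (introduced with HumanEval by Chen et al., 2021) computes, from n generated samples of which c pass the tests, the probability that at least one of k drawn samples passes. The example values below are illustrative only.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n - c, k) / C(n, k) for n samples, c correct."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples: every k-subset passes
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative: 200 samples per problem, 37 of them correct, estimate pass@10.
print(round(pass_at_k(n=200, c=37, k=10), 4))
```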

Key takeaways

- Mid-training matters most when it prunes the action space to a compact, near-optimal subset and shortens the effective planning horizon; temporal abstractions deliver both where primitive next-token actions do not.
- RA3 operationalizes this with a temporal ELBO optimized in an EM-style loop: infer temporally consistent latent actions from expert traces, then fine-tune on the bootstrapped traces.
- Empirically, RA3 lifts average pass@k by roughly +8 (HumanEval) and +4 (MBPP) points over base and NTP mid-training baselines, and as an RLVR initialization it converges faster and reaches higher final performance on HumanEval+, MBPP+, LiveCodeBench, and Codeforces.

For full technical details, see the paper on arXiv: https://arxiv.org/pdf/2509.25810