RA3: Temporal Action Abstractions to Speed Up RL Post-Training in Code LLMs
Overview
A new research paper from Apple formalizes what mid-training should accomplish before reinforcement learning (RL) post-training in code large language models (LLMs). The authors introduce RA3 (Reasoning as Action Abstractions), an EM-style procedure that discovers temporally consistent latent actions from expert traces and then fine-tunes the model on those bootstrapped traces. The core claim is that effective mid-training both prunes the action space to a compact, near-optimal subset and shortens the effective planning horizon, which together improve RL convergence.
What mid-training should do
The paper breaks mid-training effects into two determinants:
- Pruning efficiency: how well mid-training selects a compact near-optimal action subset that shapes the model’s initial policy prior.
- RL convergence: how quickly post-training improves policy performance restricted to that subset.
The analysis argues mid-training is most effective when the decision space is compact and the effective horizon is short. This favors learning temporal abstractions (higher-level actions spanning multiple tokens) over relying solely on primitive next-token actions.
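To make the compact-space, short-horizon intuition concrete, here is a toy back-of-the-envelope sketch. All numbers are invented for illustration (they are not from the paper); the only point is that grouping tokens into temporally extended actions shrinks both the branching factor and the number of sequential decisions that RL must assign credit across.

```python
import math

# Toy illustration (numbers are invented, not from the paper): how temporal
# abstractions shrink both the decision space and the effective planning horizon.
token_vocab_size = 32_000          # branching factor at the primitive token level
tokens_per_solution = 400          # assumed solution length in tokens
abstract_action_count = 50         # hypothetical compact set of latent actions
tokens_per_abstract_action = 20    # each latent action expands to ~20 tokens

# Effective horizon = number of sequential decisions the policy must get right.
token_horizon = tokens_per_solution                                   # 400 decisions
abstract_horizon = tokens_per_solution // tokens_per_abstract_action  # 20 decisions

# Rough (log10) size of the trajectory space RL credit assignment has to cover.
token_log_space = token_horizon * math.log10(token_vocab_size)
abstract_log_space = abstract_horizon * math.log10(abstract_action_count)

print(f"token level:       {token_horizon} decisions, ~10^{token_log_space:.0f} trajectories")
print(f"abstraction level: {abstract_horizon} decisions, ~10^{abstract_log_space:.0f} trajectories")
```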
The RA3 algorithm (one pass)
RA3 derives a sequential variational lower bound, described as a temporal ELBO, and optimizes it with an EM-like loop:
- E-step (latent discovery): use RL to discover temporally consistent latent abstractions that explain the expert traces.
- M-step (model update): perform next-token prediction on the bootstrapped, latent-annotated traces to integrate those abstractions into the model’s policy.
This one-pass procedure discovers persistent, temporally extended actions from demonstrations and then fine-tunes the model so those actions become part of the model’s predictive behavior.
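For orientation, a generic sequential variational lower bound over a latent action sequence z has the shape below; the paper's temporal ELBO is a specific instantiation of this idea, and the exact objective is given in the paper.

```latex
\log p_\theta(y \mid x) \;\ge\;
\mathbb{E}_{q_\phi(z \mid x, y)}\!\left[\sum_{t=1}^{T} \log p_\theta(y_t \mid y_{<t}, z, x)\right]
\;-\; \mathrm{KL}\!\left(q_\phi(z \mid x, y) \,\|\, p_\theta(z \mid x)\right)
```

And here is a minimal structural sketch of the EM-style loop. Everything below is hypothetical scaffolding rather than the paper's implementation: the callables propose_latents, score_latents, rl_update, and sft_update are placeholders that a caller would supply; only the E-step/M-step structure mirrors the description above.

```python
from typing import Callable, Sequence, Tuple

# Hedged sketch of an RA3-style mid-training pass. The helper callables are
# hypothetical placeholders, not the paper's API.
def em_mid_training(
    model,
    expert_traces: Sequence[Tuple[str, str]],  # (prompt, expert solution) pairs
    propose_latents: Callable,   # samples a candidate latent segmentation z
    score_latents: Callable,     # ELBO-style score: how well z explains the trace
    rl_update: Callable,         # policy-gradient step on the latent proposer
    sft_update: Callable,        # next-token prediction step on annotated traces
    num_candidates: int = 4,
):
    annotated = []

    # E-step (latent discovery): infer temporally consistent latent actions
    # for each expert trace, improving the proposer with an RL objective.
    for prompt, solution in expert_traces:
        candidates = [propose_latents(model, prompt, solution) for _ in range(num_candidates)]
        scores = [score_latents(model, prompt, solution, z) for z in candidates]
        rl_update(model, candidates, scores)
        best_z = candidates[scores.index(max(scores))]
        annotated.append((prompt, best_z, solution))

    # M-step (model update): plain next-token prediction on the bootstrapped,
    # latent-annotated traces, folding the abstractions into the policy prior.
    for prompt, z, solution in annotated:
        sft_update(model, prompt, latents=z, target=solution)

    return model
```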
Empirical results on code generation and RLVR
On Python code-generation tasks across multiple base models, RA3 improves average pass@k on HumanEval and MBPP by roughly 8 points over the base model and roughly 4 points over a next-token-prediction (NTP) mid-training baseline. Used to initialize RLVR (reinforcement learning with verifiable rewards) post-training, RA3 also yields faster convergence and higher final performance on HumanEval+, MBPP+, LiveCodeBench, and Codeforces.
These gains reflect both mid-training benefits (better priors via abstractions) and improved post-training dynamics (faster, more stable RL optimization within a pruned action space).
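For reference, the pass@k numbers reported on HumanEval and MBPP are typically computed with the unbiased estimator introduced alongside HumanEval (Chen et al., 2021); a compact implementation:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: completions sampled per problem, c: completions that pass the unit tests,
    k: evaluation budget. Returns the estimated probability that at least one
    of k sampled completions passes.
    """
    if n - c < k:
        return 1.0  # every size-k subset necessarily contains a passing sample
    # 1 - C(n-c, k) / C(n, k), computed as a running product for stability
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))
```

For example, with n = 200 samples per problem of which c = 20 pass, pass_at_k(200, 20, 1) is 0.1 and pass_at_k(200, 20, 10) is roughly 0.66.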
Key takeaways
- The paper formalizes mid-training effects via pruning efficiency and RL convergence, showing both matter for downstream RL success.
- RA3 operationalizes temporal action abstractions through a temporal ELBO optimized with an EM-style loop: RL-driven latent discovery followed by next-token fine-tuning on bootstrapped traces.
- Empirically, RA3 delivers consistent improvements on code-generation benchmarks and, when used as the initialization for post-training, both accelerates RLVR convergence and raises its asymptotic performance.
For full technical details, see the paper on arXiv: https://arxiv.org/pdf/2509.25810