CWM by Meta FAIR: A 32B Open-Weights LLM That Learns Code by Predicting Execution
What CWM is and why it matters
Meta FAIR released the Code World Model (CWM), a 32-billion-parameter dense, decoder-only transformer that injects world modeling into code generation. Instead of training only on static source text, CWM is mid-trained on execution traces and long-horizon agent–environment interactions to teach the model how program state evolves during execution.
How CWM learns code differently
CWM mid-training relies on two families of observation–action trajectories. The first is Python interpreter traces, which record local variable state after each executed line and teach the model the semantics of state transitions. The second is agentic interactions captured inside Dockerized repositories, including edits, shell commands, and test feedback, which teach multi-step tool use and repository-level reasoning.
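The first family of trajectories, interpreter-level traces of local variable state, can be approximated with Python's built-in tracing hook. Below is a minimal sketch using `sys.settrace`; the actual trace schema CWM trains on may differ.

```python
import sys

def trace_locals(func, *args):
    """Record (relative line number, local variables) at each line event
    while func runs, approximating an observation-action trajectory.
    Illustrative only; CWM's real trace format may differ."""
    steps = []

    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is func.__code__:
            rel_line = frame.f_lineno - func.__code__.co_firstlineno
            steps.append((rel_line, dict(frame.f_locals)))
        return tracer

    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(None)  # always unhook the tracer
    return result, steps

def gcd(a, b):
    while b:
        a, b = b, a % b
    return a

result, steps = trace_locals(gcd, 12, 8)
```

Each entry in `steps` pairs a source line with the locals visible at that point, so the sequence of entries shows exactly how `a` and `b` evolve, which is the kind of state-transition signal described above.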
Data collection and ForagerAgent
To scale the dataset, the team built executable repository images from thousands of GitHub projects and used a software-engineering agent called ForagerAgent to gather multi-step trajectories. The release reports about 3 million trajectories across roughly 10k images and 3.15k repositories, including mutate-fix and issue-fix variants to expose the model to realistic development workflows.
Model architecture and context window
CWM is a dense transformer with 64 layers, GQA (48Q/8KV), SwiGLU activations, RMSNorm, and Scaled RoPE positional encoding. Attention alternates between local 8k and global 131k sliding-window blocks, yielding an effective 131k-token context window. Training uses document-causal masking to support long-context reasoning.
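The alternating attention pattern can be made concrete with a small sketch. It assumes every fourth layer attends globally (the 3:1 local:global interleave described in the release); the exact ordering within each group of four layers is an assumption here.

```python
def attention_layout(num_layers=64, local_window=8192, global_window=131072):
    """Per-layer sliding-window sizes for a 3:1 local:global interleave.
    The position of the global layer within each group of four is an
    assumption; the paper's exact layout may differ."""
    layout = []
    for i in range(num_layers):
        if i % 4 == 3:  # every fourth layer attends over the full 131k window
            layout.append(("global", global_window))
        else:           # the other three use the local 8k window
            layout.append(("local", local_window))
    return layout

layout = attention_layout()
num_global = sum(1 for kind, _ in layout if kind == "global")  # 16 of 64 layers
```

With this layout, most layers pay only local-attention cost while the periodic global layers propagate information across the full 131k-token context.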
Training recipe: pre → mid → post
- Pretraining: 8T tokens with a code-heavy corpus using an 8k context.
- Mid-training: an additional ~5T tokens at long context (131k) featuring Python execution traces, ForagerAgent trajectories, PR-derived diffs, compiler intermediate representations (IR), Triton kernels, and Lean math data.
- Post-training: a 100B-token supervised fine-tune for instruction following and reasoning, followed by multi-task RL (~172B tokens) across verifiable coding, math, and multi-turn software-engineering environments using a GRPO-style algorithm and a minimal toolset (bash/edit/create/submit).
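A GRPO-style objective replaces a learned value baseline with group-relative reward normalization: several completions are sampled per prompt, and each completion's advantage is its reward standardized against the group. The sketch below shows that standard GRPO advantage computation; whether CWM's variant matches it exactly is not specified in this summary.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages in the GRPO style: subtract the group's
    mean reward and divide by its standard deviation, so no learned value
    network is needed. (CWM reports a GRPO-style algorithm; details may differ.)"""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four rollouts for one prompt, reward 1.0 when hidden tests pass.
advs = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Passing rollouts get positive advantages and failing ones negative, which suits the verifiable coding and math environments described above, where rewards are largely binary.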
Quantized inference fits on a single 80 GB H100, making research-scale evaluation accessible.
Benchmarks and performance
The team reports competitive results for CWM across coding and math benchmarks:
- SWE-bench Verified: 65.8% pass@1 (with test-time scaling)
- LiveCodeBench-v5: 68.6%
- LiveCodeBench-v6: 63.5%
- Math-500: 96.6%
- AIME-24: 76.0%
- AIME-25: 68.2%
- CruxEval-Output: 94.3%

CWM is positioned as competitive with similar open-weights baselines and, on some coding tasks, as rivaling larger or closed models.
Operational capabilities enabled by world modeling
Two operational capabilities stand out:
- Execution-trace prediction: given a function and a trace start, CWM predicts stack frames and the executed line at each step in a structured format. This can function as a neural debugger for grounded reasoning without live execution.
- Agentic coding: multi-turn reasoning with tool use against real repositories, with verification via hidden tests and patch similarity rewards. The model is trained to localize faults and generate end-to-end patches (git diffs) rather than isolated snippets.
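The minimal toolset can be pictured as a dispatch function over model-issued actions. This is a hypothetical sketch: the tool names (bash/edit/create/submit) come from the release, but the action schema and handlers here are illustrative stand-ins.

```python
import subprocess
import tempfile
from pathlib import Path

def run_tool(action, repo):
    """Dispatch one model-issued action to a tool. Tool names follow the
    release; the dict-based argument schema is a hypothetical stand-in."""
    name = action["tool"]
    if name == "bash":
        proc = subprocess.run(action["cmd"], shell=True, cwd=repo,
                              capture_output=True, text=True)
        return proc.stdout + proc.stderr
    if name == "create":
        Path(repo, action["path"]).write_text(action["content"])
        return f"created {action['path']}"
    if name == "edit":
        path = Path(repo, action["path"])
        path.write_text(path.read_text().replace(action["old"], action["new"]))
        return f"edited {action['path']}"
    if name == "submit":
        # Emit the working-tree diff as the candidate patch.
        proc = subprocess.run(["git", "diff"], cwd=repo,
                              capture_output=True, text=True)
        return proc.stdout
    return f"unknown tool: {name}"

# Toy episode against a scratch directory standing in for a repository.
repo = tempfile.mkdtemp()
run_tool({"tool": "create", "path": "notes.txt", "content": "hello"}, repo)
run_tool({"tool": "edit", "path": "notes.txt", "old": "hello", "new": "patched"}, repo)
```

In the trained setting, each tool's output would be fed back to the model as the next observation, and `submit` would trigger evaluation against the hidden tests and patch-similarity reward described above.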
Practical notes and license
Tokenizer choices follow the Llama-3 family, with reserved control tokens used to demarcate trace and reasoning segments during SFT. The attention layout alternates local and global blocks in a 3:1 interleave repeated across depth. Compute and learning-rate schedules were tuned with internal scaling-law sweeps that account for the overheads of long-context training. Meta FAIR released intermediate and post-trained checkpoints under the FAIR Non-Commercial Research License, making CWM a reproducible platform for ablation studies on long-context, execution-aware code generation.
For full details, readers can consult the paper, the GitHub page, and the model on Hugging Face linked in the original release.