Small RLM Predicts Kernel Latency, Memory and Model Accuracy Directly from Code
A single model that turns code into numbers
Researchers from Cornell and Google present a Regression Language Model (RLM) that reads raw text representations of code, Triton kernels, and ONNX graphs and directly predicts numeric metrics such as GPU kernel latency, program memory usage, and neural network accuracy and latency. The approach removes the need for hand-engineered features, graph encoders, or bespoke proxies by treating regression as next-token prediction over numeric strings.
What the model predicts
The RLM is trained to output multiple numeric targets from a single input string. Example targets include:
- Peak memory usage for high-level programs written in Python, C, C++ and other languages.
- Latency for Triton GPU kernels (measured on specific devices such as the RTX A6000).
- Accuracy and hardware-specific latency for architectures exported to ONNX.
Outputs are emitted as digit tokens (sign, exponent, mantissa) with constrained decoding to ensure valid numerals and to support calibrated uncertainty via sampling.
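To make the output format concrete, here is a minimal sketch of serializing a float into sign, exponent, and mantissa digit tokens and decoding it back. The token spelling and the four-digit mantissa are illustrative assumptions, not the paper's exact vocabulary.

```python
# Illustrative sketch of the sign/exponent/mantissa digit encoding.
# The exact token spelling and digit counts used by the RLM may differ;
# this only shows the idea of serializing a float into discrete tokens.
import math

def float_to_tokens(x: float, mantissa_digits: int = 4) -> list[str]:
    """Encode a float as <sign><exponent><mantissa digit> tokens."""
    if x == 0.0:
        return ["<+>", "<E+0>"] + ["<0>"] * mantissa_digits
    sign = "<+>" if x > 0 else "<->"
    x = abs(x)
    exponent = math.floor(math.log10(x))
    scaled = round(x / 10 ** exponent * 10 ** (mantissa_digits - 1))
    if scaled >= 10 ** mantissa_digits:      # rounding overflow, e.g. 9.9999 -> 10.000
        scaled //= 10
        exponent += 1
    exp_sign = "+" if exponent >= 0 else "-"
    return [sign, f"<E{exp_sign}{abs(exponent)}>"] + [f"<{d}>" for d in str(scaled)]

def tokens_to_float(tokens: list[str]) -> float:
    """Decode the token sequence back into a float."""
    sign = 1.0 if tokens[0] == "<+>" else -1.0
    exponent = int(tokens[1][2:-1])
    digits = "".join(t.strip("<>") for t in tokens[2:])
    return sign * int(digits) / 10 ** (len(digits) - 1) * 10 ** exponent

print(float_to_tokens(0.00314))                   # ['<+>', '<E-3>', '<3>', '<1>', '<4>', '<0>']
print(tokens_to_float(float_to_tokens(0.00314)))  # ~0.00314
```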
Key performance highlights
The paper reports strong rank correlations across heterogeneous tasks:
- APPS LeetCode memory prediction: Spearman ρ > 0.9.
- Triton kernel latency (A6000): ρ ≈ 0.52.
- CodeNet memory across 17 languages: average ρ > 0.5, with C/C++ around 0.74–0.75.
- NAS ranking across several search spaces: average Kendall τ ≈ 0.46.
These results are competitive with, and in some cases surpass, graph-based predictors and other baselines.
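For reference, rank correlations compare predicted and measured values by ordering alone. A minimal sketch of computing them with SciPy, on made-up numbers, looks like this:

```python
# Minimal sketch of how these rank correlations are computed with SciPy.
# The latency values below are made up for illustration only.
from scipy.stats import kendalltau, spearmanr

measured_ms  = [0.8, 1.2, 0.5, 2.4, 1.9]   # e.g. profiled kernel latencies
predicted_ms = [0.9, 1.1, 0.6, 2.0, 2.2]   # hypothetical RLM predictions

rho, _ = spearmanr(measured_ms, predicted_ms)
tau, _ = kendalltau(measured_ms, predicted_ms)
print(f"Spearman rho = {rho:.2f}, Kendall tau = {tau:.2f}")
```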
Architecture and training
The backbone is an encoder–decoder initialized from a T5Gemma encoder (about 300M parameters). Inputs are plain text strings (code, Triton IR, ONNX text). The decoder autoregressively emits numeric tokens under constrained decoding rules that guarantee valid numeric formats. Key ablations show:
- Language pretraining speeds convergence and improves Triton latency prediction.
- Autoregressive numeric emission outperforms standard MSE regression heads even after target normalization.
- Learned tokenizers tailored to ONNX operators increase effective context capacity.
- Longer input contexts and scaling the Gemma encoder provide further gains with proper tuning.
The regress-lm library implements the training and constrained decoding utilities, and the Code-Regression dataset bundles the benchmarks needed to reproduce experiments.
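As a rough illustration of constrained numeric decoding, the sketch below masks the logits at each step so that only tokens continuing a well-formed number can be sampled. It follows the sign/exponent/mantissa scheme from earlier and is not the regress-lm implementation; the vocabulary and output length are assumptions.

```python
# Sketch of constrained numeric decoding: at every step, logits over the full
# vocabulary are masked so that only tokens continuing a well-formed number
# (sign, then exponent, then mantissa digits) can be sampled. Not the
# regress-lm implementation; vocabulary and output length are assumptions.
import numpy as np

SIGN_TOKENS  = ["<+>", "<->"]
EXP_TOKENS   = [f"<E{s}{k}>" for s in "+-" for k in range(10)]
DIGIT_TOKENS = [f"<{d}>" for d in range(10)]
VOCAB = SIGN_TOKENS + EXP_TOKENS + DIGIT_TOKENS

def allowed_tokens(position: int, mantissa_digits: int = 4) -> list[str]:
    """Which token types are legal at each position of the numeric output."""
    if position == 0:
        return SIGN_TOKENS
    if position == 1:
        return EXP_TOKENS
    if position < 2 + mantissa_digits:
        return DIGIT_TOKENS
    return []  # number is complete

def constrained_sample(logits: np.ndarray, position: int, rng: np.random.Generator) -> str:
    """Sample the next token after masking every illegal vocabulary entry."""
    mask = np.full(len(VOCAB), -np.inf)
    for tok in allowed_tokens(position):
        mask[VOCAB.index(tok)] = 0.0
    shifted = logits + mask
    probs = np.exp(shifted - shifted[np.isfinite(shifted)].max())
    probs /= probs.sum()
    return VOCAB[rng.choice(len(VOCAB), p=probs)]

# Decode one number from dummy logits (a real decoder would supply these).
rng = np.random.default_rng(0)
tokens = [constrained_sample(rng.normal(size=len(VOCAB)), pos, rng) for pos in range(6)]
print(tokens)  # a well-formed sign/exponent/digit sequence
```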
Why this matters
Current performance-prediction pipelines often rely on fragile hand-designed features or on graph neural networks tightly coupled to particular IRs or operators. By standardizing regression as text-to-number generation, the RLM simplifies the stack: tokenize source artifacts as plain text, fine-tune the same model for new metrics or hardware, and decode calibrated numerics digit by digit. This reduces maintenance cost and eases transfer across languages, kernels, and hardware targets.
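As a hedged sketch of that last step, sampling the decoder several times yields both a point estimate and an uncertainty interval; `predict_latency_ms` below is a hypothetical stand-in for a single constrained decode, not a real regress-lm call.

```python
# Hypothetical usage sketch of sampling-based uncertainty: decode the same
# input several times and summarize the samples into a point estimate and an
# interval. `predict_latency_ms` is a stand-in stub, not a real regress-lm API.
import random
import statistics

def predict_latency_ms(code_text: str) -> float:
    """Stand-in for one constrained decode of a fine-tuned RLM (hypothetical)."""
    return 1.0 + 0.1 * random.random()

def predict_with_uncertainty(code_text: str, num_samples: int = 32):
    samples = sorted(predict_latency_ms(code_text) for _ in range(num_samples))
    point = statistics.median(samples)
    lo, hi = samples[int(0.05 * num_samples)], samples[int(0.95 * num_samples) - 1]
    return point, (lo, hi)

print(predict_with_uncertainty("def kernel(...): ..."))
```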
Datasets and tooling
The authors release a Code-Regression dataset that unifies APPS/LeetCode memory runs, KernelBook-derived Triton latencies, and CodeNet memory footprints. A NAS/ONNX suite exports architectures from NASBench, FBNet, Once-for-All and others to ONNX text to predict accuracy and per-device latency. The regress-lm library provides recipes for multi-task pretraining and fine-tuning as well as constrained decoding implementations.
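To give a feel for what "ONNX text" input can look like, this hedged snippet exports a tiny PyTorch model and prints its graph as a string; the exact textual format used by the released suite may differ.

```python
# Hedged sketch of the architecture-to-text step: export a tiny PyTorch model
# to ONNX and render the graph as a plain string, the kind of input an RLM can
# consume. The exact text format used by the released suite may differ.
import onnx
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(8 * 32 * 32, 10),
)
dummy_input = torch.randn(1, 3, 32, 32)

torch.onnx.export(model, dummy_input, "tiny_net.onnx")

graph_text = onnx.helper.printable_graph(onnx.load("tiny_net.onnx").graph)
print(graph_text[:500])  # textual operator graph, ready to be tokenized
```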
Practical takeaways
A single ~300M-parameter T5Gemma-initialized RLM can predict memory, kernel latency, and model accuracy and latency directly from textual code representations without feature engineering. The approach yields rank correlations strong enough to be useful for compiler heuristics, kernel selection/pruning, and multi-objective NAS triage, while lowering the barrier to adaptation across new hardware and languages.