Apriel-1.5-15B-Thinker: Frontier Multimodal Reasoning on a Single GPU
What Apriel-1.5-15B-Thinker is
ServiceNow AI Research Lab has published Apriel-1.5-15B-Thinker, a 15-billion-parameter multimodal reasoning model released with open weights under an MIT license on Hugging Face. The model emphasizes reproducibility and on-premise deployability: the checkpoint fits on a single GPU while delivering frontier-level composite performance across diverse third-party evaluations.
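For orientation, here is a minimal loading and generation sketch. It assumes the checkpoint works with transformers' generic image-text-to-text classes and chat templating; confirm the exact model class, dtype, and prompt format against the model card.

```python
# Minimal sketch: load the checkpoint on one GPU with Hugging Face transformers.
# Assumption: the model is compatible with AutoModelForImageTextToText and the
# processor's chat template; check the model card for the supported interface.
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "ServiceNow-AI/Apriel-1.5-15b-Thinker"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # bf16 keeps the 15B weights within one large GPU
    device_map="auto",
)

messages = [{"role": "user", "content": [{"type": "text", "text": "What is 17 * 24?"}]}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```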
Architecture and scaling strategy
Apriel builds on Mistral’s Pixtral-12B-Base-2409 multimodal decoder-vision stack and applies depth upscaling to expand the decoder from 40 to 48 layers. The team then realigns the projection network that connects the vision encoder to the enlarged decoder, avoiding from-scratch pretraining and preserving the single-GPU deployment target.
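The announcement does not spell out the layer-duplication scheme, so the sketch below is a hypothetical illustration of the mechanics only: it grows a 40-layer decoder to 48 by deep-copying a contiguous block of middle layers, after which (in Apriel's pipeline) the projection network is realigned rather than everything being retrained.

```python
# Hypothetical depth-upscaling sketch: expand a decoder from 40 to 48 layers by
# duplicating a contiguous middle block. The block selection is an assumption;
# the published recipe may choose layers differently.
import copy
import torch.nn as nn

def depth_upscale(layers: nn.ModuleList, target: int) -> nn.ModuleList:
    n = len(layers)                 # e.g. 40 original decoder layers
    extra = target - n              # e.g. 8 new layers to add
    start = (n - extra) // 2        # middle block to duplicate (assumption)
    dup = [copy.deepcopy(layers[i]) for i in range(start, start + extra)]
    # Keep the prefix through the chosen block, insert the copies right after
    # the layers they clone, then append the remaining suffix.
    merged = list(layers[: start + extra]) + dup + list(layers[start + extra:])
    return nn.ModuleList(merged)

# Usage: decoder.layers = depth_upscale(decoder.layers, target=48)
```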
Mid-training recipe: continual pretraining then SFT
The training pipeline is data-centric and proceeds in two mid-training phases without reinforcement learning or preference optimization:
CPT (Continual Pretraining): Two stages were used. The first mixes text and image data to build foundational reasoning along with document and diagram understanding; the second uses targeted synthetic visual tasks such as reconstruction, matching, detection, and counting to sharpen spatial and compositional capabilities. Sequence lengths were extended to 32k and 16k tokens respectively, with selective loss placement on response tokens for instruction-formatted samples (see the masking sketch below).
SFT (Supervised Fine-Tuning): High-quality instruction data covering math, coding, science, tool use, and instruction following. Two additional SFT runs, one over a stratified subset and another with longer context, were weight-merged into the final checkpoint (see the merge sketch below). No RL or RLAIF was used.
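In this context, "selective loss placement" conventionally means computing the loss only on the response tokens of instruction-formatted samples. A minimal sketch of that masking, assuming the standard PyTorch and Hugging Face convention of marking ignored positions with -100:

```python
# Sketch of selective loss placement: mask prompt tokens out of the loss so only
# response tokens are trained on. -100 is the ignore_index convention used by
# torch.nn.CrossEntropyLoss and by Hugging Face causal-LM labels.
import torch

def make_labels(input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    labels = input_ids.clone()
    labels[:prompt_len] = -100  # prompt tokens contribute nothing to the loss
    return labels               # response tokens are supervised normally

# Example: a 10-token sequence whose first 6 tokens are the prompt.
ids = torch.arange(10)
print(make_labels(ids, prompt_len=6))
# tensor([-100, -100, -100, -100, -100, -100,    6,    7,    8,    9])
```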
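The merge step is likewise summarized rather than specified. The simplest reading is a parameter-wise weighted average of checkpoints that share one architecture; the sketch below assumes that, with hypothetical file names.

```python
# Sketch of checkpoint weight merging: a parameter-wise weighted average of
# state dicts with identical shapes. Whether Apriel used a plain average or a
# more elaborate merge scheme is an assumption here.
import torch

def merge_state_dicts(state_dicts, weights=None):
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    return {
        key: sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
        for key in state_dicts[0]
    }

# Hypothetical usage: equal-weight merge of the base SFT run with the
# stratified-subset and longer-context runs.
# merged = merge_state_dicts([torch.load(p) for p in
#                             ["sft.pt", "sft_stratified.pt", "sft_long_ctx.pt"]])
```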
A data note: roughly 25% of the text used in the depth-upscaling mix comes from NVIDIA’s Nemotron collection.
Evaluation and benchmark performance
Apriel reports an Artificial Analysis Intelligence Index (AAI) of 52, a composite that aggregates ten third-party evaluations including MMLU-Pro, GPQA Diamond, AIME 2025, LiveCodeBench, SciCode, and IFBench. Despite being dramatically smaller than some state-of-the-art models, Apriel matches the composite score of models such as DeepSeek-R1-0528 while offering significant cost savings.
Reported task-level results include:
- AIME 2025 (mathematics): about 87.5–88% pass@1
- GPQA Diamond: around 71% accuracy
- IFBench (instruction following): roughly 62
- τ²-Bench Telecom: about 68
- LiveCodeBench (functional correctness): ~72.8
Evaluated with VLMEvalKit for reproducibility, Apriel scores competitively across multimodal and math-focused suites such as MMMU, LogicVista, MathVision, MathVerse, MMStar, CharXiv, AI2D, and BLINK, with particularly strong results on documents, diagrams, and text-dominant math imagery.
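To reproduce such numbers independently, VLMEvalKit can be driven from a short script. The sketch below shells out to its run.py entry point; the dataset keys and the registered model name are assumptions that should be checked against the VLMEvalKit version in use.

```python
# Sketch: batch-run VLMEvalKit benchmarks from Python. Assumes VLMEvalKit is
# cloned and installed, and is run from its repository root. The --data keys and
# --model name below are illustrative and must match your VLMEvalKit registry.
import subprocess

benchmarks = ["MMMU_DEV_VAL", "MMStar", "AI2D_TEST"]
for data in benchmarks:
    subprocess.run(
        ["python", "run.py", "--data", data, "--model", "Apriel-1.5-15b-Thinker"],
        check=True,
    )
```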
Practical implications
Apriel’s combination of open weights, a reproducible training recipe, and a single-GPU checkpoint makes it a practical baseline for enterprises and researchers who need on-premise or air-gapped deployments with fixed memory and latency budgets. The model is intended as a cost-efficient, transparent option to evaluate before considering larger closed systems.
Where to find it
The weights, training recipe, and evaluation protocol are publicly available on Hugging Face under an MIT license for independent verification and experimentation.
Hugging Face model page: https://huggingface.co/ServiceNow-AI/Apriel-1.5-15b-Thinker
Research PDF: https://huggingface.co/ServiceNow-AI/Apriel-1.5-15b-Thinker/blob/main/Apriel-1.5-Thinker.pdf