
Tina: USC's Tiny Models Deliver Big Advances in Cost-Effective Reinforcement Learning

USC researchers introduce Tina, a family of compact reasoning models that leverage LoRA and reinforcement learning to deliver strong multi-step reasoning performance at a fraction of typical training costs.

Challenges in Multi-Step Reasoning with Language Models

Achieving robust multi-step reasoning in language models (LMs) remains a significant hurdle despite improvements in general task performance. This capability is essential for tackling complex domains like scientific research and strategic planning. The conventional route to stronger reasoning is supervised fine-tuning (SFT), where models learn step-by-step reasoning by imitating demonstrations from advanced models such as o1. However, this approach depends on costly, high-quality reasoning traces and risks encouraging superficial mimicry rather than genuine logical inference.

Reinforcement Learning as an Alternative

Reinforcement Learning (RL) enables models to learn directly from reward signals, fostering broader exploration of reasoning strategies. Despite this potential, RL methods typically demand substantial compute and engineering complexity, making cost-effective deployment difficult.

Advances in Efficient Reasoning Models

Following powerful models like o1-preview, open-source projects such as STILL, Sky-T1, SimpleRL, PRIME, and DeepScaleR have explored lightweight imitation learning, scalable instruction tuning, and simplified RL to replicate or surpass o1’s reasoning abilities. Innovations like Group Relative Policy Optimization (GRPO) improve RL efficiency by removing the need for separate value networks, as utilized in DeepSeek-R1.
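To make the GRPO idea concrete, here is a minimal sketch of the group-relative advantage computation at its core: several completions are sampled per prompt, and each completion's reward is normalized against its own group, so no separate value network is needed. The reward values and group size below are illustrative, and a full implementation also includes a clipped policy-ratio objective and a KL penalty.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize each completion's reward against its own group's statistics.

    rewards: tensor of shape (num_prompts, group_size), one row per prompt and
    one column per sampled completion of that prompt.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    # The per-group baseline replaces a learned critic / value network.
    return (rewards - mean) / (std + eps)

# Toy example: 2 prompts, 4 sampled completions each, binary correctness rewards.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [1.0, 1.0, 1.0, 0.0]])
print(group_relative_advantages(rewards))
```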

Leveraging Low-Rank Adaptation (LoRA) for Efficient Fine-Tuning

To reduce training costs, Low-Rank Adaptation (LoRA) methods update only a small subset of model parameters, preserving modularity and reasoning capabilities without the overhead of full-parameter tuning.
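As a rough illustration of what this looks like in practice, the snippet below attaches LoRA adapters to a causal LM with the `peft` library. The rank, alpha, dropout, and target modules are assumed values for the sketch, not Tina's actual configuration.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Base model; the LoRA hyperparameters below are illustrative assumptions.
base = AutoModelForCausalLM.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")

lora_config = LoraConfig(
    r=16,                               # rank of the low-rank update matrices
    lora_alpha=32,                      # scaling applied to the adapter output
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()      # only the adapter weights are trainable
```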

Introducing Tina: Compact Reasoning Models from USC

USC researchers present Tina, a family of compact reasoning models built on a 1.5B-parameter base model using RL with LoRA. Tina models match or outperform state-of-the-art counterparts at a fraction of the computational cost: the best Tina model improves reasoning performance by more than 20% over its base model and reaches 43.33% Pass@1 accuracy on AIME24, with a post-training cost of only $9.

Training and Evaluation Setup

Tina models utilize the DeepSeek-R1-Distill-Qwen-1.5B model fine-tuned with LoRA during RL using a GRPO-style approach. The training emphasizes minimalism—tiny models, small parameter updates, and low hardware and budget requirements. Training was conducted on publicly available datasets, replicating setups from STILL-3, DeepScaleR, and Open-RS, using just two NVIDIA L40S GPUs and occasionally RTX 6000 Ada GPUs. Each training experiment cost under $100.
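A hedged sketch of this kind of recipe, using the `trl` library's GRPO trainer with a LoRA `peft_config`, is shown below. The toy dataset, reward function, and hyperparameters are placeholders rather than Tina's actual settings, and exact trainer arguments can differ between trl versions.

```python
from datasets import Dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

# Toy prompt set; the real runs replicate the STILL-3, DeepScaleR, and Open-RS datasets.
train_dataset = Dataset.from_dict({
    "prompt": ["Compute 7 * 8. Put the final answer in \\boxed{}."]
})

def format_reward(completions, **kwargs):
    # Placeholder verifiable reward: 1.0 if a boxed answer is present, else 0.0.
    return [1.0 if "\\boxed" in completion else 0.0 for completion in completions]

trainer = GRPOTrainer(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    reward_funcs=format_reward,
    args=GRPOConfig(output_dir="tina-grpo-sketch", num_generations=8),
    train_dataset=train_dataset,
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
)
trainer.train()
```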

Benchmarking and Performance

To ensure fair comparisons, baseline reasoning models were re-evaluated with the LightEval framework and the vLLM inference engine, eliminating inconsistencies from prior studies. Six benchmarks were used: AIME 24, AIME 25, AMC 23, MATH 500, GPQA, and Minerva. The best Tina checkpoints, reached after only 19–57% of a single training epoch, often outperformed full-parameter models. Ablation studies confirmed that dataset quality, learning rate, LoRA rank, and the choice of RL algorithm all matter for optimizing performance.
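For intuition about the headline metric, the snippet below illustrates what Pass@1 measures, using the vLLM engine directly: one greedy sample per problem, scored by an answer check. The actual evaluation in the paper runs through the LightEval harness; the problem, prompt format, and grading logic here are simplified assumptions.

```python
from vllm import LLM, SamplingParams

# Hypothetical problem/answer pair; the real benchmarks use AIME, AMC, MATH, etc.
problems = [
    {"prompt": "Compute 12 * 12. End with 'Answer: <number>'.", "answer": "144"},
]

llm = LLM(model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
params = SamplingParams(temperature=0.0, max_tokens=512)   # one greedy sample per problem
outputs = llm.generate([p["prompt"] for p in problems], params)

# Pass@1: fraction of problems whose single sample is judged correct.
correct = sum(
    p["answer"] in out.outputs[0].text   # naive substring check stands in for exact grading
    for p, out in zip(problems, outputs)
)
print(f"Pass@1: {correct / len(problems):.2%}")
```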

Open-Source and Accessibility

All code, logs, and model checkpoints are open-sourced, promoting accessible research and further advancements in reasoning models.

For more details, check out the paper and GitHub repository.
