SwiReasoning: Entropy-Guided Switching Between Latent Thought and Explicit Chain-of-Thought
What SwiReasoning does
SwiReasoning is a decoding-time framework that lets a reasoning LLM decide dynamically when to reason silently in latent space and when to produce an explicit chain-of-thought (CoT). The controller observes block-wise trends in next-token entropy and uses that signal as a confidence estimate. When entropy rises and confidence falls, the model enters a latent reasoning block and continues internal computation without emitting tokens. When entropy drops and confidence recovers, the model switches back to explicit CoT and emits tokens to consolidate its chosen path.
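A minimal way to formalize the confidence signal (the exact block-wise aggregation below is an illustrative assumption, not the paper's published formula) is the Shannon entropy of the next-token distribution and its change across consecutive blocks:

```latex
H_t = -\sum_{v \in \mathcal{V}} p_t(v)\,\log p_t(v),
\qquad
\Delta_b = \bar{H}_b - \bar{H}_{b-1}
```

Here p_t(v) is the probability assigned to vocabulary item v at decoding step t, H̄_b averages H_t over block b, and Δ_b > 0 (rising entropy, falling confidence) signals a latent phase while Δ_b < 0 signals a return to explicit CoT.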
Inference-time controller and switching mechanism
The central mechanism is a simple, training-free controller that monitors next-token entropy across decoding steps to form a block-wise confidence signal. Switching is governed by entropy trends rather than a learned policy: upward entropy trends trigger latent phases to broaden exploration, while downward trends trigger explicit phases to commit to a solution. A maximum switch count bounds the number of transitions between latent and explicit blocks, preventing both excessive oscillation and prolonged silent wandering.
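A minimal, training-free sketch of such a controller, assuming an exponential moving average (EMA) of per-step entropy as the trend signal; the paper's actual trend test, smoothing, and latent-step mechanics are not specified here and will differ in detail:

```python
import torch
import torch.nn.functional as F

def next_token_entropy(logits: torch.Tensor) -> float:
    """Shannon entropy (in nats) of the next-token distribution."""
    log_probs = F.log_softmax(logits, dim=-1)
    return float(-(log_probs.exp() * log_probs).sum())

class EntropySwitchController:
    """Flips between latent and explicit phases on reversals of a
    smoothed entropy trend, up to a fixed switch budget (sketch only)."""

    def __init__(self, alpha: float = 0.9, max_switch_count: int = 6):
        self.alpha = alpha                 # EMA smoothing factor (assumed role)
        self.max_switch_count = max_switch_count
        self.ema = None                    # smoothed entropy
        self.prev_ema = None
        self.explicit = True               # start by emitting tokens
        self.switches = 0

    def step(self, logits: torch.Tensor) -> bool:
        """Consume this step's logits; return True to emit a token
        (explicit CoT), False to take a silent latent step."""
        h = next_token_entropy(logits)
        self.prev_ema = self.ema
        self.ema = h if self.ema is None else (
            self.alpha * self.ema + (1 - self.alpha) * h
        )
        if self.switches >= self.max_switch_count:
            self.explicit = True           # budget spent: finish explicitly
        elif self.prev_ema is not None:
            rising = self.ema > self.prev_ema
            if rising and self.explicit:
                self.explicit = False      # confidence falling: explore latently
                self.switches += 1
            elif not rising and not self.explicit:
                self.explicit = True       # confidence recovering: commit via CoT
                self.switches += 1
        return self.explicit
```

In a generation loop, a `False` return would mean feeding a latent representation (for example, the expected token embedding, as in Soft Thinking-style latent reasoning) back into the model instead of a sampled token; that plumbing is model-specific and omitted here.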
Empirical results on math and STEM benchmarks
SwiReasoning reports consistent gains across mathematics and STEM reasoning tasks. With unlimited decoding budgets (Pass@1), it lifts average accuracy by +1.5%–2.8%, with the largest gains on math evaluations. Under constrained token budgets, it reports average token-efficiency improvements of +56%–79%, outperforming standard CoT variants in 13 of 15 budgeted evaluations. On AIME 2024/2025, SwiReasoning reaches its maximum reasoning accuracy significantly earlier than CoT, converging with fewer sampled trajectories.
Why alternating helps
Explicit CoT is discrete and human-readable but can prematurely lock the model into a single reasoning path, discarding useful alternatives. Purely latent reasoning allows richer, continuous exploration but can diffuse probability mass and slow convergence. SwiReasoning combines both: latent phases broaden exploration when confidence is low, and explicit phases exploit recovering confidence to consolidate decisions. The capped switch count regularizes transitions, curbing overthinking and token waste while limiting the accuracy loss that comes from diffusing probability mass in latent space.
Comparison to baselines and practical implications
Benchmarks compare SwiReasoning against CoT with sampling, greedy CoT, and Soft Thinking. The method shifts the accuracy-efficiency Pareto frontier outward, delivering either higher accuracy at the same token budget or similar accuracy with fewer tokens. Its value proposition centers on accuracy per token rather than raw state-of-the-art scores, making it especially relevant for budgeted inference and large-scale batching. The open-source implementation (reportedly BSD-licensed) exposes flags such as --max_switch_count and --alpha, enabling straightforward replication and stacking with complementary efficiency techniques such as quantization or speculative decoding.
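To make the Pareto framing concrete, here is a small, self-contained helper (with made-up numbers, not the paper's results) that extracts the non-dominated (token budget, accuracy) points from a set of evaluations:

```python
from typing import List, Tuple

def pareto_frontier(points: List[Tuple[int, float]]) -> List[Tuple[int, float]]:
    """Return the (token_budget, accuracy) points not dominated by any
    other point, i.e. no other point is at least as accurate while
    using no more tokens (and strictly better in one dimension)."""
    frontier = []
    best_acc = float("-inf")
    # Sort by budget, best accuracy first within a budget, then sweep:
    # a point survives only if it raises the best accuracy seen so far.
    for budget, acc in sorted(points, key=lambda p: (p[0], -p[1])):
        if acc > best_acc:
            frontier.append((budget, acc))
            best_acc = acc
    return frontier

# Hypothetical (budget, accuracy) pairs, for illustration only:
methods = [(512, 0.41), (1024, 0.55), (2048, 0.58), (1024, 0.49), (2048, 0.63)]
print(pareto_frontier(methods))  # [(512, 0.41), (1024, 0.55), (2048, 0.63)]
```

"Shifting the frontier outward" means a method contributes points that dominate the previous frontier: more accuracy at the same budget, or the same accuracy at a smaller one.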
Key takeaways
- Training-free entropy-based controller alternates latent and explicit CoT using block-wise next-token entropy trends.
- Large token-efficiency gains under constrained budgets, averaging +56%–79% over CoT.
- Moderate accuracy lifts at unlimited budgets, roughly +1.5%–2.8% on math/STEM suites.
- Faster convergence on tasks like AIME 2024/2025, achieving peak accuracy with fewer samples.
For details and implementation, see the paper and project page linked in the original release.