Enhancing Multilingual Reasoning in English-Centric RLMs Through Test-Time Scaling

Research shows that increasing reasoning tokens at test time boosts multilingual reasoning accuracy in English-centric RLMs, especially for high-resource languages and STEM tasks, but the gains do not extend well to low-resource languages or out-of-domain tasks.

The Challenge of Multilingual Reasoning in RLMs

Reasoning language models (RLMs) are designed to simulate step-by-step problem-solving by generating detailed reasoning chains, improving performance in complex tasks such as mathematics and logic. However, despite many large models being multilingual, their training and research predominantly focus on English. This creates challenges for reasoning in other languages, especially those with limited training data, leading to poorer output quality and reasoning errors due to linguistic differences.

Limitations of Current Approaches

Most RLMs are fine-tuned primarily on English datasets, limiting their reasoning capabilities in other languages. Techniques like zero-shot or few-shot prompting often rely on English as a pivot language, which can cause inconsistencies. Smaller models show minimal improvements, and even large models struggle with low-resource languages. The gap between training languages and reasoning languages remains a significant barrier.

Research on Test-Time Scaling

Researchers from Brown University and MBZUAI explored how increasing test-time computation, specifically by extending reasoning chains, impacts multilingual reasoning in English-centric RLMs. Using s1 models based on the Qwen2.5-Instruct architecture fine-tuned on 1,000 English STEM reasoning samples, they evaluated performance across multiple languages with benchmarks such as MGSM and Global-MMLU.
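The test-time scaling studied here works by extending the model's chain of thought up to a fixed token budget, for example by appending a continuation cue such as "Wait" whenever the model tries to stop early (the "budget forcing" idea introduced with s1). The sketch below illustrates this mechanism under stated assumptions: the model name, prompt format, and continuation cue are placeholders for illustration, not the paper's exact setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model: the paper evaluates s1-style fine-tunes of the Qwen2.5-Instruct family.
MODEL_NAME = "Qwen/Qwen2.5-14B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype="auto", device_map="auto")

def reason_with_budget(question: str, thinking_budget: int = 8000) -> str:
    """Generate a reasoning chain of up to `thinking_budget` new tokens,
    nudging the model to keep thinking whenever it stops early."""
    prompt = f"Question: {question}\nThink step by step, then give the final answer.\n"
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

    spent = 0
    while spent < thinking_budget:
        out = model.generate(input_ids, max_new_tokens=thinking_budget - spent, do_sample=False)
        new_tokens = out[0, input_ids.shape[1]:]
        spent += new_tokens.shape[0]

        if spent >= thinking_budget:
            input_ids = out
            break

        # The model stopped before exhausting its budget: drop the end-of-sequence
        # token (if present) and append a continuation cue to force more reasoning.
        if new_tokens.numel() > 0 and new_tokens[-1].item() == tokenizer.eos_token_id:
            out = out[:, :-1]
        wait_ids = tokenizer("\nWait,", add_special_tokens=False,
                             return_tensors="pt").input_ids.to(model.device)
        input_ids = torch.cat([out, wait_ids], dim=1)
        spent += wait_ids.shape[1]

    return tokenizer.decode(input_ids[0], skip_special_tokens=True)
```

Varying `thinking_budget` in a sweep of this kind is what allows accuracy to be measured as a function of test-time compute across languages.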

Key Findings

Larger models notably benefit from increasing the number of reasoning tokens during testing. The 14B s1 model, scaled up to 8,000 thinking tokens, achieved an average accuracy of 81% on non-English MGSM tasks, outperforming Qwen2.5-14B-Instruct by +23.1% in French and +41.6% in Swahili. Despite being trained only on English data, it outperformed even larger models like DeepSeek’s R1-Distill-Qwen-32B in several high-resource languages.

Language Efficiency and Behavior

Reasoning was more efficient and accurate in high-resource languages such as Chinese and English, requiring fewer tokens. The model exhibited a "quote-and-think" behavior, quoting non-English prompt phrases but reasoning internally in English, demonstrating its multilingual comprehension without direct translation. Forcing reasoning in high-resource languages yielded better accuracy, whereas enforcing reasoning in low-resource languages caused accuracy drops and inefficiencies.
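One simple way to steer the language of the reasoning chain is to prefill the start of the model's thinking with a short phrase in the target language, so that the continuation tends to stay in that language. The snippet below sketches this idea; the starter phrases and prompt layout are illustrative assumptions, not necessarily how the paper enforced the reasoning language.

```python
# Illustrative starter phrases used to nudge the reasoning chain into a given language.
THINKING_STARTERS = {
    "en": "Let me think step by step.",
    "zh": "让我一步一步地思考。",
    "fr": "Réfléchissons étape par étape.",
}

def build_prompt(question: str, reasoning_lang: str = "en") -> str:
    """Build a prompt whose reasoning section is prefilled in `reasoning_lang`."""
    starter = THINKING_STARTERS[reasoning_lang]
    return f"Question: {question}\n\nReasoning: {starter}"

# Example: ask an English question but nudge the model to reason in French.
print(build_prompt("What is 12 * 7?", "fr"))
```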

Domain Generalization Limitations

While test-time scaling improved performance in STEM tasks, it failed to generalize to domains like cultural commonsense or humanities. In benchmarks such as FORK, increasing reasoning tokens sometimes harmed performance due to overthinking.

Conclusion

Test-time scaling enhances multilingual reasoning for English-centric RLMs mainly in high-resource languages and STEM domains. However, it does not generalize well to low-resource languages or out-of-domain tasks, highlighting the need for more balanced multilingual training and domain adaptation research.

For further details, see the original paper.
