Unbabel Launches TOWER+: The Breakthrough Multilingual LLM for Accurate Translation and Instruction Following
Advancing Machine Translation with Large Language Models
Large language models have significantly advanced machine translation: trained on extensive multilingual data, they can translate across numerous languages and dialects while capturing subtle linguistic nuances. However, fine-tuning these models for translation accuracy often degrades their instruction-following and conversational abilities, while general-purpose models struggle to meet professional standards for translation fidelity, especially when they must balance cultural nuance, code generation, problem-solving, and user-specific formatting. Maintaining terminological and formatting consistency across diverse audiences adds further challenges. Enterprises need adaptable systems that cater to domain-specific needs and user preferences without sacrificing fluency.
Challenges in Tailoring Language Models for Translation
Several strategies have been used to improve the translation accuracy of language models. Fine-tuning on parallel corpora improves adequacy and fluency, while continued pretraining on monolingual and parallel data boosts multilingual fluency. Reinforcement learning from human feedback helps align outputs with quality preferences. Proprietary models like GPT-4o and Claude 3.7 lead in translation quality, while open-weight models such as TOWER V2 and GEMMA 2 have matched or outperformed closed-source systems for certain languages. These efforts highlight the ongoing challenge of balancing precise translation with broad language capabilities.
Introducing TOWER+: Unified Training for Translation and General Tasks
Unbabel, in collaboration with Instituto de Telecomunicações, Instituto Superior Técnico, Universidade de Lisboa, and MICS, CentraleSupélec, Université Paris-Saclay, introduced TOWER+, a family of models at scales of 2B, 9B, and 72B parameters. The goal was to find an optimal balance between translation specialization and general-purpose utility. Using a unified training pipeline, TOWER+ models aim to achieve high translation performance alongside robust instruction-following and conversational skills, supporting diverse applications.
TOWER+ Training Pipeline
The training process includes:
- Continued pretraining on curated data comprising monolingual content, filtered parallel sentences formatted as translation instructions, and a small share of instruction-like examples.
- Supervised fine-tuning combining translation tasks with diverse instruction-following scenarios such as code generation, math problem-solving, and question answering.
- Preference optimization via Weighted Preference Optimization (WPO) and group-relative policy updates, using off-policy signals and human-edited translation variants.
- Reinforcement learning with verifiable rewards, based on regex checks and preference annotations, to ensure compliance with explicit instructions during translation (a minimal sketch of such a reward follows this list).
This pipeline balances specialized translation accuracy with versatile language proficiency.
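The paper describes these verifiable rewards only at a high level, so the following is a minimal sketch, under our own assumptions, of how a regex-based compliance check could be turned into a binary reward signal. The `verifiable_reward` function and its `constraints` schema are hypothetical illustrations, not Unbabel's actual implementation.

```python
import re

def verifiable_reward(translation: str, constraints: dict) -> float:
    """Return 1.0 if the translation satisfies every explicit constraint, else 0.0."""
    # Terms (e.g. brand names or glossary entries) that must appear verbatim.
    for term in constraints.get("must_contain", []):
        if term not in translation:
            return 0.0
    # A regex the output must match, e.g. to preserve a numbered-list format.
    pattern = constraints.get("pattern")
    if pattern and not re.search(pattern, translation, flags=re.MULTILINE):
        return 0.0
    # An optional cap on output length in lines.
    max_lines = constraints.get("max_lines")
    if max_lines is not None and len(translation.splitlines()) > max_lines:
        return 0.0
    return 1.0

# A prompt that asked for a numbered list and for "TOWER+" to be kept untranslated.
candidate = "1. TOWER+ est une famille de modèles.\n2. Elle couvre 27 langues."
spec = {"must_contain": ["TOWER+"], "pattern": r"^\d+\.\s", "max_lines": 5}
print(verifiable_reward(candidate, spec))  # -> 1.0
```

In practice such binary checks would be combined with preference annotations to form the reward used during reinforcement learning.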
Benchmark Performance Highlights
The TOWER+ 9B model achieved a 33.47% win rate on multilingual chat prompts and an 84.38 XCOMET-XXL score across 24 language pairs, surpassing comparable open-weight models. The flagship 72B-parameter model scored 54.52% on M-ArenaHard, 89.02 on IFEval instruction following, and 83.29 XCOMET-XXL on WMT24++, setting new marks among open-weight models. On IF-MT, a combined translation and instruction-following benchmark, it scored 5.55 for instruction adherence and 88.95 for translation fidelity, confirming TOWER+ as a state-of-the-art solution for both enterprise and research use.
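XCOMET-XXL scores such as these are produced with Unbabel's open-source COMET toolkit. As a rough sketch (assuming the `unbabel-comet` package is installed and you have access to the large, gated Unbabel/XCOMET-XXL checkpoint on Hugging Face; the paper's exact evaluation scripts are not reproduced here), scoring a system's outputs looks roughly like this:

```python
from comet import download_model, load_from_checkpoint

# Download and load the XCOMET-XXL checkpoint (large; requires Hugging Face access).
model_path = download_model("Unbabel/XCOMET-XXL")
model = load_from_checkpoint(model_path)

# Each sample needs the source, the machine translation, and a reference.
data = [
    {
        "src": "TOWER+ unifies translation and instruction following.",
        "mt": "TOWER+ unifie la traduction et le suivi d'instructions.",
        "ref": "TOWER+ unifie la traduction et le respect des instructions.",
    }
]

output = model.predict(data, batch_size=8, gpus=1)
print(output.scores)        # per-segment quality scores
print(output.system_score)  # corpus-level score, comparable to the numbers above
```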
Technical Summary
- Models at 2B, 9B, and 72B parameters explore the trade-off between translation and general tasks.
- The pipeline integrates continued pretraining (66% monolingual, 33% parallel, 1% instruction-like data), supervised fine-tuning (22.3% translation tasks), weighted preference optimization, and reinforcement learning with verifiable rewards (a mixture sketch follows this list).
- Training covers 27 languages and dialects, with 47 language pairs and over 32 billion tokens.
- Benchmarks show competitive or superior results compared to GPT-4o-1120, Claude-Sonnet-3.7, ALMA-R, GEMMA 2, and LLAMA 3.3.
- The approach reduces model proliferation and operational overhead by unifying translation and conversational capabilities.
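To make the data-mixture figures above concrete, here is an illustrative sketch of how such sampling weights could be expressed and used. The names and structure are assumptions for exposition only, not the team's actual training configuration.

```python
import random

# Continued-pretraining mixture from the summary above (illustrative only).
PRETRAIN_MIXTURE = {
    "monolingual": 0.66,   # curated monolingual text across 27 languages and dialects
    "parallel":    0.33,   # filtered parallel sentences formatted as translation instructions
    "instruction": 0.01,   # a small share of instruction-like examples
}

def sample_source(mixture: dict, rng: random.Random) -> str:
    """Pick which data source the next pretraining example is drawn from."""
    names, weights = zip(*mixture.items())
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {name: 0 for name in PRETRAIN_MIXTURE}
for _ in range(10_000):
    counts[sample_source(PRETRAIN_MIXTURE, rng)] += 1
print(counts)  # approximately 66% / 33% / 1% of draws
```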
TOWER+ offers a scalable, Pareto-optimal framework for future translation-focused large language models.
For more details, see the original research paper and the models released by the team.