
TransEvalnia: Advanced LLM-Powered Translation Evaluation with Human-Like Precision

TransEvalnia leverages prompting-based reasoning with large language models to provide detailed, human-aligned translation evaluations, outperforming traditional metrics on multiple language pairs.

Cutting-Edge Translation Evaluation with LLMs

Translation systems driven by large language models (LLMs) have progressed to the point where they can sometimes outperform human translators. As these models tackle more complex tasks, such as document-level or literary translation, accurately measuring improvements becomes harder. Traditional metrics such as BLEU remain in use, but they offer little insight into why a translation receives the score it does.

Beyond Numerical Scores: Detailed Quality Assessments

With translation quality nearing human-level performance, users demand evaluations that go beyond simple numerical indicators. They seek transparency through reasoning across key dimensions including accuracy, terminology, and audience suitability. This detailed feedback helps users understand evaluation results, spot errors, and make better-informed decisions.

Emerging Metrics and Rationale-Based Evaluation

While BLEU has long been the standard for assessing machine translation, its relevance diminishes as modern systems match or surpass human translators. More recent metrics like BLEURT, COMET, and MetricX fine-tune powerful language models to better judge translation quality. Large models such as GPT and PaLM 2 are capable of zero-shot or structured evaluations and can even generate MQM-style feedback. Pairwise comparison techniques further improve alignment with human judgments. Studies show that prompting models to explain their reasoning enhances decision quality, yet such rationale-based methods remain underexplored in machine translation evaluation.

Introducing TransEvalnia: Prompting-Based Translation Evaluation

Researchers at Sakana.ai developed TransEvalnia, a system that uses prompting-based reasoning with LLMs to evaluate and rank translations. It delivers detailed feedback along selected MQM dimensions, ranks competing translations, and scores each on a 5-point Likert scale, including an overall rating. Evaluated with LLMs including Anthropic's Claude 3.5 Sonnet and Qwen-2.5, TransEvalnia's judgments align strongly with human ratings, performing competitively with or better than leading systems such as MT-Ranker across multiple language pairs, including English-Japanese and Chinese-English.
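To make the reasoning-then-score pattern concrete, the sketch below shows one way such a prompting-based evaluator could be wired up. The dimension list, prompt wording, and the `call_llm` hook are illustrative assumptions for this article, not the authors' actual prompts or code.

```python
# Illustrative sketch only: a minimal prompt-construction helper for a
# TransEvalnia-style evaluation. `call_llm` is a hypothetical stand-in for
# whatever LLM API (e.g. Claude 3.5 Sonnet or Qwen-2.5) would be used.

DIMENSIONS = [
    "accuracy",
    "terminology",
    "audience suitability",
    "clarity",
]

def build_evaluation_prompt(source: str, translation: str) -> str:
    """Ask the LLM to reason over each quality dimension, then assign
    1-5 Likert scores per dimension plus an overall score."""
    dims = "\n".join(f"- {d}" for d in DIMENSIONS)
    return (
        "You are evaluating a translation.\n"
        f"Source text:\n{source}\n\n"
        f"Translation:\n{translation}\n\n"
        "For each dimension below, explain your reasoning, then give a score "
        "from 1 (poor) to 5 (excellent). Finish with an overall 1-5 score.\n"
        f"Dimensions:\n{dims}\n"
    )

def evaluate(source: str, translation: str, call_llm) -> str:
    # `call_llm` is any function that sends a prompt to an LLM and returns text.
    return call_llm(build_evaluation_prompt(source, translation))
```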

Methodology and Bias Mitigation

TransEvalnia evaluates translations span by span across key quality dimensions: accuracy, terminology, audience suitability, and clarity. For poetic texts such as haiku, emotional tone replaces standard grammar checks. Each dimension is scored from 1 to 5, and translations are ranked accordingly. To address position bias, the researchers compared several evaluation strategies, including single-step, two-step, and an interleaving method, with interleaving showing the lowest bias. A no-reasoning method was also tested but was found to lack transparency and to be prone to bias. Human experts reviewed samples to compare the system's judgments with professional evaluations.
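The sketch below illustrates, under simplified assumptions, two of the ideas mentioned above: presenting corresponding spans of two candidate translations alternately so neither always appears first, and probing position bias by running a pairwise comparison in both orders and counting how often the verdict depends on presentation order. The function names and the `rank` callback are hypothetical, not the authors' implementation.

```python
# Minimal sketch of span interleaving and a rough position-bias probe.
from typing import Callable, List

def interleave_spans(spans_a: List[str], spans_b: List[str]) -> str:
    """Present corresponding spans of translation A and B alternately."""
    lines = []
    for i, (a, b) in enumerate(zip(spans_a, spans_b), start=1):
        lines.append(f"Span {i} - Translation A: {a}")
        lines.append(f"Span {i} - Translation B: {b}")
    return "\n".join(lines)

def position_bias_rate(pairs, rank: Callable[[str, str], str]) -> float:
    """Fraction of pairs whose winner changes when the order is swapped.

    `rank(x, y)` returns "first" or "second"; a position-robust evaluator
    should pick the same underlying translation regardless of order."""
    flips = 0
    for a, b in pairs:
        forward = rank(a, b)   # A shown first
        backward = rank(b, a)  # B shown first
        # Consistent evaluators answer "first" forward and "second" backward
        # (or vice versa); matching answers mean the winner flipped with order.
        if (forward == "first") == (backward == "first"):
            flips += 1
    return flips / len(pairs) if pairs else 0.0
```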

Performance and Comparative Analysis

The team benchmarked TransEvalnia against MT-Ranker, COMET-22/23, XCOMET-XXL, and MetricX-XXL on datasets with human scores. On WMT-2024 English-Spanish data, MT-Ranker performed best, a result attributed to its extensive training data. However, TransEvalnia matched or outperformed MT-Ranker on most other datasets, with Qwen's no-reasoning approach winning on WMT-2023 English-German. The position-bias analysis showed that interleaved methods consistently had the lowest bias scores. Human raters gave Claude 3.5 Sonnet's evaluations the highest overall Likert scores (4.37–4.61), and those evaluations correlated well with human judgments (Spearman's R of roughly 0.51–0.54).
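For readers who want to run the same kind of agreement check on their own data, the snippet below computes Spearman's rank correlation between system-assigned overall Likert scores and human ratings. The numbers are placeholders, not the paper's data.

```python
# Spearman correlation between system scores and human ratings (toy data).
from scipy.stats import spearmanr

system_scores = [4.5, 3.0, 5.0, 2.5, 4.0]   # illustrative overall Likert scores
human_scores  = [5.0, 3.5, 4.5, 2.0, 4.0]   # illustrative human ratings

rho, pvalue = spearmanr(system_scores, human_scores)
print(f"Spearman's R = {rho:.2f} (p = {pvalue:.3f})")
```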

Open Data and Future Directions

The researchers have openly shared all data, reasoning outputs, and code to foster further research. Fine-tuning Qwen significantly improved performance, and addressing position bias remains a core focus for improving ranking systems in translation evaluation.

For more details, see the original paper.
