Unbabel Launches TOWER+: A Breakthrough Unified Model for High-Quality Translation and Instruction-Following
Unbabel, together with researchers from Instituto de Telecomunicações, Instituto Superior Técnico (Universidade de Lisboa), and MICS, CentraleSupélec, Université Paris-Saclay, has introduced TOWER+, a unified framework for high-fidelity translation and instruction-following in multilingual large language models (LLMs). The work addresses a longstanding challenge in the field: balancing precise, culturally aware translation with the broader capabilities needed for code generation, problem-solving, and user-specified formatting.

Current Challenges in AI Translation

Large language models have made significant strides in machine translation, handling numerous languages and dialects while capturing nuanced linguistic detail. However, fine-tuning these models for translation accuracy often degrades their instruction-following and conversational abilities. Conversely, general-purpose models struggle to meet the standards of professional translation, failing to maintain terminological consistency or adhere to strict formatting guidelines. Benchmarks such as WMT24++ and IFEval expose this gap between specialized translation quality and general-purpose versatility, which has been a critical bottleneck for enterprise adoption.

Introducing TOWER+

To tackle these issues, the research team developed TOWER+ in three sizes: 2 billion, 9 billion, and 72 billion parameters. The goal is to explore the trade-off between translation specialization and general-purpose utility, positioning the TOWER+ models on the Pareto frontier so that strong performance in one area does not come at the expense of the other.

TOWER+ Training Pipeline

The TOWER+ training pipeline consists of four stages:

1. Continued Pretraining: Uses a carefully curated dataset of monolingual content, filtered parallel sentences formatted as translation instructions, and a small fraction of instruction-like examples. This stage covers 27 languages and dialects, 47 language pairs, and over 32 billion tokens.
2. Supervised Fine-Tuning: Refines the model on a combination of translation tasks and diverse instruction-following scenarios, such as code generation, mathematical problem-solving, and question answering.
3. Weighted Preference Optimization: Steers the model toward preferred outputs using group-relative policy updates trained on off-policy signals and human-edited translation variants.
4. Reinforcement Learning with Verifiable Rewards: Uses regex-based checks and preference annotations to reinforce the model's ability to follow explicit instructions during translation, improving its precision and reliability (a sketch of such a check appears after this list).
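The article does not spell out the exact reward rules, but regex-based verifiable rewards are typically simple pass/fail checks over the model output. The Python sketch below is an illustrative assumption of how such a check could look; the constraint patterns and the scoring scheme are hypothetical and not taken from the TOWER+ recipe.

```python
import re

def formatting_reward(instruction: str, output: str) -> float:
    """Fraction of checkable constraints in the instruction that the output satisfies."""
    score, checks = 0.0, 0

    # Hypothetical check: a glossary term the instruction asks to keep unchanged.
    for term in re.findall(r'keep the term "([^"]+)"', instruction, flags=re.IGNORECASE):
        checks += 1
        score += 1.0 if term in output else 0.0

    # Hypothetical check: the instruction asks for a bulleted list.
    if "bulleted list" in instruction.lower():
        checks += 1
        has_bullets = any(line.lstrip().startswith(("-", "*")) for line in output.splitlines())
        score += 1.0 if has_bullets else 0.0

    # Hypothetical check: spans in backticks must be copied verbatim, not translated.
    for span in re.findall(r"`([^`]+)`", instruction):
        checks += 1
        score += 1.0 if span in output else 0.0

    return score / checks if checks else 1.0

# Example: both constraints are satisfied, so the reward is 1.0.
instr = 'Translate to German as a bulleted list and keep the term "TOWER+" unchanged.'
out = "- TOWER+ ist ein einheitliches Modell.\n- Es befolgt Formatierungsvorgaben."
print(formatting_reward(instr, out))
```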
Benchmark Results

The TOWER+ suite has posted strong results across sizes:

- 9B model: a 33.47% win rate on multilingual general chat prompts, an XCOMET-XXL score of 84.38 across 24 language pairs, and combined IF-MT scores of 4.85 for instruction adherence and 88.51 for translation fidelity.
- 72B model: a 54.52% win rate on M-ArenaHard, an IFEval instruction-following score of 89.02, and an XCOMET-XXL score of 83.29 on the full WMT24++ benchmark. The flagship model also earned IF-MT scores of 5.55 for instruction adherence and 88.95 for translation, setting a new open-weight standard.
- 2B model: despite its smaller size, the 2B variant performed competitively, with a 6.33% win rate on M-ArenaHard and an IF-MT translation-quality score of 87.65.

When benchmarked against models such as GPT-4o-1120, Claude 3.7 Sonnet, ALMA-R, Gemma 2, and Llama 3.3, TOWER+ consistently matches or outperforms them on both specialized and general tasks.

Key Technical Insights

- Parameter variants: The TOWER+ models range from 2 billion to 72 billion parameters, each probing a different point on the performance frontier.
- Data curation: Continued pretraining uses a balanced mix of roughly 66% monolingual, 33% parallel, and 1% instruction-like content.
- Multilingual coverage: The models span a wide range of languages and dialects, ensuring versatility.
- Unified checkpoints: Merging specialized and general checkpoints helps keep the two capability sets in balance.
- Reproducibility: The paper documents a detailed, reproducible recipe for building LLMs that excel at both translation and conversational tasks.

Evaluation and Industry Impact

The TOWER+ framework has drawn attention for integrating translation and instruction-following in a single model, reducing the need to maintain separate systems and streamlining deployment. Its scalable design makes it a promising option for enterprises and research institutions that require high-fidelity translation alongside robust general language skills. Unbabel, known for its work in AI-powered translation, continues to push what is possible with multilingual LLMs: the TOWER+ models are a significant step toward the elusive balance between specialized translation accuracy and general-purpose utility, and could change how companies deploy AI in multilingual settings. Credit goes to the team of researchers behind the project; for more detail, see the research paper and the released models.
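For readers who want to try the released checkpoints, the sketch below loads one with Hugging Face transformers. The repository name "Unbabel/Tower-Plus-9B" is an assumption based on the naming used in this article; check Unbabel's Hugging Face page for the exact identifiers.

```python
# Minimal sketch, assuming the 9B checkpoint is published on Hugging Face
# under a name like "Unbabel/Tower-Plus-9B" (unverified assumption) and
# ships with a chat template.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Unbabel/Tower-Plus-9B"  # hypothetical repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{
    "role": "user",
    "content": 'Translate into European Portuguese, keeping the term "TOWER+" unchanged:\n'
               "TOWER+ balances translation quality with instruction-following.",
}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```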