WINGS Architecture: Solving Text-Only Forgetting in Multimodal Large Language Models
WINGS introduces a dual-learner architecture that balances attention between text and images in multimodal large language models, significantly reducing text-only forgetting and boosting performance across benchmarks.
Multimodal LLMs: Bridging Text and Vision
Large language models (LLMs) are expanding beyond text to include image understanding, creating multimodal LLMs (MLLMs) that can seamlessly interpret and respond to both visual and textual inputs. This advancement enhances AI applications across fields such as education, content creation, and interactive digital assistants.
The Problem of Text-Only Forgetting
Adding visual data to language models introduces a challenge known as text-only forgetting. When image tokens are embedded within text sequences, the model's attention shifts towards visual information, resulting in degraded performance on purely textual tasks like reading comprehension, reasoning, and text-based question answering.
Existing Solutions and Their Drawbacks
Current methods to mitigate text-only forgetting include retraining on large volumes of text-only data, alternating between text-only and multimodal fine-tuning, and integrating adapters or prompt-based tuning layers. However, these approaches often increase training complexity, computational costs, or fail to fully restore the model’s textual understanding due to the fundamental shift in attention caused by visual tokens.
Introducing WINGS: Dual Learners for Balanced Attention
Researchers at Alibaba and Nanjing University proposed WINGS, a novel architecture that incorporates two modules—visual and textual learners—into each layer of the MLLM. These modules operate alongside the core attention mechanism, resembling wings attached to attention layers. A routing system dynamically manages the allocation of attention between visual and textual learners based on the input token composition, maintaining a balanced focus.
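To make the layout concrete, here is a minimal sketch of one WINGS-augmented decoder layer: a visual learner and a textual learner sit beside the main attention, and a small router produces per-token weights that decide how much each learner contributes. The module and argument names are illustrative (the learners below are plain cross-attention stand-ins; the paper's LoRRA learners are sketched in the next section), not the released implementation.

```python
import torch
import torch.nn as nn

class WingsLayer(nn.Module):
    """Illustrative WINGS-style layer: main attention plus two routed 'wing' learners."""

    def __init__(self, main_attn: nn.Module, d_model: int, n_heads: int = 8):
        super().__init__()
        self.main_attn = main_attn  # pretrained LLM self-attention (kept as-is)
        self.visual_learner = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.textual_learner = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.router = nn.Linear(d_model, 2)  # per-token weights for the two wings

    def forward(self, hidden, visual_tokens, text_tokens):
        # Core attention output from the original model.
        main_out = self.main_attn(hidden)
        # Each learner cross-attends from the hidden states to one modality's tokens.
        v_out, _ = self.visual_learner(hidden, visual_tokens, visual_tokens)
        t_out, _ = self.textual_learner(hidden, text_tokens, text_tokens)
        # Soft routing: mix the two learners' outputs per token.
        w = torch.softmax(self.router(hidden), dim=-1)  # (batch, seq, 2)
        return main_out + w[..., :1] * v_out + w[..., 1:] * t_out
```

The key design point is that the main attention path is left untouched; the wings only add residual, routed contributions, so text-only behavior is preserved when the router down-weights the visual learner.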
Efficient Attention with Low-Rank Residual Attention (LoRRA)
WINGS utilizes Low-Rank Residual Attention (LoRRA) to keep computations efficient while capturing modality-specific features. Training proceeds in two stages: first activating visual learners for image feature alignment, then co-training both learners with a router module that assigns attention weights. Each learner processes either visual or textual data through lightweight attention blocks, and their outputs integrate with the main model to prevent visual dominance over text understanding.
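A rough sketch of what a LoRRA learner could look like follows, assuming low-rank factorized projections and a residual connection back to the hidden states; the rank, projection scheme, and naming here are assumptions for illustration, not the authors' exact formulation.

```python
import torch
import torch.nn as nn

class LoRRA(nn.Module):
    """Sketch of a Low-Rank Residual Attention learner (assumed form).

    Queries come from the layer's hidden states; keys and values come from the
    modality-specific tokens (visual or textual). Every projection is factorized
    through a small rank r, so each learner adds only a few parameters.
    """

    def __init__(self, d_model: int, rank: int = 16):
        super().__init__()

        def low_rank():  # W is approximated as B @ A with A: d -> r, B: r -> d
            return nn.Sequential(nn.Linear(d_model, rank, bias=False),
                                 nn.Linear(rank, d_model, bias=False))

        self.q_proj, self.k_proj, self.v_proj = low_rank(), low_rank(), low_rank()
        self.scale = d_model ** -0.5

    def forward(self, hidden: torch.Tensor, modality_tokens: torch.Tensor) -> torch.Tensor:
        q = self.q_proj(hidden)            # (batch, T, d)
        k = self.k_proj(modality_tokens)   # (batch, S, d)
        v = self.v_proj(modality_tokens)   # (batch, S, d)
        attn = torch.softmax(q @ k.transpose(-1, -2) * self.scale, dim=-1)
        return hidden + attn @ v           # residual: add back onto the hidden states
```

In the two-stage recipe described above, only the visual learners (and the projection aligning image features) would be trained first; the textual learners and router are then trained jointly in the second stage.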
Impressive Performance Gains
WINGS demonstrated significant improvements on several benchmarks. It increased text-only scores on MMLU by 9.70 points to 60.53 and on CMMLU by 9.36 points to 69.82. Reasoning benchmarks like Race-High and WSC saw increases of 11.9 and 11.12 points respectively. On multimodal datasets such as MMMU-VAL, WINGS improved by 4.78 points and showed superior handling of mixed text-image dialogues on the IIT benchmark compared to other open-source MLLMs.
WINGS sets a new standard for balanced, efficient multimodal large language models by effectively addressing text-only forgetting through its dual-learner architecture and attention routing.