Group Think: Revolutionizing Multi-Agent Collaboration for Faster LLM Reasoning
MediaTek Research introduces Group Think, a novel token-level multi-agent paradigm that enables concurrent reasoning in large language models, significantly speeding up inference and enhancing collaborative problem-solving.
Collaborative Reasoning in Large Language Models
Large language models (LLMs) are increasingly being developed to work collaboratively through multi-agent systems. These systems divide complex problems into tasks handled simultaneously by different agents, aiming to boost efficiency and reduce latency, especially in real-time applications.
Challenges in Existing Multi-Agent Systems
Most current collaborative LLM systems operate on sequential, turn-based communication, which slows down processing since agents wait for others to finish reasoning steps. This approach also leads to duplicated efforts and inconsistent outputs because agents cannot observe peers' evolving thoughts during generation. Such latency and redundancy limit the practical use of multi-agent LLMs on devices with constrained time and computational resources.
Limitations of Current Reasoning Techniques
Techniques like Chain-of-Thought prompting structure problem-solving but increase inference time. More advanced methods like Tree-of-Thoughts and Graph-of-Thoughts expand reasoning paths but lack real-time mutual adaptation among agents. Collaborative multi-agent systems mostly rely on alternating message exchanges, introducing delays. Complex dynamic scheduling or role-based configurations have been proposed but are not optimized for efficient inference.
Introducing Group Think: Token-Level Concurrent Reasoning
MediaTek Research presents Group Think, a novel method allowing multiple reasoning agents within a single LLM to operate concurrently. Agents can observe each other's partial outputs at the token level during generation, adapting their reasoning in real-time. This reduces duplication and lets agents shift focus if another agent is better suited to continue a given reasoning thread.
Group Think uses a token-level attention mechanism that lets each agent attend to the tokens generated by every agent. Each agent is assigned its own sequence of token indices, interleaved with the other agents' indices and stored in a shared cache that all agents can read during generation. This supports efficient cross-agent attention without modifying the transformer architecture. The approach is practical both on personal devices and in data centers: on a local device, batching the agents' token streams together puts to work compute that would otherwise sit idle at batch size one, while in a data center multiple Group Think requests can be processed in the same batch while preserving the correct attention pattern for each.
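To make the idea concrete, here is a minimal sketch of interleaved, token-level concurrent decoding. It is illustrative only and not MediaTek's implementation: the paper describes all agents running inside a single transformer with a shared KV cache and per-agent position indices, whereas this sketch simulates the same visibility pattern in plain Python. The function sample_next_token and the "<eos>" marker are hypothetical stand-ins for one decoding step of an LLM conditioned on the prompt plus every agent's partial output.

```python
# Illustrative sketch of Group Think-style interleaved decoding (assumptions noted above).

def agent_slot(prompt_len: int, step: int, agent_id: int, num_agents: int) -> int:
    """Interleaved position of an agent's token in the shared cache:
    step 0 fills positions prompt_len .. prompt_len + num_agents - 1, and so on."""
    return prompt_len + step * num_agents + agent_id


def group_think_decode(prompt_tokens, sample_next_token, num_agents=4, max_steps=128):
    # One token stream per agent; every stream is visible to every agent,
    # which is what lets agents avoid duplicating each other's work.
    streams = [[] for _ in range(num_agents)]
    finished = [False] * num_agents

    for step in range(max_steps):
        # In the actual system all agents emit their step-t tokens in one
        # batched forward pass; this loop serializes them only for clarity.
        for agent_id in range(num_agents):
            if finished[agent_id]:
                continue
            # Token-level cross-agent visibility: the agent conditions on the
            # prompt and on all partial outputs generated so far by every agent.
            token = sample_next_token(prompt_tokens, agent_id, streams)
            if token == "<eos>":  # hypothetical end-of-sequence marker
                finished[agent_id] = True
            else:
                streams[agent_id].append(token)
        if all(finished):
            break

    return streams
```

Because every agent can see the others' partial streams at each step, an agent that notices a peer already covering a sub-problem can redirect its own tokens elsewhere, which is the behavior the paper reports.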
Performance and Experimental Results
Group Think significantly improves latency and output quality. In enumeration tasks, like listing 100 distinct names, it outperforms Chain-of-Thought approaches, achieving near-complete results faster. Latency reduction scales with the number of agents; for instance, four agents cut latency roughly fourfold. In divide-and-conquer problems, such as applying the Floyd–Warshall algorithm on a five-node graph, four agents halved completion time compared to a single agent. Group Think also excels in code generation tasks, producing correct code segments more rapidly than baseline models when using four or more agents.
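As a rough intuition for the reported scaling (a back-of-envelope model under an even-split assumption, not a measurement from the paper), the number of decoding steps on a perfectly parallelizable enumeration task drops roughly in proportion to the number of agents:

```python
import math

def ideal_decode_steps(total_items: int, tokens_per_item: int, num_agents: int) -> int:
    """Decoding steps needed if items split evenly across agents and each
    agent emits one token per step (idealized assumption)."""
    items_per_agent = math.ceil(total_items / num_agents)
    return items_per_agent * tokens_per_item

# Listing 100 names at an assumed ~3 tokens per name:
# 1 agent  -> 300 steps; 4 agents -> 75 steps (about a 4x latency reduction).
```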
Emergent Collaborative Behavior and Future Potential
Despite no explicit training for collaboration, LLMs under Group Think demonstrate emergent group reasoning, naturally diversifying tasks to avoid redundancy by dividing work by topic or focus area. These promising results indicate that performance could further improve with dedicated training on collaborative datasets.
For full details, check out the original research paper.