AegisLLM: Revolutionizing LLM Security with Adaptive Multi-Agent Systems at Inference
AegisLLM introduces a dynamic multi-agent system that improves LLM security during inference by continuously adapting to evolving threats without retraining.
The Escalating Threats to Large Language Models
Large Language Models (LLMs) face growing risks from sophisticated attacks such as prompt injection, jailbreaking, and unauthorized extraction of sensitive data. Traditional defenses that rely on static, training-time protections fall short against these dynamic, evolving threats: static filters are easily bypassed by subtle adversarial rephrasings, and training-time interventions often fail to address vulnerabilities discovered after deployment. Techniques like machine unlearning, meanwhile, do not guarantee complete removal of sensitive knowledge, leaving models open to leakage. Most current security strategies concentrate on training-time measures, with limited attention to real-time or system-level defenses.
Limitations of Current Security Approaches
Methods like Reinforcement Learning from Human Feedback (RLHF) and safety fine-tuning improve alignment during training but offer limited resilience to unforeseen attacks after deployment. System-level guardrails and red-teaming add further protection, yet they remain brittle under adversarial manipulation. Attempts to unlearn unsafe behaviors show promise but cannot fully suppress unwanted knowledge. And while multi-agent architectures excel at managing complex tasks, their application to LLM security at inference time remains underexplored. Existing agentic optimization techniques such as TEXTGRAD, OPTO, and DSPy focus on iterative refinement and prompt optimization, but they have not been systematically applied to inference-time security.
Introducing AegisLLM: An Adaptive Multi-Agent Security Framework
AegisLLM, developed by researchers from the University of Maryland, Lawrence Livermore National Laboratory, and Capital One, presents a novel approach to securing LLMs through a cooperative, inference-time multi-agent system. This framework deploys autonomous LLM-powered agents working together to continuously detect, analyze, and mitigate adversarial threats. Key components include the Orchestrator, Deflector, Responder, and Evaluator agents. By leveraging automated prompt optimization and Bayesian learning, AegisLLM refines its defenses dynamically without requiring costly retraining of the base model. This design ensures real-time adaptability to new attack vectors while maintaining the model’s utility.
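To make the division of labor concrete, here is a minimal Python sketch of how such a four-agent flow could be wired together. The agent names come from the paper; the generic `llm` client, the `RESTRICTED`/`PASS` verdict strings, and the routing logic are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass

# Hypothetical stand-in for a chat-completion call; wire in any LLM client.
def llm(system_prompt: str, user_query: str) -> str:
    raise NotImplementedError("connect your model client here")


@dataclass
class Agent:
    """One specialized role, defined entirely by its system prompt."""
    system_prompt: str

    def run(self, query: str) -> str:
        return llm(self.system_prompt, query)


class AegisPipeline:
    """Minimal sketch of a four-agent inference-time flow (assumed routing)."""

    def __init__(self, orchestrator: Agent, deflector: Agent,
                 responder: Agent, evaluator: Agent):
        self.orchestrator = orchestrator  # routes each query: benign vs. restricted
        self.deflector = deflector        # issues a safe deflection for flagged queries
        self.responder = responder        # answers queries routed as benign
        self.evaluator = evaluator        # audits the candidate answer before release

    def answer(self, query: str) -> str:
        verdict = self.orchestrator.run(query)
        if "RESTRICTED" in verdict.upper():
            return self.deflector.run(query)
        draft = self.responder.run(query)
        audit = self.evaluator.run(f"Query: {query}\nAnswer: {draft}")
        # If the auditor does not pass the draft, fall back to deflection.
        return draft if "PASS" in audit.upper() else self.deflector.run(query)
```

Because every role is defined by its system prompt alone, hardening any agent reduces to rewriting its prompt, which is exactly what the automated optimization described next exploits.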
Coordinated Agent Pipeline and Automated Prompt Optimization
AegisLLM operates as a coordinated pipeline in which each agent fulfills a specialized role governed by a system prompt that encodes its behavior. Because manually crafted prompts often underperform on complex security tasks, the framework refines them through iterative automated optimization: in each iteration, batches of queries are tested against candidate prompt configurations, and the configurations that best detect and mitigate threats are retained for each agent's role.
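As a rough illustration of what such a loop might look like, the sketch below treats each candidate system prompt as an arm in a Bayesian bandit, scoring it on query batches and concentrating trials on the most promising configuration. This is one plausible instantiation of automated prompt optimization with Bayesian learning; the `score_fn` hook, the Beta-posterior bookkeeping, and the Thompson-sampling choice are assumptions, not the paper's exact procedure.

```python
import random

def optimize_prompt(candidates, query_batches, score_fn, iterations=20):
    """Bandit-style prompt search: maintain a Beta posterior over each
    candidate prompt's per-query success rate, pick candidates via Thompson
    sampling, and update posteriors from batch results. An illustrative
    sketch, not the authors' implementation."""
    # Beta(1, 1) prior (uniform) over each candidate prompt's success rate.
    posteriors = {c: [1, 1] for c in candidates}
    for _ in range(iterations):
        # Thompson sampling: draw from each posterior, test the best draw.
        draws = {c: random.betavariate(a, b) for c, (a, b) in posteriors.items()}
        chosen = max(draws, key=draws.get)
        batch = random.choice(query_batches)
        # score_fn(prompt, query) -> bool: did the agent handle this query
        # correctly (e.g. flagged an unsafe query, answered a benign one)?
        successes = sum(score_fn(chosen, q) for q in batch)
        posteriors[chosen][0] += successes
        posteriors[chosen][1] += len(batch) - successes
    # Return the candidate with the highest posterior mean success rate.
    return max(posteriors, key=lambda c: posteriors[c][0] / sum(posteriors[c]))
```

Thompson sampling keeps occasionally revisiting weaker prompts rather than committing early, which matters when the distribution of attacks shifts over time.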
Performance Evaluation on Security Benchmarks
Testing on the WMDP benchmark with Llama-3-8B shows AegisLLM driving accuracy on restricted topics close to its floor, indicating strong filtering of unsafe knowledge. On the TOFU benchmark, it achieves near-perfect flagging across multiple models, including Llama-3-8B, Qwen2.5-72B, and DeepSeek-R1, with Qwen2.5-72B approaching 100% accuracy. Against jailbreaking attacks, AegisLLM maintains robust protection while still responding appropriately to legitimate inputs, as reflected in a competitive 0.038 StrongREJECT score (lower is better) and an 88.5% compliance rate, all without extensive retraining.
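For context on how such numbers are tallied, a generic harness like the one below counts refusals on unsafe prompts and compliance on benign ones. This is a plain refusal/compliance counter, not the actual StrongREJECT autograder (which uses a rubric-based judge); `is_refusal` is a stand-in for a real refusal classifier, and `pipeline` is assumed to expose the `answer` method from the earlier sketch.

```python
def evaluate(pipeline, unsafe_prompts, benign_prompts, is_refusal):
    """Tally safety and utility: refusal rate on unsafe prompts and
    compliance rate on benign ones. A simplified stand-in for
    benchmark-specific graders."""
    refused = sum(is_refusal(pipeline.answer(p)) for p in unsafe_prompts)
    complied = sum(not is_refusal(pipeline.answer(p)) for p in benign_prompts)
    return {
        "refusal_rate_unsafe": refused / len(unsafe_prompts),
        "compliance_rate_benign": complied / len(benign_prompts),
    }
```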
Shifting Paradigms: From Static Defenses to Dynamic Agentic Coordination
AegisLLM reframes LLM security as an emergent property of coordinated, specialized agents operating at inference time, rather than a static characteristic baked in during training. This shift addresses the shortcomings of earlier methods by enabling scalable, adaptive security that responds in real time to evolving threats. As language models grow more powerful, frameworks like AegisLLM will be essential for responsible and secure AI deployment.
For more details, check out the Paper and GitHub Page. All credit goes to the dedicated researchers behind this project.