NVIDIA Launches Open-Source Safety Framework to Protect Agentic AI Systems
NVIDIA has released an open-source safety recipe for securing advanced agentic AI systems, providing tools for evaluation, alignment, and real-time monitoring that strengthen security and compliance.
The Rise of Agentic AI and Its Challenges
Large language models (LLMs) are evolving beyond simple text generation to become agentic AI systems capable of planning, reasoning, and autonomous action. While this advancement unlocks powerful automation opportunities for enterprises, it also introduces significant risks such as goal misalignment, prompt injection attacks, unintended behaviors, data leakage, and diminished human oversight.
NVIDIA’s Comprehensive Safety Solution
To address these challenges, NVIDIA has developed and open-sourced a complete safety software suite and a post-training safety recipe designed to secure agentic AI systems throughout their lifecycle.
Lifecycle-Wide Protection
The safety recipe covers all stages from evaluation before deployment, through post-training alignment, to continuous protection during operation:
- Evaluation: Pre-deployment testing against enterprise policies, security requirements, and trust thresholds using open datasets and benchmarks.
- Post-Training Alignment: Enhancing models with Reinforcement Learning (RL), Supervised Fine-Tuning (SFT), and on-policy dataset blending to ensure safety compliance.
- Continuous Protection: Deployment of NVIDIA NeMo Guardrails and real-time monitoring microservices that actively block unsafe outputs and defend against prompt injection and jailbreak attempts (a minimal runtime sketch follows below).
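To make the continuous-protection stage concrete, here is a minimal sketch of wiring NVIDIA NeMo Guardrails around an LLM endpoint using its Python API. The Colang flow and the model settings are illustrative assumptions, not the exact guardrails configuration shipped with NVIDIA's recipe:

```python
from nemoguardrails import LLMRails, RailsConfig

# Minimal sketch: an in-line YAML model config plus a Colang rail.
# The engine/model choice and the example flow are placeholders, not
# the configuration NVIDIA ships with its safety recipe.
yaml_content = """
models:
  - type: main
    engine: openai            # placeholder engine; any supported LLM works
    model: gpt-3.5-turbo-instruct
"""

colang_content = """
define user ask about illegal activity
  "how do I pick a lock"

define bot refuse illegal activity
  "I can't help with that request."

define flow
  user ask about illegal activity
  bot refuse illegal activity
"""

config = RailsConfig.from_content(
    colang_content=colang_content,
    yaml_content=yaml_content,
)
rails = LLMRails(config)

# Matching unsafe inputs are intercepted by the rail before the model
# can produce a harmful completion.
response = rails.generate(messages=[
    {"role": "user", "content": "how do I pick a lock"}
])
print(response["content"])
```

In production the same rails run as a service in front of the model, so unsafe requests and responses are filtered without changing application code.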
Core Components and Technologies
| Stage | Technology/Tools | Purpose |
|-------|------------------|---------|
| Pre-Deployment Evaluation | Nemotron Content Safety Dataset, WildGuardMix, garak scanner | Test safety and security |
| Post-Training Alignment | RL, SFT, open-licensed data | Fine-tune safety and alignment |
| Deployment & Inference | NeMo Guardrails, NIM microservices | Block unsafe behaviors |
| Monitoring & Feedback | garak, real-time analytics | Detect and resist new attacks |
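As an example of the pre-deployment evaluation stage, the open-source garak scanner can be driven from Python to probe a model for known attack patterns. The model type, model name, and probe selection below are illustrative placeholders, not values prescribed by NVIDIA's recipe:

```python
import subprocess
import sys

# Minimal sketch: run the garak LLM vulnerability scanner as a subprocess.
# Model type/name and the probe family are placeholders; point them at the
# endpoint or checkpoint you actually want to evaluate.
result = subprocess.run(
    [
        sys.executable, "-m", "garak",
        "--model_type", "huggingface",   # generator family under test
        "--model_name", "gpt2",          # placeholder model id
        "--probes", "promptinject",      # probe category: prompt injection
    ],
    capture_output=True,
    text=True,
)
print(result.stdout)
```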
Open Datasets for Safety
- Nemotron Content Safety Dataset v2: Covers a broad range of harmful-content categories for training and evaluating safety classifiers.
- WildGuardMix Dataset: Focuses on content moderation for ambiguous and adversarial prompts.
- Aegis Content Safety Dataset: Contains over 35,000 annotated samples enabling detailed safety filter development.
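These datasets are published openly and can be inspected with the Hugging Face datasets library. The repository ID below is an assumption for the Aegis/Nemotron content safety data; verify the exact name on NVIDIA's Hugging Face page before relying on it:

```python
from datasets import load_dataset

# Minimal sketch: pull a content-safety dataset for inspection.
# The repository ID is an assumption; confirm the exact dataset name
# on NVIDIA's Hugging Face organization page.
ds = load_dataset("nvidia/Aegis-AI-Content-Safety-Dataset-2.0")

print(ds)             # splits and row counts
print(ds["train"][0]) # one annotated sample
```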
Open-Source Post-Training Recipe
NVIDIA’s safety recipe is available as a downloadable open-source Jupyter notebook or as a launchable cloud module. The workflow includes:
- Initial model evaluation with safety and security benchmarks.
- On-policy safety training using supervised fine-tuning and reinforcement learning (a generic sketch follows this list).
- Re-evaluation to confirm improvements.
- Deployment with live monitoring and guardrails.
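NVIDIA's recipe itself ships as a notebook built on NVIDIA's own tooling; purely to make the supervised fine-tuning step concrete, here is a generic SFT sketch using Hugging Face TRL as a stand-in. The base model, dataset ID, and text column are placeholders:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Generic SFT sketch (a stand-in, not NVIDIA's actual recipe): fine-tune a
# small model on a safety-labeled dataset. Model, dataset, and column names
# are placeholders to be replaced with your own choices.
train_ds = load_dataset(
    "nvidia/Aegis-AI-Content-Safety-Dataset-2.0",  # assumed dataset id
    split="train",
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder base model
    train_dataset=train_ds,
    args=SFTConfig(
        output_dir="safety-sft",
        dataset_text_field="prompt",     # placeholder column; match the schema
        max_steps=100,                   # short run for illustration
        per_device_train_batch_size=2,
    ),
)
trainer.train()
```

After training, the model is re-run against the same safety and security benchmarks to confirm the alignment step actually improved the scores.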
Quantitative Improvements
- Content safety improved from 88% to 94% with no accuracy loss.
- Product security against adversarial prompts increased from 56% to 63%.
Ecosystem Collaboration
NVIDIA partners with cybersecurity leaders, integrating with Cisco AI Defense, CrowdStrike, Trend Micro, and ActiveFence to feed continuous safety signals and incident-driven improvements back into the recipe.
Getting Started
The full safety recipe and tools are publicly available for download and cloud deployment. Enterprises can customize policies and iteratively harden their models to maintain trustworthiness amid evolving risks.
Explore NVIDIA’s AI safety recipe to safeguard your agentic AI systems with transparency and robustness.