OpenAI Releases gpt-oss-safeguard: Open-Weight Policy-Conditioned Safety Reasoners
OpenAI published a research preview of gpt-oss-safeguard, two open-weight models that apply developer-supplied policies at inference time; the 120B and 20B models are available on Hugging Face under Apache 2.0
What gpt-oss-safeguard Is
OpenAI published a research preview of gpt-oss-safeguard, a pair of open-weight safety reasoning models designed to let developers apply custom safety policies at inference time. The release includes two sizes: gpt-oss-safeguard-120b and gpt-oss-safeguard-20b. Both are fine-tuned from gpt-oss, licensed under Apache 2.0, and available on Hugging Face for local deployment.
Policy-conditioned safety instead of fixed moderation
Traditional moderation models are trained against a single, fixed policy. When policies change, the usual response is retraining or swapping models. gpt-oss-safeguard inverts that flow: the developer supplies the policy as part of the model input along with the user content, and the model reasons step by step to determine whether the content violates the supplied policy. Treating safety as a prompt-and-evaluation task makes the approach more adaptable to fast-changing or domain-specific harms such as fraud, biological risks, self-harm, or game-specific abuse.
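As a rough illustration of what that input looks like, the sketch below sends a policy and a piece of content to a locally served checkpoint through an OpenAI-compatible endpoint (for example, one exposed by vLLM). The policy text, endpoint URL, and expected output format are illustrative assumptions, not values taken from the release.

```python
# Minimal sketch of policy-conditioned moderation, assuming the model is served
# behind a local OpenAI-compatible endpoint; policy wording and thresholds are
# placeholders a platform would replace with its own taxonomy.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

POLICY = """\
Policy: Marketplace fraud
- VIOLATION (label 1): content that solicits payment off-platform,
  requests gift cards, or advertises counterfeit goods.
- SAFE (label 0): ordinary buying, selling, and price negotiation.
Return a label and a one-sentence rationale.
"""

user_content = "DM me, I only take payment in gift cards before shipping."

response = client.chat.completions.create(
    model="gpt-oss-safeguard-20b",
    messages=[
        {"role": "system", "content": POLICY},      # the policy travels with every request
        {"role": "user", "content": user_content},  # the content to classify
    ],
)
print(response.choices[0].message.content)  # label plus the model's reasoning
```

Because the policy is just part of the prompt, changing it means editing that system message rather than retraining or redeploying a classifier.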
Mirror of OpenAI's internal Safety Reasoner
OpenAI describes gpt-oss-safeguard as an open-weight implementation of the Safety Reasoner it uses internally across systems such as GPT-5, ChatGPT Agent, and Sora 2. In production, OpenAI typically runs small, fast, high-recall filters first and escalates ambiguous or sensitive items to a reasoning model; in recent launches, up to 16 percent of compute was spent on safety reasoning. The open release lets external teams reproduce this defense-in-depth pattern instead of guessing at how OpenAI's internal stack operates.
Model sizes and hardware targets
- gpt-oss-safeguard-120b: ~117B parameters with 5.1B active parameters; sized to fit on a single 80GB H100-class GPU.
- gpt-oss-safeguard-20b: ~21B parameters with 3.6B active parameters; targets lower-latency use cases and smaller GPUs, including 16GB setups.
Both models are trained on the harmony response format, so prompts must follow that structure to avoid degraded results. The Apache 2.0 license permits commercial local deployment, matching the parent gpt-oss models.
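For teams deploying locally, a minimal inference sketch with Hugging Face transformers is shown below. It assumes the published checkpoint ships a chat template that renders prompts into the harmony format, as the parent gpt-oss checkpoints do, and the repository id simply follows the release naming; the policy and sample content are placeholders.

```python
# Local-deployment sketch with Hugging Face transformers; assumes the checkpoint's
# chat template emits harmony-formatted prompts, as the parent gpt-oss models do.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-safeguard-20b"  # repo id assumed from the release naming
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto"
)

messages = [
    {"role": "system", "content": "Policy: flag posts that give instructions facilitating self-harm."},
    {"role": "user", "content": "Example post to classify goes here."},
]
# apply_chat_template handles the harmony structure so prompts are well-formed.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Hand-rolling the prompt instead of using the chat template risks deviating from the harmony structure and degrading results.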
Evaluation highlights
OpenAI evaluated the models on internal multi-policy tests and public datasets. In multi-policy accuracy tasks, where the model must correctly apply multiple policies simultaneously, gpt-oss-safeguard and OpenAI's internal Safety Reasoner outperform gpt-5-thinking and open gpt-oss baselines. On OpenAI's 2022 moderation dataset, the new models slightly outperform both gpt-5-thinking and the internal Safety Reasoner, though OpenAI cautions the margin is not statistically significant. On ToxicChat the internal Safety Reasoner still leads, with gpt-oss-safeguard close behind. Overall, the open models are competitive for real moderation tasks.
Recommended deployment pattern
Running full reasoning on every request is computationally expensive. OpenAI recommends a layered pipeline: use small, fast, high-recall classifiers on all traffic, and route only uncertain or sensitive items to gpt-oss-safeguard. For fast user-facing flows, run the reasoner asynchronously. This mirrors OpenAI's production guidance and acknowledges that task-specific classifiers can be more cost-effective when strong labeled datasets exist.
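One way to express that routing logic is sketched below. fast_toxicity_score and run_safeguard_async are hypothetical stand-ins for whatever lightweight classifier and asynchronous review worker a platform already operates, and the thresholds are illustrative.

```python
# Sketch of the layered pattern: a cheap high-recall classifier screens all
# traffic, and only the ambiguous band is queued for the reasoning model.
from dataclasses import dataclass

@dataclass
class Decision:
    allow: bool
    needs_review: bool

CLEAR_ALLOW = 0.10   # below this score, trust the fast classifier's verdict
CLEAR_BLOCK = 0.95   # above this score, block without escalation

def fast_toxicity_score(text: str) -> float:
    """Placeholder for a small, fast, high-recall classifier already in production."""
    return 0.5

def run_safeguard_async(text: str) -> None:
    """Placeholder: enqueue the item for asynchronous review by gpt-oss-safeguard."""
    ...

def moderate(content: str) -> Decision:
    score = fast_toxicity_score(content)
    if score < CLEAR_ALLOW:
        return Decision(allow=True, needs_review=False)
    if score > CLEAR_BLOCK:
        return Decision(allow=False, needs_review=False)
    # Ambiguous band: keep the user-facing path fast and let the reasoner
    # review asynchronously, per the guidance above.
    run_safeguard_async(content)
    return Decision(allow=True, needs_review=True)
```

The thresholds and the decision to allow pending review are product choices; the point is that full reasoning runs only on the small slice of traffic the cheap classifier cannot settle.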
Practical takeaways
- gpt-oss-safeguard provides policy-conditioned safety reasoning so policy updates do not require retraining.
- The models implement the same Safety Reasoner pattern used internally by OpenAI and are sized for real deployments.
- Both models use the harmony response format and are available under Apache 2.0 on Hugging Face, enabling local commercial deployment.
- Evaluation places the models in a competitive range for moderation, with nuanced differences versus internal systems that are not always statistically significant.
- OpenAI recommends using these models as part of a layered moderation stack alongside community resources and fast classifiers.
OpenAI's open-weight release makes an internal safety pattern reproducible, enabling platforms to apply custom taxonomies, audit chain-of-thought reasoning, and update policies without changing model weights.