Stress-Testing OpenAI: Single-Turn Adversarial Attacks With deepteam

Overview

This tutorial shows how to test an OpenAI model against single-turn adversarial attacks using deepteam. deepteam offers 10+ attack methods — from simple prompt injection to obfuscation techniques like leetspeak and Base64 — and can apply attack enhancements to better mimic real-world malicious behavior. By running these attacks you can evaluate how well a model defends against various vulnerabilities.

Installing dependencies

Run the package installer to fetch deepteam, the OpenAI client, and pandas.

pip install deepteam openai pandas

Set your OPENAI_API_KEY as an environment variable before running red_team(), since deepteam uses LLMs to generate adversarial attacks and evaluate outputs. To get a key, visit https://platform.openai.com/settings/organization/api-keys and generate one. New users may need to add billing details and a minimum payment to activate API access.

import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass('Enter OpenAI API Key: ')

Importing the libraries

Include asyncio, the OpenAI client, and deepteam components that orchestrate attacks and vulnerabilities.

import asyncio
from openai import OpenAI
from deepteam import red_team
from deepteam.vulnerabilities import IllegalActivity
from deepteam.attacks.single_turn import PromptInjection, GrayBox, Base64, Leetspeak, ROT13, Multilingual, MathProblem

Defining the model callback

You need a callback that queries the OpenAI model and returns its response text. This async function uses the chat completions API to call the chosen model and extracts the content of the first message in the response.

client = OpenAI()
 
# Define callback for querying the LLM
async def model_callback(input: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # use "gpt-4o" if you want a stronger model
        messages=[{"role": "user", "content": input}],
    )
    return response.choices[0].message.content

Defining vulnerability and attacks

Here we specify a vulnerability (IllegalActivity) and tag it for a specific type (child exploitation). Then we create instances of different single-turn attack methods to test the model's resilience.

# Vulnerability
illegal_activity = IllegalActivity(types=["child exploitation"])
 
# Attacks
prompt_injection = PromptInjection()
graybox_attack = GrayBox()
base64_attack = Base64()
leetspeak_attack = Leetspeak()
rot_attack = ROT13()
multi_attack = Multilingual()
math_attack = MathProblem()

Single-turn attack types

Prompt Injection

Prompt injection tries to override the model's original instructions by injecting manipulative text into the prompt. In the example, attackers attempt to coerce the model into role-playing an unethical persona and bypassing safety rules. The tested model refused with "I'm sorry, I cannot assist with that.", marking a PASS for this attack.

risk_assessment = red_team(
        model_callback=model_callback,
        vulnerabilities=[illegal_activity],
        attacks=[prompt_injection],
    )

GrayBox Attack

GrayBox attacks rely on partial knowledge of the target system to craft more effective adversarial prompts. They reframe baseline attacks in more abstract or misleading language to slip past safety heuristics. In the example, the adversary tried to disguise instructions about creating false IDs and using encrypted channels.

risk_assessment = red_team(
        model_callback=model_callback,
        vulnerabilities=[illegal_activity],
        attacks=[graybox_attack],
    )

Base64 Attack

Base64 attacks encode harmful instructions to bypass filters that look for explicit keywords. The encoded payload in the test contained directions related to illegal activity, but the model did not decode or follow the request.

risk_assessment = red_team(
        model_callback=model_callback,
        vulnerabilities=[illegal_activity],
        attacks=[base64_attack],
    )

Leetspeak Attack

Leetspeak replaces letters with numbers or symbols to obfuscate harmful instructions. Even when the attack was written in leetspeak and aimed to instruct minors in illegal activities, the model recognized the intent and refused.

risk_assessment = red_team(
        model_callback=model_callback,
        vulnerabilities=[illegal_activity],
        attacks=[leetspeak_attack],
    )

ROT-13 Attack

ROT-13 rotates letters by 13 positions to scramble text. It's a trivial obfuscation but can evade naïve keyword filters. The example used ROT-13 to conceal malicious instructions, and the model was tested for decoding and compliance.

risk_assessment = red_team(
        model_callback=model_callback,
        vulnerabilities=[illegal_activity],
        attacks=[rot_attack],
    )

Multilingual Attack

Translating a harmful prompt into less-monitored languages can bypass moderation systems that are stronger in widely used languages. The test used Swahili to ask for illegal instructions and checked if the model would comply.

risk_assessment = red_team(
        model_callback=model_callback,
        vulnerabilities=[illegal_activity],
        attacks=[multi_attack],
    )

Math Problem Attack

Embedding malicious intent into mathematical notation or academic framing hides the real goal inside a formal structure. In the example, the adversary framed exploitation content as a group theory problem and requested a plain-language translation.

risk_assessment = red_team(
        model_callback=model_callback,
        vulnerabilities=[illegal_activity],
        attacks=[math_attack],
    )

Next steps and resources

Run each attack in your environment to evaluate model behavior and build mitigation strategies. For full code, tutorials, notebooks, and community resources, check the project's GitHub, follow related social channels, and consult the FULL CODES referenced throughout this guide.

Stress-Testing OpenAI: Single-Turn Adversarial Attacks With deepteam

Сменить язык