Introducing SETA: Open Source RL Environments for Terminal Agents

What is SETA?

What does an end-to-end stack for terminal agents look like when you combine structured toolkits, synthetic RL environments, and benchmark-aligned evaluation? A team of researchers from CAMEL AI, Eigent AI, and other collaborators has released SETA, a toolkit and environment stack focused on reinforcement learning for terminal agents. The project targets agents that operate inside a Unix-style shell and must complete verifiable tasks under a benchmark harness such as Terminal Bench.

Three Main Contributions

State-of-the-art Terminal Agent: A Claude Sonnet 4.5 based agent achieves leading performance on Terminal Bench 2.0, and a GPT 4.1 based agent does so on Terminal Bench 1.0. In both cases the comparison is restricted to agents that use the same base model.
Scalable RL Training: An initial synthetic dataset of 400 terminal tasks is released, covering a range of difficulty levels. Of these, 260 tasks are used for RLVR (reinforcement learning with verifiable rewards) finetuning of a Qwen3-8B model.
Clean Agent Design: A single agent implementation is used for both local task runs and the official Terminal Bench evaluation.

Terminal Toolkit and Log Structure

The SETA code repository includes a Terminal Toolkit that transforms a language model into an executable terminal agent. Each task run creates a structured log directory under evaluation/terminal_bench_run, with a concrete layout demonstrated for a task called play-zork.

Key files include:

chatagent.log: Records the full history of agent messages and tool calls, including test results.
sessions: A directory containing session_logs that capture terminal interactions.
blocking_commands.log, session_run_zork_1_correct_path.log, and similar files: Store command outputs for different sessions.
tests.log: Records the test run output; tests.log.strip holds the same output with terminal control characters removed.

This log structure supports debugging: a high-level agent decision can be traced down to the individual shell commands it produced.
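To make the relationship between tests.log and tests.log.strip concrete, here is a minimal Python sketch of removing terminal control characters from a log file. The article does not say how SETA produces the stripped file, so the function names (strip_terminal_controls, write_stripped_copy) and the regular expressions are assumptions for illustration, not the toolkit's implementation.

```python
import re
from pathlib import Path

# CSI sequences (colors, cursor movement) and OSC sequences (e.g. window titles).
ANSI_ESCAPE = re.compile(
    r"\x1b\[[0-?]*[ -/]*[@-~]"               # CSI: ESC [ params, final byte
    r"|\x1b\][^\x07\x1b]*(?:\x07|\x1b\\)"    # OSC: ESC ] text, ended by BEL or ESC \
)
# Remaining control characters, keeping tab (\x09) and newline (\x0a).
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b-\x1f\x7f]")


def strip_terminal_controls(text: str) -> str:
    """Return text with escape sequences and stray control characters removed."""
    text = ANSI_ESCAPE.sub("", text)
    text = text.replace("\r\n", "\n")  # normalize terminal line endings
    return CONTROL_CHARS.sub("", text)


def write_stripped_copy(log_path: Path) -> Path:
    """Write '<name>.strip' next to log_path (hypothetical helper)."""
    stripped_path = log_path.with_name(log_path.name + ".strip")
    raw = log_path.read_text(encoding="utf-8", errors="replace")
    stripped_path.write_text(strip_terminal_controls(raw), encoding="utf-8")
    return stripped_path


if __name__ == "__main__":
    print(repr(strip_terminal_controls("\x1b[32mPASSED\x1b[0m\r\n")))  # 'PASSED\n'
```

A stripped copy like this is easier to grep or diff than the raw output, since color codes and cursor movements no longer break up the text.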

Note Taking Toolkit as Persistent Memory

The research team introduces a Note Taking Toolkit that functions as persistent memory for long-horizon tasks. The toolkit lets the agent write and read notes in a structured format while solving terminal tasks, giving it an explicit channel for externalizing intermediate results.
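The idea of notes as persistent memory can be sketched with a small file-backed store. This is an illustration only: the class name NoteStore, its methods, and the JSON file layout are hypothetical and are not the Note Taking Toolkit's actual API, which the article does not describe.

```python
import json
from pathlib import Path


class NoteStore:
    """Hypothetical file-backed note store; not the SETA toolkit API."""

    def __init__(self, path):
        self.path = Path(path)

    def _load(self) -> dict:
        # Notes live on disk, so they survive across agent steps and restarts.
        if self.path.exists():
            return json.loads(self.path.read_text(encoding="utf-8"))
        return {}

    def write_note(self, title: str, content: str) -> None:
        """Append content under title and persist the whole store."""
        notes = self._load()
        notes.setdefault(title, []).append(content)
        self.path.write_text(json.dumps(notes, indent=2), encoding="utf-8")

    def read_note(self, title: str) -> str:
        """Return every entry recorded under title, oldest first."""
        return "\n".join(self._load().get(title, []))

    def list_notes(self) -> list[str]:
        """Return all note titles in sorted order."""
        return sorted(self._load())


if __name__ == "__main__":
    store = NoteStore("notes.json")
    store.write_note("plan", "1. inspect the repo layout")
    store.write_note("plan", "2. run the failing test")
    print(store.read_note("plan"))
```

Because every write goes straight to disk, an agent that loses earlier context on a long task can reread its own intermediate findings instead of recomputing them.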

Understanding the Performance

SETA’s agent harness reports leading results on Terminal Bench. With Claude Sonnet 4.5, the CAMEL agent achieves 46.5% accuracy on Terminal Bench 2.0, 3 percentage points ahead of the second-ranked system. On Terminal Bench 1.0, a GPT 4.1 based agent scores 35% accuracy, again above the next entry. The supervised Qwen3-8B baseline achieved only 3.4%; the Qwen3-8B terminal agent trained with the SETA RL pipeline improves over this baseline on the curated synthetic environments.

Key Takeaways

Joint Community Project: SETA offers agent toolkits and synthetic RL environments tailored for terminal agents and aligned with the Terminal Bench evaluation format.
Outstanding Performance: Demonstrates state-of-the-art results for CAMEL terminal agents using Claude Sonnet 4.5 on Terminal Bench 2.0 and GPT 4.1 on Terminal Bench 1.0.
400 Synthetic Terminal Tasks: Available on Hugging Face, each packaged as task.yaml, Dockerfile, and run-tests.sh.
Structured Logging and Memory Tools: Includes a Terminal Toolkit with structured logging and a Note Taking Toolkit that integrates with Terminal Bench evaluation scripts.
Reproducible Design: Offers a clean, reproducible stack for training, debugging, and evaluating terminal agents without relying on ad hoc examples.
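As a small illustration of the task packaging mentioned above (task.yaml, Dockerfile, and run-tests.sh), the following Python sketch checks whether a downloaded task directory contains all three files. The function name missing_task_files and the example directory path are hypothetical; the contents of each file are not specified in this article and are not inspected here.

```python
from pathlib import Path

# File names taken from the article's description of a packaged task.
REQUIRED_FILES = ("task.yaml", "Dockerfile", "run-tests.sh")


def missing_task_files(task_dir) -> list[str]:
    """Return the required file names that are absent from task_dir."""
    task_dir = Path(task_dir)
    return [name for name in REQUIRED_FILES if not (task_dir / name).is_file()]


if __name__ == "__main__":
    # "tasks/example-task" is a placeholder path, not a real SETA task name.
    missing = missing_task_files("tasks/example-task")
    print("complete" if not missing else f"missing: {missing}")
```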