DRBench: ServiceNow’s Realistic Benchmark for Enterprise Deep-Research Agents

What DRBench is

ServiceNow Research has released DRBench, a benchmark and runnable environment for evaluating deep research agents on realistic, open-ended enterprise tasks. Unlike web-only testbeds, DRBench mixes public web sources with private organizational data and requires agents to synthesize facts from heterogeneous enterprise artifacts into cited research reports that attribute each claim to its source.

Tasks and dataset composition

The initial release includes 15 deep research tasks spanning 10 enterprise domains such as sales, cybersecurity, and compliance. Each task defines a research question, a task context that sets the company and persona, and a set of ground-truth insights of three types: public insights from stable web URLs, relevant internal insights embedded in enterprise artifacts, and internal distractor insights intended to mislead. The creators inject these insights into realistic files and apps so agents must find the relevant items while avoiding distractors. Dataset construction mixes LLM generation with human verification and totals 114 ground-truth insights across tasks.
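For concreteness, the sketch below shows one way a task and its ground-truth insights could be represented in code. The field names and types are illustrative assumptions, not DRBench's actual task schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class InsightType(Enum):
    PUBLIC = "public"                              # from stable web URLs
    INTERNAL_RELEVANT = "internal_relevant"        # embedded in enterprise artifacts
    INTERNAL_DISTRACTOR = "internal_distractor"    # plausible but misleading


@dataclass
class GroundTruthInsight:
    text: str                 # the insight an agent should surface (or avoid)
    insight_type: InsightType
    source: str               # URL, file path, chat channel, or email thread


@dataclass
class DRBenchTask:
    research_question: str    # the open-ended question to investigate
    company_context: str      # company and persona framing for the task
    domain: str               # e.g. "sales", "cybersecurity", "compliance"
    insights: List[GroundTruthInsight] = field(default_factory=list)
```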

The containerized enterprise environment

A key contribution is the containerized enterprise environment that simulates common company services behind authentication and app-specific APIs. The DRBench Docker image orchestrates Nextcloud for shared documents and WebDAV, Mattermost for team chat, Roundcube with SMTP/IMAP for email, FileBrowser for filesystem access, and a VNC/NoVNC desktop for GUI interactions. Tasks are initialized by distributing documents, chats, and threaded emails across these services and provisioning users with consistent credentials. Agents can operate via the web interfaces or the programmatic APIs. The setup is intentionally needle-in-a-haystack: relevant insights are injected into PDF, DOCX, PPTX, and XLSX files, chats, and emails, then padded with plausible but irrelevant content.
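Because each service speaks a standard protocol, an agent can also reach them programmatically. The snippet below is a rough sketch of such access; the hostnames, ports, paths, and credentials are placeholders assumed for illustration and are not DRBench's documented configuration.

```python
import imaplib
import requests

# Placeholder endpoints and credentials, assumed for illustration only;
# the real DRBench container may use different hosts, ports, and accounts.
HOST = "localhost"
USER, PASSWORD = "analyst", "secret"

# Nextcloud shares documents over WebDAV; a PROPFIND request lists a folder.
listing = requests.request(
    "PROPFIND",
    f"http://{HOST}:8080/remote.php/dav/files/{USER}/",
    auth=(USER, PASSWORD),
    headers={"Depth": "1"},
)
print("WebDAV listing status:", listing.status_code)

# Mattermost exposes a REST API; logging in returns a session token header.
login = requests.post(
    f"http://{HOST}:8065/api/v4/users/login",
    json={"login_id": USER, "password": PASSWORD},
)
token = login.headers.get("Token")
teams = requests.get(
    f"http://{HOST}:8065/api/v4/users/me/teams",
    headers={"Authorization": f"Bearer {token}"},
)
print("Mattermost teams status:", teams.status_code)

# Roundcube is only the webmail UI; the mail itself is reachable over IMAP.
with imaplib.IMAP4(HOST, 143) as imap:
    imap.login(USER, PASSWORD)
    imap.select("INBOX")
    _, data = imap.search(None, "ALL")
    print("Messages in inbox:", len(data[0].split()))
```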

How evaluation works

DRBench evaluates agents along four axes that mirror analyst workflows: Insight Recall, Distractor Avoidance, Factuality, and Report Quality. Insight Recall decomposes the report into atomic, cited insights and uses an LLM judge to match them against the injected ground-truth insights. Distractor Avoidance penalizes inclusion of injected distractor insights. Factuality and Report Quality assess the correctness, clarity, and structure of the final report against a rubric provided by the benchmark.
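In the benchmark the matching itself is done by an LLM judge, but the underlying arithmetic is simple. The sketch below illustrates how Insight Recall and Distractor Avoidance could be computed once a match predicate is available; the function names, the `matches` predicate, and the per-distractor scoring are assumptions standing in for the judge, not DRBench's evaluation code.

```python
from typing import Callable, List

Judge = Callable[[str, str], bool]  # stand-in for the LLM judge's match decision


def insight_recall(report_insights: List[str],
                   groundtruth: List[str],
                   matches: Judge) -> float:
    """Fraction of ground-truth insights covered by at least one report insight."""
    covered = [gt for gt in groundtruth
               if any(matches(ri, gt) for ri in report_insights)]
    return len(covered) / len(groundtruth) if groundtruth else 0.0


def distractor_avoidance(report_insights: List[str],
                         distractors: List[str],
                         matches: Judge) -> float:
    """1.0 when no injected distractor appears in the report, lower per distractor included."""
    included = [d for d in distractors
                if any(matches(ri, d) for ri in report_insights)]
    return (1.0 - len(included) / len(distractors)) if distractors else 1.0
```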

Baseline agent and research loop

The authors introduce a task-oriented baseline agent, DRBench Agent (DRBA), built to operate inside the DRBench environment. DRBA has four components: research planning, action planning, a research loop with Adaptive Action Planning (AAP), and report writing. Planning comes in two modes: Complex Research Planning (CRP), which outlines investigation areas, expected sources, and success criteria, and Simple Research Planning (SRP), which generates lightweight sub-queries. The iterative research loop chooses tools, processes content (including storing vectors in a vector store), identifies gaps, and repeats until completion or a max-iteration budget. The report writer synthesizes findings with citation tracking.
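The loop's control flow is easy to picture. The following sketch captures it under stated assumptions: the planner, tool selector, gap finder, and report writer are abstracted as callables, and a plain list stands in for the vector store; none of this mirrors DRBA's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ResearchState:
    plan: List[str]                                       # open sub-questions to pursue
    findings: List[str] = field(default_factory=list)     # stand-in for the vector store


def research_loop(question: str,
                  make_plan: Callable[[str], List[str]],
                  choose_tool: Callable[[str], Callable[[str], str]],
                  find_gaps: Callable[[ResearchState], List[str]],
                  write_report: Callable[[ResearchState], str],
                  max_iterations: int = 10) -> str:
    state = ResearchState(plan=make_plan(question))       # CRP or SRP would run here
    for _ in range(max_iterations):
        if not state.plan:
            break                                          # plan exhausted, stop early
        sub_query = state.plan.pop(0)
        tool = choose_tool(sub_query)                      # e.g. web search, WebDAV, email
        state.findings.append(tool(sub_query))             # process and store content
        state.plan.extend(find_gaps(state))                # adaptive action planning step
    return write_report(state)                             # synthesis with citation tracking
```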

Why this matters for enterprise agents

Many deep research agents perform well on public web benchmarks, but real production use requires reliably finding internal needles, ignoring plausible internal distractors, and citing both public and private sources while navigating enterprise constraints like login, permissions, and UI friction. DRBench targets these gaps by grounding tasks in realistic company and persona contexts, distributing evidence across multiple enterprise apps plus the web, and explicitly scoring whether an agent extracted intended insights and produced a coherent, factual report. This end-to-end design makes DRBench a practical evaluation tool for system builders who need more than single-tool micro-scores.

Key takeaways

DRBench pairs open-ended research questions with a runnable, containerized enterprise environment, so agents must pull evidence from the public web and from private documents, chats, and email while dodging injected distractors. Evaluation goes beyond retrieval, scoring Insight Recall, Distractor Avoidance, Factuality, and Report Quality against 114 human-verified ground-truth insights across 15 tasks, and the DRBA baseline gives system builders a reference point to iterate on. For more details, see the DRBench paper on arXiv and the project GitHub page.