DS STAR: Google’s Multi‑Agent System That Turns Messy Files into Executable Python Analytics
DS STAR is a multi-agent data science agent from Google that turns heterogeneous files into executable Python via iterative planning, coding and verification, significantly boosting benchmark accuracy.
Turning natural questions into Python over messy data
Google Research presents DS STAR (Data Science Agent via Iterative Planning and Verification), a multi‑agent framework that transforms open‑ended data science questions into executable Python scripts that operate directly on heterogeneous file formats — CSV, JSON, Markdown and unstructured text. Unlike Text‑to‑SQL approaches that assume a clean relational database, DS STAR treats the task as Text‑to‑Python and works across realistic enterprise data lakes.
From file summaries to a shared context
The system begins by summarizing every file in the dataset. An Analyzer agent generates small Python scripts that parse each file and print essential metadata: column names, data types, simple statistics and text snippets. DS STAR executes those scripts and captures a concise description of each file. Together these descriptions form the shared context used by downstream agents, enabling reasoning over both structured and unstructured data sources.
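A minimal sketch of this summarization stage, with hard-coded handlers for a few common formats standing in for the LLM-generated per-file scripts (the function names and snippet length are illustrative, not from the paper):

```python
import csv
import json
from pathlib import Path

def describe_file(path: Path, snippet_chars: int = 300) -> str:
    """Produce a short text description of one data file.

    A simplified stand-in for DS STAR's Analyzer agent, which writes and
    executes a bespoke Python script per file; here a few formats are
    handled directly instead.
    """
    if path.suffix == ".csv":
        with path.open(newline="") as f:
            reader = csv.reader(f)
            header = next(reader, [])
            n_rows = sum(1 for _ in reader)
        return f"{path.name}: CSV, columns={header}, rows={n_rows}"
    if path.suffix == ".json":
        data = json.loads(path.read_text())
        shape = list(data)[:10] if isinstance(data, dict) else f"list of {len(data)} items"
        return f"{path.name}: JSON, top level={shape}"
    # Markdown / unstructured text: keep an opening snippet
    return f"{path.name}: text snippet={path.read_text()[:snippet_chars]!r}"

def build_context(data_dir: str) -> str:
    """Concatenate per-file descriptions into the shared context for downstream agents."""
    files = sorted(p for p in Path(data_dir).iterdir() if p.is_file())
    return "\n\n".join(describe_file(p) for p in files)
```

The key design point is that the descriptions are plain text, so one shared context can mix schemas from CSVs with snippets from Markdown and free-form documents.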
Iterative planning, coding and verification loop
DS STAR mirrors a human notebook workflow with a multi‑agent refinement loop:
- A Planner agent creates an initial executable plan from the user query and the file descriptions.
- A Coder agent converts the current plan into Python code, which is executed to produce an observation.
- A Verifier agent, an LLM-based judge, inspects the plan, the code and the run result and returns a binary verdict: sufficient or insufficient.
- A Router agent decides whether to append a new step or to truncate the plan and regenerate from an earlier step when the verdict is insufficient.
Each new planning step is conditioned on the latest execution result, so refinements address specific failures. The loop repeats until the Verifier accepts the solution or a 20-round limit is reached. A final Finalyzer agent formats the output to meet strict benchmark rules (rounding, CSV output, etc.).
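The loop above can be sketched as follows; the `llm` object and its `plan`, `code`, `verify`, `route` and `finalize` methods are hypothetical stand-ins for the prompted LLM calls behind each agent, not an API from the paper:

```python
import subprocess
import sys
import tempfile

MAX_ROUNDS = 20  # same refinement cap the paper reports

def run_script(code: str) -> str:
    """Execute generated Python in a subprocess and capture its output."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
    proc = subprocess.run([sys.executable, f.name],
                          capture_output=True, text=True, timeout=120)
    return proc.stdout + proc.stderr

def refine(query: str, context: str, llm) -> str:
    """Plan/code/verify/route loop over a growing list of plan steps."""
    plan = [llm.plan(query, context)]                   # initial step (Planner)
    for _ in range(MAX_ROUNDS):
        code = llm.code(plan, context)                  # render plan to Python (Coder)
        observation = run_script(code)                  # execute, capture the result
        if llm.verify(query, plan, code, observation):  # binary verdict (Verifier)
            return llm.finalize(code, observation)      # benchmark formatting (Finalyzer)
        action, step = llm.route(query, plan, observation)  # Router decision
        if action == "add":
            plan.append(step)                           # extend with a new step
        else:
            plan = plan[:step]                          # truncate; later rounds re-extend
    return llm.finalize(code, observation)              # best effort after the cap
```

Because each round re-executes the code and feeds the fresh observation back in, the refinement targets the concrete failure rather than regenerating the whole solution blindly.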
Robustness: debugging and retrieval
DS STAR adds practical robustness modules. A Debugger agent repairs failing scripts by conditioning on the script, its traceback and the previously generated analyzer descriptions, which is crucial for data-centric bugs that depend on schema and column names. A Retriever handles datasets with thousands of candidate files: embeddings of the query and of the file descriptions select the top 100 relevant files (the team used Gemini Embedding 001 in experiments), limiting the context each agent must read.
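A toy version of the retrieval step, assuming embeddings have already been computed (the paper uses Gemini Embedding 001; the cosine-similarity ranking below is a plain-Python illustration, not the paper's implementation):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def top_k_files(query_vec, file_vecs, k=100):
    """Rank files by similarity of their description embedding to the query.

    file_vecs maps file name -> description embedding; k=100 matches the
    cutoff reported in the paper.
    """
    ranked = sorted(file_vecs,
                    key=lambda name: cosine(query_vec, file_vecs[name]),
                    reverse=True)
    return ranked[:k]
```

Only the selected files' descriptions enter the shared context, keeping prompts tractable even over data lakes with thousands of candidates.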
Benchmarks and results
All main experiments used Gemini 2.5 Pro as the base LLM with up to 20 refinement rounds. DS STAR shows large gains over prior agents:
- On DABStep, the base model alone scored 12.70% on hard tasks; DS STAR reached 45.24% on hard tasks and 87.50% on easy ones.
- Overall accuracy improvements reported versus the best alternatives: DABStep from 41.0% to 45.2%, KramaBench from 39.8% to 44.7%, DA Code from 37.0% to 38.5%.
- For KramaBench with retrieval, DS STAR plus Gemini 2.5 Pro achieved a normalized score of 44.69 versus 39.79 for the strongest baseline.
- On DA Code hard tasks, DS STAR reached 37.1% versus 32.0% for DA Agent when both used Gemini 2.5 Pro.
What this means for data automation
DS STAR reframes data agents from Text‑to‑SQL to Text‑to‑Python over mixed file types and demonstrates that explicit agent structure — analyzer summaries, iterative planning, verifier, router and debugger — is essential for end‑to‑end analytics. The approach is model‑agnostic in architecture and benefits substantially from iterative refinement, offering a practical path from demo‑style table queries to benchmarked, end‑to‑end analytics systems.