AutoDS by Allen Institute: Revolutionizing Scientific Discovery with Bayesian Surprise
AutoDS, a new engine from the Allen Institute for AI, autonomously drives scientific discovery by leveraging Bayesian surprise and large language models to generate and test hypotheses without predefined goals.
Introducing AutoDS: Autonomous Scientific Discovery Engine
The Allen Institute for Artificial Intelligence (AI2) has developed AutoDS (Autonomous Discovery via Surprisal), an engine designed for open-ended autonomous scientific discovery. Unlike traditional AI research assistants that rely on pre-defined goals or queries, AutoDS independently generates, tests, and refines hypotheses by measuring and pursuing “Bayesian surprise”, a quantitative measure of how much experimental evidence shifts prior beliefs, rather than by following human-directed objectives.
From Goal-Based Research to Curiosity-Driven Exploration
Conventional autonomous scientific discovery methods focus on answering specific research questions by generating hypotheses relevant to those predetermined problems and then validating them experimentally. AutoDS breaks away from this model by emulating human scientists’ curiosity-driven approach. It autonomously decides which questions to ask, which hypotheses to investigate, and how to build upon previous findings without any fixed goals.
Navigating a vast space of possible hypotheses and prioritizing which to test is a significant challenge. AutoDS addresses this by formalizing “surprisal” — the quantifiable change in belief regarding a hypothesis before and after obtaining empirical data.
Measuring Bayesian Surprise Using Large Language Models
At the heart of AutoDS is a framework for estimating Bayesian surprise. For each hypothesis, a large language model (LLM) such as GPT-4o serves as a probabilistic observer, stating how likely it believes the hypothesis to be true both before and after experimental evaluation. These prior and posterior beliefs are modeled with Beta distributions derived from multiple LLM-generated judgments.
AutoDS calculates the Kullback-Leibler (KL) divergence between these posterior and prior Beta distributions to quantify surprise. Only belief changes that surpass a threshold—such as shifting from likely true to likely false—are considered meaningful discoveries, helping the system focus on substantive insights rather than trivial updates.
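As a rough illustration of the computation described above, the following Python sketch builds Beta distributions from sampled yes/no judgments and evaluates the closed-form KL divergence between them. The pseudo-count construction in `beta_from_votes`, the 0.5 decision boundary in `is_surprising`, and all function names are assumptions made for this example rather than details confirmed by AI2. The LLM sampling step is replaced by hard-coded votes.

```python
import math
from typing import Sequence, Tuple

Beta = Tuple[float, float]  # (alpha, beta) parameters


def digamma(x: float) -> float:
    """Digamma function for x > 0: recurrence up to x >= 6, then an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series


def log_beta_fn(a: float, b: float) -> float:
    """Natural log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_from_votes(votes: Sequence[bool], pseudo_count: float = 1.0) -> Beta:
    """Assumed construction: add true/false vote counts to a symmetric pseudo-count."""
    yes = sum(votes)
    no = len(votes) - yes
    return pseudo_count + yes, pseudo_count + no


def kl_beta(p: Beta, q: Beta) -> float:
    """Closed-form KL(p || q) between two Beta distributions, in nats."""
    a1, b1 = p
    a2, b2 = q
    return (
        log_beta_fn(a2, b2) - log_beta_fn(a1, b1)
        + (a1 - a2) * digamma(a1)
        + (b1 - b2) * digamma(b1)
        + (a2 - a1 + b2 - b1) * digamma(a1 + b1)
    )


def beta_mean(params: Beta) -> float:
    a, b = params
    return a / (a + b)


def is_surprising(prior: Beta, posterior: Beta, boundary: float = 0.5) -> bool:
    """Assumed threshold rule: the mean belief must cross the decision boundary."""
    return (beta_mean(prior) - boundary) * (beta_mean(posterior) - boundary) < 0


if __name__ == "__main__":
    # Hard-coded stand-ins for LLM judgments sampled before and after an experiment.
    prior = beta_from_votes([True] * 8 + [False] * 2)
    posterior = beta_from_votes([True] * 1 + [False] * 9)
    print("surprise (nats):", kl_beta(posterior, prior))
    print("counts as a discovery:", is_surprising(prior, posterior))
```

Here the divergence is taken from the posterior to the prior, matching the order in which the two distributions are named in the text; a belief that only becomes more confident in the same direction yields a positive divergence but is not flagged by `is_surprising`.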
Efficient Hypothesis Exploration with Monte Carlo Tree Search
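To efficiently explore the enormous hypothesis space, AutoDS employs Monte Carlo Tree Search (MCTS) with progressive widening. Each node in the search tree represents a hypothesis, with branches corresponding to new hypotheses conditioned on earlier results. Progressive widening limits how many children a node may have as a function of how often it has been visited, so the tree branches further only where the search keeps returning. This approach balances exploring new possibilities and exploiting promising leads.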
Unlike greedy or beam search methods, which may overcommit or prematurely narrow the search, MCTS maintains high discovery efficiency under limited computational resources. Experiments across 21 datasets in biology, economics, and behavioral science show that AutoDS discovers 5–29% more surprising hypotheses than traditional search methods.
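To make the search procedure more concrete, here is a minimal Python sketch of MCTS with progressive widening over a tree of hypotheses. It is a generic textbook-style version, not AutoDS's actual code: the widening rule (at most `k * visits ** alpha` children), the UCT constant, and the `toy_propose` and `toy_reward` functions, which stand in for LLM-based hypothesis generation and the surprise score, are all assumptions for illustration.

```python
import math
import random
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    hypothesis: str
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    total_reward: float = 0.0


def can_widen(node: Node, k: float = 1.0, alpha: float = 0.5) -> bool:
    """Progressive widening: allow at most floor(k * visits^alpha) children, minimum one."""
    limit = max(1, math.floor(k * node.visits ** alpha))
    return len(node.children) < limit


def uct_child(node: Node, c: float = 1.4) -> Node:
    """Select the child with the highest upper confidence bound."""
    def score(child: Node) -> float:
        mean = child.total_reward / child.visits
        return mean + c * math.sqrt(math.log(node.visits) / child.visits)
    return max(node.children, key=score)


def mcts(root_hypothesis: str,
         propose: Callable[[Node], str],
         reward: Callable[[str], float],
         iterations: int = 50) -> Node:
    root = Node(root_hypothesis)
    for _ in range(iterations):
        node = root
        # Selection: descend while the current node is not allowed another child.
        while node.children and not can_widen(node):
            node = uct_child(node)
        # Expansion: add one new hypothesis conditioned on the selected node.
        child = Node(propose(node), parent=node)
        node.children.append(child)
        # Evaluation: score the new hypothesis (a surprise estimate in AutoDS's setting).
        r = reward(child.hypothesis)
        # Backpropagation: update statistics from the new node up to the root.
        walker: Optional[Node] = child
        while walker is not None:
            walker.visits += 1
            walker.total_reward += r
            walker = walker.parent
    return root


def count_nodes(node: Node) -> int:
    return 1 + sum(count_nodes(c) for c in node.children)


# Hypothetical stand-ins for the LLM agents and the surprise score.
_rng = random.Random(0)


def toy_propose(node: Node) -> str:
    return f"{node.hypothesis}.{len(node.children) + 1}"


def toy_reward(hypothesis: str) -> float:
    return _rng.random()


if __name__ == "__main__":
    tree = mcts("H0", toy_propose, toy_reward, iterations=30)
    print("nodes in tree:", count_nodes(tree))
    print("children of root:", [c.hypothesis for c in tree.children])
```

Each iteration adds exactly one hypothesis to the tree, so the evaluation budget is fixed by `iterations`, while the widening rule decides whether that budget goes into a new sibling or a deeper refinement.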
Modular Multi-Agent LLM Framework
AutoDS coordinates multiple specialized LLM agents, each handling a different scientific workflow component:
- Hypothesis Generation
- Experimental Design
- Programming and Execution
- Results Analysis and Revision
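The division of labor listed above can be pictured as a chain of stages that each fill in part of a shared record. The Python sketch below only illustrates that shape: the `Experiment` fields, the stage functions, and their hard-coded outputs are hypothetical placeholders, and no real LLM is called.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Experiment:
    hypothesis: str = ""
    design: str = ""
    result: str = ""
    analysis: str = ""


Agent = Callable[[Experiment], Experiment]


# Each function is a placeholder for one specialized LLM agent.
def generate_hypothesis(exp: Experiment) -> Experiment:
    exp.hypothesis = "Variable A is positively associated with variable B"
    return exp


def design_experiment(exp: Experiment) -> Experiment:
    exp.design = f"Fit a regression to test: {exp.hypothesis}"
    return exp


def program_and_execute(exp: Experiment) -> Experiment:
    exp.result = f"Executed plan: {exp.design}"
    return exp


def analyze_and_revise(exp: Experiment) -> Experiment:
    exp.analysis = f"Reviewed output: {exp.result}"
    return exp


def run_pipeline(agents: List[Agent], exp: Optional[Experiment] = None) -> Experiment:
    """Pass one experiment record through the agents in order."""
    exp = exp if exp is not None else Experiment()
    for agent in agents:
        exp = agent(exp)
    return exp


PIPELINE: List[Agent] = [
    generate_hypothesis,
    design_experiment,
    program_and_execute,
    analyze_and_revise,
]

if __name__ == "__main__":
    print(run_pipeline(PIPELINE))
```

Keeping each stage behind the same call signature means a single agent can be swapped or retried without touching the others, which is one plausible reason to structure the workflow modularly.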
To ensure the uniqueness of discoveries, AutoDS uses hierarchical clustering based on LLM-generated text embeddings and semantic equivalence checks to remove duplicate hypotheses.
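One simple way to picture the clustering step is single-linkage agglomerative clustering cut at a fixed cosine-distance threshold, which can be implemented with a union-find structure. The Python sketch below takes that route as an assumption; the source does not say which linkage or threshold AutoDS uses. The embeddings are toy vectors rather than LLM-generated ones, and the separate semantic equivalence check is omitted.

```python
import math
from typing import List, Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def cluster_single_linkage(embeddings: Sequence[Sequence[float]],
                           threshold: float) -> List[int]:
    """Label items so that any pair within `threshold` ends up in the same cluster."""
    n = len(embeddings)
    parent = list(range(n))

    def find(i: int) -> int:
        # Follow parent links to the cluster root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if cosine_distance(embeddings[i], embeddings[j]) <= threshold:
                parent[find(j)] = find(i)
    return [find(i) for i in range(n)]


def deduplicate(hypotheses: Sequence[str],
                embeddings: Sequence[Sequence[float]],
                threshold: float = 0.1) -> List[str]:
    """Keep the first hypothesis encountered in each cluster."""
    labels = cluster_single_linkage(embeddings, threshold)
    seen = set()
    kept = []
    for text, label in zip(hypotheses, labels):
        if label not in seen:
            seen.add(label)
            kept.append(text)
    return kept


if __name__ == "__main__":
    texts = [
        "Higher income predicts longer commute",
        "Commute length increases with income",
        "Sleep duration is unrelated to caffeine intake",
    ]
    toy_embeddings = [(1.0, 0.0), (0.99, 0.05), (0.0, 1.0)]
    print(deduplicate(texts, toy_embeddings))
```

In a fuller version, pairs that land in the same cluster could additionally be passed to an LLM-based equivalence check before one of them is discarded, as the text describes.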
Alignment with Human Expertise and Interpretability
Human evaluation involving experts with MS/PhD STEM backgrounds found that 67% of hypotheses AutoDS identified as surprising were also recognized as such by domain experts. The Bayesian surprise metric showed a closer correlation with human judgment than alternative metrics like predicted interestingness or utility.
Belief shifts also followed distinct patterns across scientific domains. In general, a confirmatory claim often required stronger evidence before its confirmation registered as surprising than a novel falsification did.
Practical Implementation and Future Directions
AutoDS demonstrated over 98% implementation accuracy as judged by human reviewers. While current implementations rely on API-based LLMs, which introduce latency, a programmatic search variant offers faster but less nuanced results.
Though still a research prototype with plans for open-sourcing, AutoDS’s architecture and empirical success highlight a promising future for scalable AI-driven scientific discovery.
Explore the Paper, GitHub, and Blog for more details. All credit goes to the AI2 research team.