Kosmos Unveiled: An Autonomous AI Scientist Running 12-Hour Research Campaigns
Kosmos is an autonomous system that runs up to 12-hour research campaigns, executing ~42,000 lines of code and reading ~1,500 papers to produce fully cited reports. Experts judged 79.4% of sampled statements accurate, with stronger reliability for data analysis and literature findings.
What Kosmos does
Kosmos, developed by Edison Scientific, is an autonomous discovery system designed to pursue a single open-ended research objective over long, multi-step campaigns. Provided with a dataset and a natural language goal, Kosmos cycles through data analysis, literature search, and hypothesis generation. It aggregates findings into a fully cited scientific report that links statements to either executed notebook cells or specific passages in the literature.
A typical run can last up to 12 hours, perform roughly 200 agent rollouts, execute about 42,000 lines of code, and read around 1,500 papers.
Architecture and agent roles
The system centers on a structured world model that functions as long-term memory. Unlike an ordinary context window, this world model is a queryable database of entities, relationships, experimental results, and open questions that is updated after every task. Because it is structured and queryable, information produced early in a run remains available after tens of thousands of tokens of activity.
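To make the contrast with a context window concrete, here is a minimal sketch of a structured, queryable memory. It is an illustration only, not the Kosmos implementation: the `Finding` and `WorldModel` classes, their fields, and the sample entries are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Finding:
    """One structured result written back by an agent (hypothetical schema)."""
    task_id: str
    entity: str      # e.g. a gene, metabolite, or material
    statement: str   # what the agent concluded
    source: str      # notebook cell or literature passage backing the statement


@dataclass
class WorldModel:
    """Toy long-term memory: findings and open questions kept outside any prompt."""
    findings: list[Finding] = field(default_factory=list)
    open_questions: list[str] = field(default_factory=list)

    def update(self, finding: Finding, new_questions: tuple[str, ...] = ()) -> None:
        # Called after every task so later cycles can build on earlier results.
        self.findings.append(finding)
        self.open_questions.extend(new_questions)

    def query(self, entity: str) -> list[Finding]:
        # Look up by entity rather than by position in a transcript.
        return [f for f in self.findings if f.entity == entity]


# Toy entries, not real Kosmos output.
wm = WorldModel()
wm.update(
    Finding("task-001", "nucleotide salvage", "Toy finding from an early task", "notebook:cell-12"),
    new_questions=("Is de novo synthesis also altered?",),
)
wm.update(Finding("task-002", "humidity", "Toy finding from a later task", "notebook:cell-40"))

early = wm.query("nucleotide salvage")
print(early[0].source)
```

Because lookups go through `query` rather than a scrolling transcript, the first finding stays retrievable no matter how many tasks run after it.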
Kosmos uses two primary agent types: a data analysis agent and a literature search agent. In each cycle, the system proposes up to 10 concrete tasks based on the research objective and the current world model. A task might be, for example, running a differential abundance analysis on metabolomics data or searching for biological pathways that link a candidate gene to a disease phenotype. Agents generate code and execute it in a notebook environment, or retrieve and read papers, then write structured outputs and citations back into the world model.
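The cycle described above can be sketched as a small dispatch loop. This is a simplified illustration under stated assumptions: the `run_cycle` function, the dictionary-based task format, and the stub planner and agents are hypothetical stand-ins for the LLM-driven components, and a plain list stands in for the world model.

```python
from typing import Callable

MAX_TASKS_PER_CYCLE = 10  # the per-cycle cap described in the text


def run_cycle(
    objective: str,
    memory: list[dict],
    propose: Callable[[str, list[dict]], list[dict]],
    agents: dict[str, Callable[[dict], dict]],
) -> list[dict]:
    """Run one cycle: propose tasks, dispatch each to an agent, store results."""
    tasks = propose(objective, memory)[:MAX_TASKS_PER_CYCLE]
    results = []
    for task in tasks:
        agent = agents[task["kind"]]  # "data_analysis" or "literature"
        result = agent(task)
        memory.append(result)  # write the structured output back to memory
        results.append(result)
    return results


# Hypothetical stubs standing in for the planner and the two agent types.
def stub_propose(objective: str, memory: list[dict]) -> list[dict]:
    kinds = ["data_analysis", "literature"]
    return [{"id": i, "kind": kinds[i % 2], "goal": objective} for i in range(12)]


def stub_data_agent(task: dict) -> dict:
    return {"task_id": task["id"], "kind": task["kind"], "source": f"notebook:cell-{task['id']}"}


def stub_literature_agent(task: dict) -> dict:
    return {"task_id": task["id"], "kind": task["kind"], "source": f"paper:passage-{task['id']}"}


memory: list[dict] = []
results = run_cycle(
    "toy objective",
    memory,
    stub_propose,
    {"data_analysis": stub_data_agent, "literature": stub_literature_agent},
)
print(len(results))
```

The stub planner proposes 12 tasks, so the slice is what enforces the cap of 10. A real planner would presumably condition its proposals on what is already in memory.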
A separate synthesis component traverses the world model at the end of a run to produce a final report. Every claim in the report is linked to provenance: either a Jupyter notebook cell or a specific passage in primary literature. This explicit provenance supports human auditing of individual claims rather than treating the system as a black box.
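One way to picture claim-level provenance is a record type in which every claim must carry exactly one of the two source kinds. The sketch below is an assumption about how such a structure could look, not the report format Kosmos actually uses; all class names, fields, and sample claims are hypothetical.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class NotebookCell:
    notebook: str
    cell_index: int


@dataclass(frozen=True)
class LiteraturePassage:
    reference: str
    passage: str


# A claim's provenance is one of the two source kinds named in the text.
Provenance = Union[NotebookCell, LiteraturePassage]


@dataclass(frozen=True)
class Claim:
    text: str
    provenance: Provenance


def render_report(claims: list[Claim]) -> str:
    """Render each claim with a bracketed pointer to its source."""
    lines = []
    for number, claim in enumerate(claims, start=1):
        src = claim.provenance
        if isinstance(src, NotebookCell):
            link = f"{src.notebook}, cell {src.cell_index}"
        else:
            link = f'{src.reference}: "{src.passage}"'
        lines.append(f"{number}. {claim.text} [{link}]")
    return "\n".join(lines)


# Placeholder claims, not taken from any Kosmos report.
claims = [
    Claim("Toy analysis claim", NotebookCell("analysis.ipynb", 7)),
    Claim("Toy literature claim", LiteraturePassage("Placeholder reference", "placeholder passage")),
]
report = render_report(claims)
print(report)
```

Since a `Claim` cannot be constructed without a provenance value, a reviewer can always walk from any sentence in the rendered report back to the cell or passage behind it.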
Evaluation: accuracy and human-equivalent time
To assess report quality, the team sampled 102 statements from three representative Kosmos reports and asked domain experts to classify each statement as supported or refuted. Overall, 79.4 percent of sampled statements (81 of 102) were judged accurate. Broken down by type, data analysis statements were the most reliable at about 85.5 percent accuracy, literature-derived statements were correct about 82.1 percent of the time, and synthesis statements that integrate multiple pieces of evidence were the least reliable at around 57.9 percent.
The authors estimate human-equivalent effort by assuming 2 hours for a typical data analysis trajectory and 15 minutes to read a paper. Counting trajectories and papers per run gives roughly 4.1 expert months of effort for a typical Kosmos run (assuming a 40-hour work week). In a collaborator survey, seven scientists rated a 20-cycle Kosmos run as equivalent to approximately 6.14 months of their own work on the same objective; perceived effort scales roughly linearly with cycle count up to 20.
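The effort estimate is simple arithmetic, sketched below. The per-item times and the 40-hour week come from the text; the weeks-per-month conversion (52/12) is an assumption, and so are the example inputs. The article does not say how many of the roughly 200 rollouts are data analysis trajectories, so the example treats all 200 as such, which gives an upper bound rather than the authors' figure.

```python
HOURS_PER_TRAJECTORY = 2.0   # from the text: one data analysis trajectory
HOURS_PER_PAPER = 0.25       # from the text: 15 minutes per paper
HOURS_PER_WEEK = 40.0        # from the text: 40-hour work week
WEEKS_PER_MONTH = 52 / 12    # assumption: average calendar month


def expert_hours(n_trajectories: int, n_papers: int) -> float:
    """Total human-equivalent hours for a run."""
    return n_trajectories * HOURS_PER_TRAJECTORY + n_papers * HOURS_PER_PAPER


def expert_months(n_trajectories: int, n_papers: int) -> float:
    """Convert human-equivalent hours into working months."""
    weeks = expert_hours(n_trajectories, n_papers) / HOURS_PER_WEEK
    return weeks / WEEKS_PER_MONTH


# Upper-bound inputs: all ~200 rollouts counted as data analysis trajectories.
upper_hours = expert_hours(200, 1500)
upper_months = expert_months(200, 1500)
print(f"{upper_hours:.0f} hours, about {upper_months:.1f} months")
```

With these inputs the total is 775 hours, or about 4.5 months. The reported 4.1 months is lower, which would be expected if some of the rollouts are literature searches rather than 2-hour analysis trajectories.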
Representative discoveries
Kosmos was applied to seven case studies across metabolomics, materials science, neuroscience, statistical genetics, and neurodegeneration. In three cases it independently reproduced prior human results without accessing the original preprints during the run. In four cases it proposed mechanisms that the paper authors described as novel contributions.
Examples include:
- Metabolomics: In a mouse hypothermia experiment, Kosmos identified nucleotide metabolism as the dominant altered pathway in hypothermic brains, concluding that nucleotide salvage pathways dominate over de novo synthesis during protective hypothermia. This matched an independent human analysis that was unpublished at the time.
- Materials science: From environmental logs of perovskite solar cell fabrication, Kosmos recovered the human finding that absolute humidity during thermal annealing is the main determinant of device efficiency, and it identified a critical humidity threshold beyond which devices fail.
- Neuroscience: Given neuron-level reconstructions across species, Kosmos fitted distributions for neurite length, degree, and synapse counts. It concluded that degree and synapse distributions fit log-normal models better than scale-free ones, and it recovered power-law scaling between neurite length and synapse count.
Other discoveries described as novel include a Mendelian randomization analysis implicating circulating superoxide dismutase 2 in protection against myocardial fibrosis, a Mechanistic Ranking Score for type 2 diabetes loci, a proteomic pseudotime analysis in Alzheimer's disease, and a large-scale single-nucleus transcriptomic link between flippase loss and neuronal vulnerability.
Strengths, limits, and human role
Kosmos demonstrates measurable gains in reasoning depth, reproducibility, and traceability by combining a structured world model with domain-agnostic agent workflows. It excels at data analysis and literature synthesis with explicit provenance that enables verification of claims. However, synthesis statements that combine evidence are less reliable than individual analysis or literature statements, and human scientists remain essential for dataset selection, objective framing, and validation of novel hypotheses.
Kosmos is a powerful template for AI-accelerated science that augments human researchers rather than replacing them. For technical details and the full paper, see the arXiv publication linked by the authors.