Build an Agentic AI for Literature Analysis
Create an AI framework that analyzes literature, generates hypotheses, plans experiments, simulates results, and reports findings.
Introduction
In this tutorial, we build a complete scientific discovery agent step by step and see how its components fit together into a coherent research workflow. We begin by loading our literature corpus and constructing the retrieval and LLM modules, then assemble agents that search papers, generate hypotheses, design experiments, and produce structured reports.
Literature Corpus and Dependencies
import sys, subprocess

def install_deps():
    # Install the third-party packages this tutorial imports.
    # Assumes PyTorch is already available; the AutoModel classes need it.
    pkgs = ["transformers", "scikit-learn", "numpy"]
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q"] + pkgs)

try:
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    import numpy as np
except ImportError:
    # Install on first failure, then retry the same imports.
    install_deps()
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    import numpy as np

# Fix the NumPy seed so the simulated experiment is reproducible.
np.random.seed(42)

LITERATURE = [...] # Literature Data

# One text per paper (abstract followed by title), vectorized with TF-IDF.
corpus_texts = [p["abstract"] + " " + p["title"] for p in LITERATURE]
vectorizer = TfidfVectorizer(stop_words="english")
corpus_matrix = vectorizer.fit_transform(corpus_texts)

# Small instruction-tuned seq2seq model used for all text generation.
MODEL_NAME = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

We lay the foundation for our scientific agent by loading the libraries and preparing the literature corpus. With the TF-IDF matrix and the language model in place, we have the computational backbone for everything that follows.
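The later agents need a way to turn a prompt into text with the loaded model. The following is a minimal sketch of such a wrapper, not the original implementation: the name `generate_text` and the `max_new_tokens` default are assumptions made for illustration. It relies only on the tokenizer call, `model.generate`, and `tokenizer.decode`, and it takes the tokenizer and model as arguments so it can be exercised without loading any weights.

```python
def generate_text(prompt, tokenizer, model, max_new_tokens=64):
    """Run one prompt through a seq2seq model and return the decoded text.

    Hypothetical helper: the name and the default length are illustrative.
    """
    # Tokenize to PyTorch tensors; truncate prompts that exceed the model limit.
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    # Generate at most max_new_tokens new tokens.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Keep the first (only) sequence and drop special tokens.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

With the objects loaded above, a call would look like `generate_text("State one testable hypothesis about ...", tokenizer, model)`, where the prompt text is whatever the calling agent builds.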
Literature Search Component
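from dataclasses import dataclass
from typing import Any, Dict, List

@dataclass
class PaperHit:
    paper: Dict[str, Any]
    score: float

class LiteratureAgent:
    def __init__(self, vectorizer, corpus_matrix, papers: List[Dict[str, Any]]):
        # search() below reads self.vectorizer, self.corpus_matrix, and self.papers.
        ...

    def search(self, query: str, k: int = 3) -> List[PaperHit]:
        # Project the query into the same TF-IDF space as the corpus.
        q_vec = self.vectorizer.transform([query])
        sims = cosine_similarity(q_vec, self.corpus_matrix)[0]
        # Sort by descending similarity and keep the k best papers.
        idxs = np.argsort(-sims)[:k]
        hits = [PaperHit(self.papers[i], float(sims[i])) for i in idxs]
        return hits

We implement the literature-search component of our agent: it ranks papers by the cosine similarity between the TF-IDF vector of the query and each row of the corpus matrix, and returns the top k as PaperHit objects. This grounds the rest of the system in the closest-matching prior work.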
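To make the ranking step concrete, here is a small standard-library sketch of TF-IDF weighting followed by cosine-similarity top-k selection. It is a simplified stand-in for what `TfidfVectorizer` and `cosine_similarity` do, not a reimplementation: it splits on whitespace, removes no stop words, and the function names and toy strings are hypothetical placeholders rather than entries from the tutorial's corpus. The idf uses the smoothed form ln((1 + n) / (1 + df)) + 1.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, plus the idf table."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    vectors = [{t: c * idf[t] for t, c in Counter(tokens).items()}
               for tokens in tokenized]
    return vectors, idf

def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query, docs, k=3):
    """Return (index, score) pairs for the k documents closest to the query."""
    vectors, idf = tfidf_vectors(docs)
    # Query terms never seen in the corpus carry no weight and are dropped.
    counts = Counter(t for t in query.lower().split() if t in idf)
    q_vec = {t: c * idf[t] for t, c in counts.items()}
    scores = [cosine(q_vec, v) for v in vectors]
    order = sorted(range(len(docs)), key=lambda i: -scores[i])[:k]
    return [(i, scores[i]) for i in order]

# Toy placeholder texts, invented for this example only.
docs = [
    "protein folding with graph neural networks",
    "contrastive learning for medical imaging",
    "graph neural networks for molecule property prediction",
]
ranking = top_k("graph neural networks molecule", docs, k=2)
```

Here the third text shares four query terms and the first shares three, so `ranking` lists index 2 ahead of index 0. The second text shares no query term and would score zero.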
Designing Experiments
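@dataclass
class ExperimentPlan:
    ...

class ExperimentAgent:
    def design_experiment(self, question: str, hypothesis: str, hits: List[PaperHit]) -> ExperimentPlan:
        ...

    def run_experiment(self, plan: ExperimentPlan) -> ExperimentResult:
        # Simulated outcome: a noisy baseline plus a non-negative gain.
        # ExperimentResult (not shown here) pairs the plan with its metrics.
        base = 0.78 + 0.02 * np.random.randn()
        gain = abs(0.05 + 0.01 * np.random.randn())
        metrics = {"baseline_AUROC": round(base, 3), "augmented_AUROC": round(base + gain, 3)}
        return ExperimentResult(plan=plan, metrics=metrics)

We design experiments from the retrieved literature and the generated hypothesis, turning them into an actionable plan, and then simulate the outcome. Note that run_experiment does not train anything: it draws a baseline AUROC around 0.78 and adds a non-negative gain of roughly 0.05, so the reported metrics are synthetic.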
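The simulation logic can be isolated in a few lines. The sketch below is an illustrative variant, not the tutorial's code: it swaps `np.random` for a local `random.Random` generator from the standard library, so its numbers will differ from those produced after `np.random.seed(42)`, and the function name `simulate_auroc` is hypothetical. It shows two properties of the formulas above: a fixed seed makes the draw repeatable, and `abs()` guarantees the augmented score is never below the baseline.

```python
import random

def simulate_auroc(seed=42):
    """Draw a synthetic baseline AUROC and an augmented AUROC.

    Hypothetical helper mirroring the constants used in run_experiment.
    """
    # A local generator keeps the draw independent of any global seed.
    rng = random.Random(seed)
    # Baseline centred on 0.78 with a standard deviation of 0.02.
    base = 0.78 + 0.02 * rng.gauss(0.0, 1.0)
    # Gain centred on 0.05; abs() keeps it non-negative.
    gain = abs(0.05 + 0.01 * rng.gauss(0.0, 1.0))
    return {
        "baseline_AUROC": round(base, 3),
        "augmented_AUROC": round(base + gain, 3),
    }

first = simulate_auroc(seed=42)
second = simulate_auroc(seed=42)
```

Because the gain can never be negative, this simulation cannot report a hypothesis that fails. That is convenient for demonstrating the pipeline, but it means the output should not be read as evidence.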
Report Generation
class ReportAgent:
    def write_report(self, question: str, hits: List[PaperHit], plan: ExperimentPlan, result: ExperimentResult) -> str:
        ...

We generate a full research-style report using the LLM by assembling the hypothesis, protocol, results, and related work into a structured document.
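The body of write_report is not shown, so the following is only a sketch of the assembly step under stated assumptions: the function name `assemble_report`, its argument names, and the section labels are all hypothetical. It lays the pieces out as plain text without calling the model. In a pipeline like this one, such a skeleton could serve as the prompt handed to the LLM or as a deterministic fallback.

```python
def assemble_report(question, hypothesis, protocol_steps, metrics, related_titles):
    """Lay out the pipeline outputs as a plain-text report skeleton.

    Hypothetical helper: section labels and argument names are illustrative.
    """
    lines = ["Research question", question, ""]
    lines += ["Hypothesis", hypothesis, ""]
    lines += ["Protocol"]
    # Number the protocol steps starting from 1.
    lines += [f"{i}. {step}" for i, step in enumerate(protocol_steps, start=1)]
    lines += ["", "Results"]
    # Print every metric with three decimals, matching the rounding above.
    lines += [f"{name}: {value:.3f}" for name, value in metrics.items()]
    lines += ["", "Related work"]
    lines += [f"- {title}" for title in related_titles]
    return "\n".join(lines)
```

Keeping the assembly separate from the generation call makes it easy to inspect exactly what the model is given before any text is produced.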
Conclusion
In conclusion, we see how a compact codebase can grow into a functioning AI co-researcher that searches, reasons, simulates, and summarizes. Each component feeds the next: retrieval grounds the hypothesis, the hypothesis shapes the experiment plan, and the simulated results flow into the report. Because the architecture is simple, we can extend or replace any one agent to support further scientific exploration.