Pre-Launch RAG Testing: Generate Synthetic Goldens with DeepEval
Why evaluate a RAG pipeline?
Retrieval-Augmented Generation (RAG) systems combine a retriever and an LLM to ground answers in external documents. Without systematic evaluation, you can’t confidently know whether the retriever returns relevant context, whether the model hallucinates, or whether the provided context size and composition are optimal.
Install dependencies
This tutorial uses DeepEval to generate a synthetic evaluation dataset. First, install the required Python packages:
!pip install deepeval chromadb tiktoken pandas
OpenAI API key
DeepEval leverages external language models to synthesize questions and expected outputs, so you’ll need an OpenAI API key for this walkthrough. Generate a key on the OpenAI API Key Management page and make sure billing is enabled on your account (a small initial payment may be required).
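The simplest way to make the key available to DeepEval is to expose it through the OPENAI_API_KEY environment variable, for example (the value below is only a placeholder):

import os

# Expose the key to DeepEval and the OpenAI client for this session.
# Replace the placeholder with your own key, or set the variable in your shell instead.
os.environ["OPENAI_API_KEY"] = "sk-..."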
Prepare a source document
Create a text variable that contains diverse factual information across topics (biology, physics, history, space exploration, environment, medicine, computing, ancient civilizations). DeepEval’s synthesizer will split the text into semantically coherent chunks, choose useful contexts, and create synthetic “golden” pairs (input, expected_output) for evaluation.
text = """
Crows are among the smartest birds, capable of using tools and recognizing human faces even after years.
In contrast, the archerfish displays remarkable precision, shooting jets of water to knock insects off branches.
Meanwhile, in the world of physics, superconductors can carry electric current with zero resistance -- a phenomenon
discovered over a century ago but still unlocking new technologies like quantum computers today.
Moving to history, the Library of Alexandria was once the largest center of learning, but much of its collection was
lost in fires and wars, becoming a symbol of human curiosity and fragility. In space exploration, the Voyager 1 probe,
launched in 1977, has now left the solar system, carrying a golden record that captures sounds and images of Earth.
Closer to home, the Amazon rainforest produces roughly 20% of the world's oxygen, while coral reefs -- often called the
"rainforests of the sea" -- support nearly 25% of all marine life despite covering less than 1% of the ocean floor.
In medicine, MRI scanners use strong magnetic fields and radio waves
to generate detailed images of organs without harmful radiation.
In computing, Moore's Law observed that the number of transistors
on microchips doubles roughly every two years, though recent advances
in AI chips have shifted that trend.
The Mariana Trench is the deepest part of Earth's oceans,
reaching nearly 11,000 meters below sea level, deeper than Mount Everest is tall.
Ancient civilizations like the Sumerians and Egyptians invented
mathematical systems thousands of years before modern algebra emerged.
"""
with open("example.txt", "w") as f:
    f.write(text)
Generate synthetic evaluation data
Use DeepEval’s Synthesizer to generate synthetic goldens automatically from your document. In the example below, the lightweight model “gpt-4.1-nano” generates question–answer pairs along with context snippets drawn from the document.
from deepeval.synthesizer import Synthesizer
synthesizer = Synthesizer(model="gpt-4.1-nano")
# Generate synthetic goldens from your document
synthesizer.generate_goldens_from_docs(
    document_paths=["example.txt"],
    include_expected_output=True
)

# Print generated results
for golden in synthesizer.synthetic_goldens[:3]:
    print(golden, "\n")
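Since pandas is already installed, it can be convenient to tabulate the goldens for review. The sketch below assumes each golden exposes input, expected_output, and context attributes, which is how DeepEval’s Golden objects are structured in recent versions:

import pandas as pd

# Collect the generated goldens into a DataFrame for quick inspection
# (assumes the attributes input / expected_output / context on each golden).
goldens = synthesizer.synthetic_goldens
df = pd.DataFrame({
    "input": [g.input for g in goldens],
    "expected_output": [g.expected_output for g in goldens],
    "context": [g.context for g in goldens],
})
print(df.head())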
Control input complexity with EvolutionConfig
DeepEval can evolve and diversify inputs to create richer, more challenging evaluation cases. Use EvolutionConfig to weight the different evolution types (REASONING, MULTICONTEXT, COMPARATIVE, HYPOTHETICAL, IN_BREADTH) and to set how many evolution steps are applied to each generated input (num_evolutions).
from deepeval.synthesizer.config import EvolutionConfig, Evolution
evolution_config = EvolutionConfig(
    evolutions={
        Evolution.REASONING: 1/5,
        Evolution.MULTICONTEXT: 1/5,
        Evolution.COMPARATIVE: 1/5,
        Evolution.HYPOTHETICAL: 1/5,
        Evolution.IN_BREADTH: 1/5,
    },
    num_evolutions=3
)
synthesizer = Synthesizer(evolution_config=evolution_config)
synthesizer.generate_goldens_from_docs(["example.txt"])
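To reuse the evolved goldens across runs, persist them to disk. One option, assuming your installed DeepEval version exposes the save_as helper on Synthesizer, is:

# Persist the generated goldens so the same evaluation set can be reused later.
# "json" and "csv" are supported file types in recent DeepEval releases.
synthesizer.save_as(
    file_type="json",
    directory="./synthetic_data"
)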
Why synthetic goldens help
Synthetic datasets let you establish a baseline and a continuous evaluation loop before any real users interact with your system. Evolution-guided goldens probe different reasoning styles, multi-context comparisons, and hypothetical scenarios. With metrics such as grounding and context coverage, you can iteratively refine the retriever and the generation model to reduce hallucinations and improve relevance.
Integrating into a continuous improvement loop
Once you generate synthetic goldens, feed them into your RAG pipeline and measure performance (retrieval accuracy, grounding, answer quality). Use those metrics to tune retriever settings, the retriever-to-model handoff, and your prompting and context-assembly strategies. This creates an iterative improvement loop that builds confidence in production behavior long before launch.
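As a sketch of what that loop can look like with DeepEval’s built-in metrics, the example below assumes a hypothetical rag_pipeline(question) helper that returns a generated answer together with the retrieved chunks; FaithfulnessMetric and AnswerRelevancyMetric ship with DeepEval.

from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

test_cases = []
for golden in synthesizer.synthetic_goldens:
    # rag_pipeline is a hypothetical stand-in for your own retriever + generator.
    answer, retrieved_chunks = rag_pipeline(golden.input)
    test_cases.append(LLMTestCase(
        input=golden.input,
        actual_output=answer,
        expected_output=golden.expected_output,
        retrieval_context=retrieved_chunks,
    ))

# Faithfulness checks that the answer is grounded in the retrieved context;
# answer relevancy checks that the answer actually addresses the question.
evaluate(test_cases, metrics=[FaithfulnessMetric(), AnswerRelevancyMetric()])

The scores from these runs become the baseline you track as you tune chunking, retrieval depth, and prompts.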
Resources
DeepEval’s GitHub repository hosts tutorials, code, and notebooks that help you reproduce and extend these examples. The framework makes it practical to benchmark your RAG pipeline before launch and to keep testing rigorously as the system evolves.