Build a Fully Traced Local LLM QA Pipeline with Opik for Reproducible, Transparent Evaluations
A hands-on tutorial showing how to build, trace, and evaluate a local LLM QA pipeline with Opik and a lightweight Hugging Face model, producing reproducible metrics and visual traces.
Setup and initialization
Start by installing required libraries and initializing Opik. The snippet below loads core modules, detects the device, and configures the project so that every trace flows into the correct workspace.
!pip install -q opik transformers accelerate torch
import torch
from transformers import pipeline
import textwrap
import opik
from opik import Opik, Prompt, track
from opik.evaluation import evaluate
from opik.evaluation.metrics import Equals, LevenshteinRatio
device = 0 if torch.cuda.is_available() else -1
print("Using device:", "cuda" if device == 0 else "cpu")
opik.configure()
PROJECT_NAME = "opik-hf-tutorial"
This lays the foundation for a reproducible tracing workspace.
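If you prefer to skip the interactive prompt, for example on a self-hosted Opik deployment, configure() can also be called with explicit arguments. The keyword names below (use_local, api_key, workspace) are assumptions based on common SDK options, so verify them against your installed opik version.
# Sketch of non-interactive configuration (keyword names are assumptions;
# check your opik version before relying on them):
# opik.configure(use_local=True)  # self-hosted instance
# opik.configure(api_key="YOUR_API_KEY", workspace="your-workspace")  # hosted backend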
Local model and generator
Load a lightweight Hugging Face model and create a helper to generate text locally without external APIs. This provides a consistent generation layer for the pipeline.
llm = pipeline(
"text-generation",
model="distilgpt2",
device=device,
)
def hf_generate(prompt: str, max_new_tokens: int = 80) -> str:
    result = llm(
        prompt,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.3,
        pad_token_id=llm.tokenizer.eos_token_id,
    )[0]["generated_text"]
    # Keep only the newly generated continuation, dropping the echoed prompt.
    return result[len(prompt):].strip()
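A quick call confirms the helper returns only the newly generated continuation rather than the echoed prompt; because sampling is enabled, the text will vary between runs.
# Smoke test: output varies run to run since do_sample=True.
print(hf_generate("Opik is an open-source platform that", max_new_tokens=20))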
Structured prompts for planning and answering
Define clear prompt templates for the planning and answering phases. Using explicit templates helps maintain consistency and makes it easier to inspect model behavior under structured prompting.
plan_prompt = Prompt(
name="hf_plan_prompt",
prompt=textwrap.dedent("""
You are an assistant that creates a plan to answer a question
using ONLY the given context.
Context:
{{context}}
Question:
{{question}}
Return exactly 3 bullet points as a plan.
""").strip(),
)
answer_prompt = Prompt(
name="hf_answer_prompt",
prompt=textwrap.dedent("""
You answer based only on the given context.
Context:
{{context}}
Question:
{{question}}
Plan:
{{plan}}
Answer the question in 2–4 concise sentences.
""").strip(),
)
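To inspect exactly what the model will see, you can render a template with toy values; format() fills the {{context}} and {{question}} placeholders, just as the traced functions do later.
# Preview the rendered planning prompt with placeholder values.
preview = plan_prompt.format(
    context="Opik traces LLM calls and spans.",
    question="What does Opik trace?",
)
print(preview)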
Minimal document store and retrieval
Create a small document store and a retrieval function that Opik tracks as a tool. This simulates a RAG-style workflow without an external vector DB.
DOCS = {
"overview": """
Opik is an open-source platform for debugging, evaluating,
and monitoring LLM and RAG applications. It provides tracing,
datasets, experiments, and evaluation metrics.
""",
"tracing": """
Tracing in Opik logs nested spans, LLM calls, token usage,
feedback scores, and metadata to inspect complex LLM pipelines.
""",
"evaluation": """
Opik evaluations are defined by datasets, evaluation tasks,
scoring metrics, and experiments that aggregate scores,
helping detect regressions or issues.
""",
}
@track(project_name=PROJECT_NAME, type="tool", name="retrieve_context")
def retrieve_context(question: str) -> str:
    # Simple keyword routing stands in for a real vector store.
    q = question.lower()
    if "trace" in q or "span" in q:
        return DOCS["tracing"]
    if "metric" in q or "dataset" in q or "evaluate" in q:
        return DOCS["evaluation"]
    return DOCS["overview"]
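A couple of direct calls show how the keyword routing selects a document; because the function is decorated with @track, each call also appears as a tool span in the project.
# Spot-check the routing logic; each call is logged as a tool span.
print(retrieve_context("How do traces and spans work?")[:60])
print(retrieve_context("Which metrics and datasets are involved?")[:60])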
Traced pipeline functions
Wrap pipeline components with Opik's tracking decorators so each function span and LLM call is traced and visible in the Opik dashboard.
@track(project_name=PROJECT_NAME, type="llm", name="plan_answer")
def plan_answer(context: str, question: str) -> str:
    rendered = plan_prompt.format(context=context, question=question)
    return hf_generate(rendered, max_new_tokens=80)

@track(project_name=PROJECT_NAME, type="llm", name="answer_from_plan")
def answer_from_plan(context: str, question: str, plan: str) -> str:
    rendered = answer_prompt.format(
        context=context,
        question=question,
        plan=plan,
    )
    return hf_generate(rendered, max_new_tokens=120)

@track(project_name=PROJECT_NAME, type="general", name="qa_pipeline")
def qa_pipeline(question: str) -> str:
    # Retrieval -> planning -> answering, each step traced as a nested span.
    context = retrieve_context(question)
    plan = plan_answer(context, question)
    answer = answer_from_plan(context, question, plan)
    return answer
print("Sample answer:\n", qa_pipeline("What does Opik help developers do?"))
This connects retrieval, planning, and answering into a fully traced QA pipeline.
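Running a few more questions produces one trace per call, each containing the nested retrieval, planning, and answering spans, which makes it easy to compare behavior across inputs in the dashboard.
# Each call creates a separate trace with nested spans for every step.
for q in [
    "What does tracing in Opik log?",
    "What are the components of an Opik evaluation?",
]:
    print("\nQ:", q)
    print("A:", qa_pipeline(q))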
Dataset creation inside Opik
Create and populate a dataset to serve as ground truth for evaluation. Insert question–answer pairs that cover different aspects of Opik's features.
client = Opik()
dataset = client.get_or_create_dataset(
name="HF_Opik_QA_Dataset",
description="Small QA dataset for HF + Opik tutorial",
)
dataset.insert([
{
"question": "What kind of platform is Opik?",
"context": DOCS["overview"],
"reference": "Opik is an open-source platform for debugging, evaluating and monitoring LLM and RAG applications.",
},
{
"question": "What does tracing in Opik log?",
"context": DOCS["tracing"],
"reference": "Tracing logs nested spans, LLM calls, token usage, feedback scores, and metadata.",
},
{
"question": "What are the components of an Opik evaluation?",
"context": DOCS["evaluation"],
"reference": "An Opik evaluation uses datasets, evaluation tasks, scoring metrics and experiments that aggregate scores.",
},
])
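Before launching the experiment, it can help to eyeball a single item: run the pipeline on one dataset question and compare the output with its reference by hand. The snippet reuses the values inserted above rather than reading them back through the dataset API.
# Manual spot-check against one ground-truth pair (reusing the values above).
spot_question = "What does tracing in Opik log?"
spot_reference = "Tracing logs nested spans, LLM calls, token usage, feedback scores, and metadata."
print("Model output:", qa_pipeline(spot_question))
print("Reference:   ", spot_reference)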
Define evaluation task and metrics
Select metrics and define the evaluation task that runs the QA pipeline on dataset items, returning outputs formatted for scoring.
equals_metric = Equals()
lev_metric = LevenshteinRatio()
def evaluation_task(item: dict) -> dict:
    # Run the traced pipeline on the dataset question and pair its output
    # with the ground-truth reference expected by the scoring metrics.
    output = qa_pipeline(item["question"])
    return {
        "output": output,
        "reference": item["reference"],
    }
Note: the evaluation task runs the traced pipeline for each dataset item and packages results for Opik's evaluation engine.
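A dry run on a single hand-built item confirms that the task returns the output and reference keys the metrics expect before handing it to evaluate().
# Dry-run the evaluation task on one item to verify its output shape.
preview_item = {
    "question": "What kind of platform is Opik?",
    "reference": "Opik is an open-source platform for debugging, evaluating and monitoring LLM and RAG applications.",
}
preview_output = evaluation_task(preview_item)
print(sorted(preview_output.keys()))
print(preview_output["output"][:120])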
Running the experiment and aggregating results
Use Opik's evaluate function to run a full experiment. Setting task_threads=1 runs the task sequentially, which keeps execution simple and easy to follow in an interactive environment.
evaluation_result = evaluate(
dataset=dataset,
task=evaluation_task,
scoring_metrics=[equals_metric, lev_metric],
experiment_name="HF_Opik_QA_Experiment",
project_name=PROJECT_NAME,
task_threads=1,
)
print("\nExperiment URL:", evaluation_result.experiment_url)
After the experiment finishes, aggregate scores to inspect performance and identify areas for improvement.
agg = evaluation_result.aggregate_evaluation_scores()
print("\nAggregated scores:")
for metric_name, stats in agg.aggregated_scores.items():
    print(metric_name, "=>", stats)
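The same heuristic metrics can also be applied to an individual output/reference pair outside an experiment, which is handy when debugging a single failing item. The score(output=..., reference=...) call and the value attribute on its result are assumptions about Opik's heuristic-metric interface, so confirm them against your SDK version.
# Sketch: score one pair directly (signature and .value are assumptions;
# verify against your installed opik version).
single = lev_metric.score(
    output="Opik is an open-source platform for evaluating LLM applications.",
    reference="Opik is an open-source platform for debugging, evaluating and monitoring LLM and RAG applications.",
)
print("LevenshteinRatio on one pair:", single.value)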
Observations
This tutorial demonstrates a compact but fully instrumented LLM evaluation workflow using a local model and Opik. With tracing, structured prompts, a managed dataset, and explicit metrics, you gain transparent, measurable, and reproducible evidence about pipeline behavior and model output quality.