Mastering LLM Evaluation with MLflow: A Step-by-Step Guide Using Google Gemini
This tutorial shows how to use MLflow to evaluate Google Gemini's responses on factual prompts with integrated metrics, combining OpenAI and Google APIs for comprehensive LLM assessment.
Introduction to MLflow for LLM Evaluation
MLflow is a versatile open-source platform designed to manage the full machine learning lifecycle. Traditionally, it has been used for tracking experiments, logging parameters, and managing deployments. Recently, MLflow expanded its capabilities to include evaluation support for Large Language Models (LLMs).
Setting Up Dependencies
This tutorial demonstrates evaluating the Google Gemini model's performance on fact-based prompts using MLflow. The process also involves OpenAI's API since MLflow uses GPT-based models for judging metrics like answer similarity and faithfulness.
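If you want to control which judge model these metrics use, the GenAI metric constructors accept a model URI. A minimal sketch, assuming the "openai:/<model>" URI scheme supported by MLflow (the specific model name below is illustrative, not prescribed by this tutorial):
import mlflow
# Assumption: pin the LLM judge explicitly; "openai:/gpt-4o-mini" is an illustrative choice.
similarity_metric = mlflow.metrics.genai.answer_similarity(model="openai:/gpt-4o-mini")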
You need API keys from both OpenAI and Google Gemini:
- OpenAI API key: https://platform.openai.com/settings/organization/api-keys
- Google Gemini API key: https://ai.google.dev/gemini-api/docs
Install necessary libraries with:
pip install mlflow openai pandas google-genai
Set environment variables for the API keys securely:
import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass('Enter OpenAI API Key:')
os.environ["GOOGLE_API_KEY"] = getpass('Enter Google API Key:')Preparing Evaluation Dataset
Create a dataset with factual prompts and their correct answers. This dataset will be used to compare Gemini's outputs against known truths.
import pandas as pd
eval_data = pd.DataFrame(
    {
        "inputs": [
            "Who developed the theory of general relativity?",
            "What are the primary functions of the liver in the human body?",
            "Explain what HTTP status code 404 means.",
            "What is the boiling point of water at sea level in Celsius?",
            "Name the largest planet in our solar system.",
            "What programming language is primarily used for developing iOS apps?",
        ],
        "ground_truth": [
            "Albert Einstein developed the theory of general relativity.",
            "The liver helps in detoxification, protein synthesis, and production of biochemicals necessary for digestion.",
            "HTTP 404 means 'Not Found' -- the server can't find the requested resource.",
            "The boiling point of water at sea level is 100 degrees Celsius.",
            "Jupiter is the largest planet in our solar system.",
            "Swift is the primary programming language used for iOS app development.",
        ],
    }
)
Generating Responses from Gemini
Define a helper function that queries the Gemini 1.5 Flash model through the Google Gen AI SDK (google-genai), then apply it to each prompt.
from google import genai
client = genai.Client()
def gemini_completion(prompt: str) -> str:
    response = client.models.generate_content(
        model="gemini-1.5-flash",
        contents=prompt
    )
    return response.text.strip()
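# Optional sanity check on a single prompt before running the full batch
# (uses only the gemini_completion helper defined above):
print(gemini_completion("Name the largest planet in our solar system."))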
eval_data["predictions"] = eval_data["inputs"].apply(gemini_completion)Evaluating Gemini Outputs with MLflow
Run an MLflow evaluation session comparing Gemini's predictions against the ground truth using metrics like answer_similarity, exact_match, latency, and token_count.
import mlflow
mlflow.set_tracking_uri("mlruns")
mlflow.set_experiment("Gemini Simple Metrics Eval")
with mlflow.start_run():
    results = mlflow.evaluate(
        model_type="question-answering",
        data=eval_data,
        predictions="predictions",
        targets="ground_truth",
        extra_metrics=[
            mlflow.metrics.genai.answer_similarity(),
            mlflow.metrics.exact_match(),
            mlflow.metrics.latency(),
            mlflow.metrics.token_count()
        ]
    )
    print("Aggregated Metrics:")
    print(results.metrics)
    # Save detailed table
    results.tables["eval_results_table"].to_csv("gemini_eval_results.csv", index=False)
Inspecting Detailed Results
Load and display the saved detailed evaluation results for thorough analysis.
results = pd.read_csv('gemini_eval_results.csv')
pd.set_option('display.max_colwidth', None)
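# Optionally surface rows the LLM judge scored low. The score column name below is an
# assumption (it can vary by MLflow version); check results.columns for the exact name.
score_col = "answer_similarity/v1/score"
if score_col in results.columns:
    print(results[results[score_col] <= 3])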
results
This practical guide enables efficient and transparent evaluation of LLMs like Gemini using MLflow's integrated metrics and APIs.