Mastering LLM Evaluation with MLflow: A Step-by-Step Guide Using Google Gemini
This tutorial shows how to use MLflow to evaluate Google Gemini's responses on factual prompts with integrated metrics, combining OpenAI and Google APIs for comprehensive LLM assessment.
Introduction to MLflow for LLM Evaluation
MLflow is a versatile open-source platform designed to manage the full machine learning lifecycle. Traditionally, it has been used for tracking experiments, logging parameters, and managing deployments. Recently, MLflow expanded its capabilities to include evaluation support for Large Language Models (LLMs).
Setting Up Dependencies
This tutorial demonstrates evaluating the Google Gemini model's performance on fact-based prompts using MLflow. The process also involves OpenAI's API since MLflow uses GPT-based models for judging metrics like answer similarity and faithfulness.
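If you want to control which judge model these metrics use, the GenAI metric constructors accept a model URI. A minimal sketch, assuming the "openai:/<model>" URI scheme supported by MLflow (the specific model name below is illustrative, not prescribed by this tutorial):
import mlflow
# Assumption: pin the LLM judge explicitly; "openai:/gpt-4o-mini" is an illustrative choice.
similarity_metric = mlflow.metrics.genai.answer_similarity(model="openai:/gpt-4o-mini")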
You need API keys from both OpenAI and Google Gemini:
- OpenAI API key: https://platform.openai.com/settings/organization/api-keys
- Google Gemini API key: https://ai.google.dev/gemini-api/docs
Install necessary libraries with:
pip install mlflow openai pandas google-genai
Set environment variables for the API keys securely:
import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass('Enter OpenAI API Key:')
os.environ["GOOGLE_API_KEY"] = getpass('Enter Google API Key:')Preparing Evaluation Dataset
Create a dataset with factual prompts and their correct answers. This dataset will be used to compare Gemini's outputs against known truths.
import pandas as pd
eval_data = pd.DataFrame(
    {
        "inputs": [
            "Who developed the theory of general relativity?",
            "What are the primary functions of the liver in the human body?",
            "Explain what HTTP status code 404 means.",
            "What is the boiling point of water at sea level in Celsius?",
            "Name the largest planet in our solar system.",
            "What programming language is primarily used for developing iOS apps?",
        ],
        "ground_truth": [
            "Albert Einstein developed the theory of general relativity.",
            "The liver helps in detoxification, protein synthesis, and production of biochemicals necessary for digestion.",
            "HTTP 404 means 'Not Found' -- the server can't find the requested resource.",
            "The boiling point of water at sea level is 100 degrees Celsius.",
            "Jupiter is the largest planet in our solar system.",
            "Swift is the primary programming language used for iOS app development.",
        ],
    }
)
Generating Responses from Gemini
Define a helper function that queries the Gemini 1.5 Flash model through the Google Gen AI SDK (google-genai), then apply it to each prompt.
from google import genai
client = genai.Client()
def gemini_completion(prompt: str) -> str:
    response = client.models.generate_content(
        model="gemini-1.5-flash",
        contents=prompt
    )
    return response.text.strip()
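# Optional sanity check on a single prompt before running the full batch
# (uses only the gemini_completion helper defined above):
print(gemini_completion("Name the largest planet in our solar system."))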
eval_data["predictions"] = eval_data["inputs"].apply(gemini_completion)Evaluating Gemini Outputs with MLflow
Run an MLflow evaluation session comparing Gemini's predictions against the ground truth using metrics like answer_similarity, exact_match, latency, and token_count.
import mlflow
mlflow.set_tracking_uri("mlruns")
mlflow.set_experiment("Gemini Simple Metrics Eval")
with mlflow.start_run():
    results = mlflow.evaluate(
        model_type="question-answering",
        data=eval_data,
        predictions="predictions",
        targets="ground_truth",
        extra_metrics=[
            mlflow.metrics.genai.answer_similarity(),
            mlflow.metrics.exact_match(),
            mlflow.metrics.latency(),
            mlflow.metrics.token_count()
        ]
    )
    print("Aggregated Metrics:")
    print(results.metrics)
    # Save detailed table
    results.tables["eval_results_table"].to_csv("gemini_eval_results.csv", index=False)
Inspecting Detailed Results
Load and display the saved detailed evaluation results for thorough analysis.
results = pd.read_csv('gemini_eval_results.csv')
pd.set_option('display.max_colwidth', None)
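# Optionally surface rows the LLM judge scored low. The score column name below is an
# assumption (it can vary by MLflow version); check results.columns for the exact name.
score_col = "answer_similarity/v1/score"
if score_col in results.columns:
    print(results[results[score_col] <= 3])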
results
This practical guide enables efficient and transparent evaluation of LLMs like Gemini using MLflow's integrated metrics and APIs.