Building Production-Grade AutoML Pipelines with AutoGluon
Create efficient AutoML pipelines for tabular models using AutoGluon.
Overview
In this tutorial, we build a production-grade tabular machine learning pipeline using AutoGluon, processing a real-world mixed-type dataset from raw ingestion to deployment-ready artifacts.
Setting Up the Environment
We start by installing the required libraries:
!pip -q install -U "autogluon==1.5.0" "scikit-learn>=1.3" "pandas>=2.0" "numpy>=1.24"
Then we configure the necessary imports:
import os, time, json, warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, log_loss, accuracy_score, classification_report
from autogluon.tabular import TabularPredictor
Preparing the Dataset
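Using scikit-learn's fetch_openml(), we load the Titanic dataset (OpenML ID 40945), a real-world dataset with mixed numeric and categorical columns. We cast the target to integers and drop columns that would leak the outcome or add little signal: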
df = fetch_openml(data_id=40945, as_frame=True).frame
target = "survived"
df[target] = df[target].astype(int)
drop_cols = [c for c in ["boat", "body", "home.dest"] if c in df.columns]
df = df.drop(columns=drop_cols, errors="ignore")
We validate the dataset and perform a stratified train-test split:
train_df, test_df = train_test_split(
    df,
    test_size=0.2,
    random_state=42,
    stratify=df[target],
)
Model Initialization
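Before configuring the predictor, it can be worth confirming that the split from the previous step behaves as intended. The following is a minimal sketch rather than part of the original pipeline: `check_split` and its `tol` parameter are hypothetical names, and the toy frames (made-up values) stand in for `train_df` and `test_df`. It checks that the target is a clean 0/1 column, that no row appears in both partitions, and that stratification kept the positive rate close in both.

```python
import pandas as pd


def check_split(train_df, test_df, target, tol=0.05):
    """Run basic sanity checks on a binary train/test split.

    Returns a dict with an "ok" flag and a list of human-readable problems.
    `tol` is an assumed tolerance on the positive-rate gap, not an AutoGluon setting.
    """
    problems = []
    for name, part in (("train", train_df), ("test", test_df)):
        if target not in part.columns:
            problems.append(f"{name}: target column '{target}' is missing")
        elif part[target].isna().any():
            problems.append(f"{name}: target contains missing values")
        elif not set(part[target].unique()) <= {0, 1}:
            problems.append(f"{name}: target is not coded as 0/1")
    if problems:
        # The rate comparison below only makes sense for a clean 0/1 target.
        return {"ok": False, "problems": problems}

    # Rows must not appear in both partitions (compared by index label).
    shared = train_df.index.intersection(test_df.index)
    if len(shared) > 0:
        problems.append(f"{len(shared)} index labels appear in both splits")

    # A stratified split should keep the positive rate nearly identical.
    train_rate = float(train_df[target].mean())
    test_rate = float(test_df[target].mean())
    if abs(train_rate - test_rate) > tol:
        problems.append(
            f"positive rate differs: train={train_rate:.3f}, test={test_rate:.3f}"
        )

    return {
        "ok": not problems,
        "problems": problems,
        "train_pos_rate": train_rate,
        "test_pos_rate": test_rate,
    }


# Toy frames (made-up values) standing in for train_df and test_df.
toy_train = pd.DataFrame({"survived": [0, 1, 0, 1, 0, 1, 0, 1]}, index=range(8))
toy_test = pd.DataFrame({"survived": [0, 1, 0, 1]}, index=range(8, 12))
good = check_split(toy_train, toy_test, "survived")

# Reusing training rows as the test set should be flagged.
leaky = check_split(toy_train, toy_train.iloc[:4], "survived")
print(good["ok"], leaky["problems"])
```

With the tutorial's objects the call would be `check_split(train_df, test_df, target)`. Returning a report instead of raising keeps the check usable both interactively and inside an automated job that decides for itself how strict to be.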
We check for GPU availability to select the training preset:
def has_gpu():
    try:
        import torch
        return torch.cuda.is_available()
    except Exception:
        return False
presets = "extreme" if has_gpu() else "best_quality"
predictor = TabularPredictor(
    label=target,
    eval_metric="roc_auc",
    path="/content/autogluon_titanic_advanced",
    verbosity=2
)
Training the Model
Next, we fit the predictor:
start = time.time()
predictor.fit(
    train_data=train_df,
    presets=presets,
    time_limit=7 * 60,
    num_bag_folds=5,
    num_stack_levels=2,
    refit_full=False
)
train_time = time.time() - start
print(f"\nTraining done in {train_time:.1f}s with presets='{presets}'")
Evaluating the Model
Using the test set, we evaluate our model and generate key metrics:
lb = predictor.leaderboard(test_df, silent=True)
print("\n=== Leaderboard (top 15) ===")
display(lb.head(15))
proba = predictor.predict_proba(test_df)
pred = predictor.predict(test_df)
Analyzing Model Behavior
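The imports at the top include roc_auc_score, log_loss, and accuracy_score, and the probabilities and labels predicted on the test set can be passed to them directly. The sketch below shows one way to do that, together with a simple per-group accuracy breakdown. `summarize_metrics` and `accuracy_by_group` are hypothetical helper names, and the five-row toy data is made up purely to show the call pattern.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, log_loss, roc_auc_score


def summarize_metrics(y_true, proba_pos, threshold=0.5):
    """Return ROC AUC, log loss, and accuracy for 0/1 labels and P(class == 1)."""
    y_true = np.asarray(y_true).astype(int)
    proba_pos = np.asarray(proba_pos, dtype=float)
    # Hard labels are derived from the probabilities with an assumed 0.5 cutoff.
    y_pred = (proba_pos >= threshold).astype(int)
    return {
        "roc_auc": float(roc_auc_score(y_true, proba_pos)),
        "log_loss": float(log_loss(y_true, proba_pos, labels=[0, 1])),
        "accuracy": float(accuracy_score(y_true, y_pred)),
    }


def accuracy_by_group(frame, group_col, y_true, y_pred):
    """Return row count and accuracy for each value of `group_col`.

    Rows whose group value is missing are dropped by pandas' groupby default.
    """
    table = pd.DataFrame(
        {
            "group": np.asarray(frame[group_col]),
            "correct": np.asarray(y_true) == np.asarray(y_pred),
        }
    )
    out = table.groupby("group")["correct"].agg(n="size", accuracy="mean")
    return out.sort_values("n", ascending=False)


# Toy inputs (made-up values) standing in for test_df, its labels, and proba.
toy = pd.DataFrame({"sex": ["female", "female", "male", "male", "male"]})
y_true = [1, 1, 0, 0, 1]
proba_pos = [0.9, 0.4, 0.2, 0.6, 0.7]
y_pred = [int(p >= 0.5) for p in proba_pos]

overall = summarize_metrics(y_true, proba_pos)
by_sex = accuracy_by_group(toy, "sex", y_true, y_pred)
print(overall)
print(by_sex)
```

With the tutorial's objects, the equivalent calls would be along the lines of `summarize_metrics(test_df[target], proba[1])` and `accuracy_by_group(test_df, "sex", test_df[target], pred)`. This assumes that `predict_proba` returns a DataFrame whose positive-class column is labelled `1` and that the dataset still contains its `sex` column.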
We can perform group analysis and feature importance evaluation:
fi = predictor.feature_importance(test_df, silent=True)
print("\n=== Feature importance (top 20) ===")
display(fi.head(20))
Optimizing for Inference
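We can collapse the bagged models for faster inference. Calling refit_full() retrains each bagged ensemble as a single model on the full training data and returns a mapping from the original model names to their refit counterparts: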
t0 = time.time()
refit_map = predictor.refit_full()
t_refit = time.time() - t0  # elapsed time for the refit step
print(f"\nrefit_full completed in {t_refit:.1f}s")
lb_full = predictor.leaderboard(test_df, silent=True)
print("\n=== Leaderboard after refit_full (top 15) ===")
display(lb_full.head(15))
Conclusion
With this end-to-end AutoGluon workflow, we go from raw tabular data to a trained predictor that has been evaluated on held-out data, inspected through feature importance, and refit for faster inference, leaving it ready to package for production.
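As one small step toward deployment-ready artifacts, the run configuration can be written next to the predictor as a JSON file. This is a sketch under assumptions, not part of the original pipeline: `write_run_summary` and the file name `run_summary.json` are hypothetical, and the values are the settings used in this tutorial rather than measured results. It needs only the standard library.

```python
import json
import os
import tempfile
import time


def write_run_summary(out_dir, summary, filename="run_summary.json"):
    """Write `summary` (a JSON-serializable dict) to out_dir and return the path."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, filename)
    payload = dict(summary)
    payload["written_at"] = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
    # Write to a temporary file first so readers never see a half-written file.
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as fh:
        json.dump(payload, fh, indent=2, sort_keys=True)
    os.replace(tmp_path, path)
    return path


# Settings taken from this tutorial; metrics would be added once computed.
summary = {
    "label": "survived",
    "eval_metric": "roc_auc",
    "presets": "best_quality",
    "time_limit_s": 7 * 60,
    "predictor_path": "/content/autogluon_titanic_advanced",
}

# A temporary directory stands in for the real artifact location.
with tempfile.TemporaryDirectory() as out_dir:
    saved_path = write_run_summary(out_dir, summary)
    with open(saved_path, encoding="utf-8") as fh:
        loaded = json.load(fh)

print(sorted(loaded))
```

Keeping the label, metric, preset, and time budget alongside the model directory makes it easier to tell later which configuration produced a given predictor.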