Building Portable In-Database Feature Engineering Pipelines
Learn to use Ibis for in-database feature engineering without pulling data locally.
Introduction to Ibis and DuckDB
In this tutorial, we use Ibis to build a portable, in-database feature engineering pipeline that looks and feels like Pandas but executes entirely inside the database. We connect to DuckDB, register data safely inside the backend, and define complex transformations with window functions and aggregations, without ever pulling raw data into local memory.
By keeping all transformations lazy and backend-agnostic, we can write analytics code once in Python and rely on Ibis to translate it into efficient SQL.
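To make the "lazy, then compile" idea concrete, here is a deliberately tiny sketch of a deferred expression in plain Python. Everything in it (the `LazyTable` class and its methods) is a hypothetical illustration of the pattern, not Ibis internals: method calls only record intent, and SQL text is produced only when `compile()` is called.

```python
# A toy deferred expression: filter() and select() record intent without
# touching any data; SQL is only produced when compile() is called.
# This is a conceptual sketch of lazy evaluation, not Ibis internals.

class LazyTable:
    def __init__(self, name, predicate=None, columns=None):
        self.name, self.predicate, self.columns = name, predicate, columns

    def filter(self, predicate):
        # Return a new expression; nothing executes here
        return LazyTable(self.name, predicate, self.columns)

    def select(self, *columns):
        return LazyTable(self.name, self.predicate, list(columns))

    def compile(self):
        # Only now do we render the recorded operations as SQL text
        cols = ", ".join(self.columns) if self.columns else "*"
        sql = f"SELECT {cols} FROM {self.name}"
        if self.predicate:
            sql += f" WHERE {self.predicate}"
        return sql

expr = LazyTable("penguins").filter("body_mass_g > 4000").select("species", "island")
print(expr.compile())  # SELECT species, island FROM penguins WHERE body_mass_g > 4000
```

Ibis works on the same principle at much larger scale: expressions are built up lazily, and each backend renders them into its own SQL dialect at execution time.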
Setting Up Your Environment
We install the required libraries and initialize the Ibis environment. Run the following command to install necessary dependencies:
!pip -q install "ibis-framework[duckdb,examples]" duckdb pyarrow pandas

Next, we establish a DuckDB connection and enable interactive mode so that expressions display result previews automatically while all computation still runs inside the backend.
import ibis
from ibis import _
print("Ibis version:", ibis.__version__)
con = ibis.duckdb.connect()
ibis.options.interactive = True

Loading Data into DuckDB
We load the Penguins dataset and explicitly register it inside the DuckDB catalog to ensure it is available for SQL execution. The following code snippet demonstrates this:
try:
    base_expr = ibis.examples.penguins.fetch(backend=con)
except TypeError:
    # Older Ibis versions do not accept a backend argument
    base_expr = ibis.examples.penguins.fetch()

if "penguins" not in con.list_tables():
    try:
        con.create_table("penguins", base_expr, overwrite=True)
    except Exception:
        # Fall back to materializing locally if the expression
        # cannot be created directly in the backend
        con.create_table("penguins", base_expr.execute(), overwrite=True)

t = con.table("penguins")
print(t.schema())

Defining the Feature Engineering Pipeline
We define a reusable feature engineering pipeline using pure Ibis expressions. This involves computing derived features, applying data cleaning, and using window functions and grouped aggregations. Here's the implementation:
def penguin_feature_pipeline(penguins):
    # Derived per-row features
    base = penguins.mutate(
        bill_ratio=_.bill_length_mm / _.bill_depth_mm,
        is_male=(_.sex == "male").ifelse(1, 0),
    )

    # Drop rows with missing values in any column we rely on
    cleaned = base.filter(
        _.bill_length_mm.notnull()
        & _.bill_depth_mm.notnull()
        & _.body_mass_g.notnull()
        & _.flipper_length_mm.notnull()
        & _.species.notnull()
        & _.island.notnull()
        & _.year.notnull()
    )

    # Window over all rows of the same species
    w_species = ibis.window(group_by=[cleaned.species])

    # Rolling frame: current row plus the two preceding rows
    # within each island, ordered by year
    w_island_year = ibis.window(
        group_by=[cleaned.island],
        order_by=[cleaned.year],
        preceding=2,
        following=0,
    )

    feat = cleaned.mutate(
        species_avg_mass=cleaned.body_mass_g.mean().over(w_species),
        species_std_mass=cleaned.body_mass_g.std().over(w_species),
        mass_z=(
            cleaned.body_mass_g
            - cleaned.body_mass_g.mean().over(w_species)
        ) / cleaned.body_mass_g.std().over(w_species),
        island_mass_rank=cleaned.body_mass_g.rank().over(
            ibis.window(group_by=[cleaned.island])
        ),
        rolling_3yr_island_avg_mass=cleaned.body_mass_g.mean().over(
            w_island_year
        ),
    )

    # Aggregate the engineered features by species, island, and year
    return feat.group_by(["species", "island", "year"]).agg(
        n=feat.count(),
        avg_mass=feat.body_mass_g.mean(),
        avg_flipper=feat.flipper_length_mm.mean(),
        avg_bill_ratio=feat.bill_ratio.mean(),
        avg_mass_z=feat.mass_z.mean(),
        avg_rolling_3yr_mass=feat.rolling_3yr_island_avg_mass.mean(),
        pct_male=feat.is_male.mean(),
    ).order_by(["species", "island", "year"])

Executing the Feature Pipeline
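Before running the pipeline, it helps to check the rolling-window semantics by hand. The sketch below is plain Python, not Ibis: it mimics what a frame with `preceding=2, following=0`, ordered by year, computes within a single island. It is a simplification, since the real frame is row-based and an island can contribute several rows per year.

```python
# Plain-Python sketch of the rolling frame used above: for each value,
# average over the current element and up to two preceding elements.
# Toy data for one island, already sorted by year.

def rolling_mean(values, preceding=2):
    out = []
    for i in range(len(values)):
        # Frame = up to `preceding` earlier rows plus the current row
        window = values[max(0, i - preceding): i + 1]
        out.append(sum(window) / len(window))
    return out

masses_by_year = [3800.0, 4000.0, 4200.0, 4400.0]
print(rolling_mean(masses_by_year))  # [3800.0, 3900.0, 4000.0, 4200.0]
```

Note how the first two frames are shorter than three rows: SQL window frames shrink at partition boundaries rather than emitting NULLs, and the Ibis expression above behaves the same way.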
We invoke the feature pipeline and compile it into DuckDB SQL. Here's how:
features = penguin_feature_pipeline(t)
print(con.compile(features))

try:
    df = features.to_pandas()
except Exception:
    # Older Ibis versions use execute() instead of to_pandas()
    df = features.execute()

display(df.head())

Materializing and Exporting Results
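The `mass_z` column we are about to materialize is a standard per-group z-score. A plain-Python sketch (toy numbers, not Ibis code) of the same computation for one species group:

```python
# Plain-Python sketch of the per-species z-score (mass_z) from the
# pipeline: standardize each mass against its group's mean and sample
# standard deviation (matching the sample stddev used by .std() here).
import statistics

def group_zscores(values):
    mu = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [(v - mu) / sd for v in values]

masses = [3600.0, 3800.0, 4000.0]
print(group_zscores(masses))  # [-1.0, 0.0, 1.0]
```

Within each group the standardized values center on zero, which is why `avg_mass_z` in the aggregated output stays small in magnitude.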
Finally, we materialize the engineered features, verify them, and export the results:
con.create_table("penguin_features", features, overwrite=True)
feat_tbl = con.table("penguin_features")

try:
    preview = feat_tbl.limit(10).to_pandas()
except Exception:
    preview = feat_tbl.limit(10).execute()

display(preview)

out_path = "/content/penguin_features.parquet"
con.raw_sql(f"COPY penguin_features TO '{out_path}' (FORMAT PARQUET);")
print(out_path)

Final Thoughts
In this tutorial, we constructed, compiled, and executed an advanced feature engineering workflow fully inside DuckDB using Ibis. This approach keeps computation close to the data, minimizes unnecessary data movement, and maintains a single, reusable Python codebase that scales from local experimentation to production databases.