
Building Portable In-Database Feature Engineering Pipelines

Learn to use Ibis for in-database feature engineering without pulling data locally.

Introduction to Ibis and DuckDB

In this tutorial, we use Ibis to build a portable, in-database feature engineering pipeline that looks and feels like Pandas but executes entirely inside the database. We connect to DuckDB, register data safely inside the backend, and define complex transformations with window functions and aggregations, without ever pulling raw data into local memory.

By keeping all transformations lazy and backend-agnostic, we write the analytics code once in Python and rely on Ibis to translate it into efficient SQL for whichever backend we connect to.
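Lazy evaluation means an Ibis expression is only a description of a computation; it is compiled to SQL and run when results are actually requested. A toy pure-Python sketch of this deferred-expression idea (an illustration of the concept, not Ibis internals):

```python
# Toy deferred-expression sketch: building an expression records the
# computation; nothing touches data until .evaluate() is called.
class Column:
    def __init__(self, name):
        self.name = name

    def __truediv__(self, other):
        return Ratio(self, other)  # building the expression is lazy

    def evaluate(self, table):
        return table[self.name]

class Ratio:
    def __init__(self, left, right):
        self.left, self.right = left, right

    def evaluate(self, table):
        # Division happens only here, on demand, against a concrete "backend".
        return [a / b for a, b in zip(self.left.evaluate(table),
                                      self.right.evaluate(table))]

expr = Column("bill_length_mm") / Column("bill_depth_mm")  # no data touched yet
data = {"bill_length_mm": [39.1, 46.5], "bill_depth_mm": [18.7, 17.9]}
print(expr.evaluate(data))  # evaluation is explicit and deferred
```

Ibis works analogously, except the "evaluate" step compiles the expression tree to SQL and ships it to the database.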

Setting Up Your Environment

We install the required libraries and initialize the Ibis environment. Run the following command to install necessary dependencies:

!pip -q install "ibis-framework[duckdb,examples]" duckdb pyarrow pandas

Next, we establish a DuckDB connection and enable interactive mode, which automatically executes and renders expressions when they are printed. The pipeline itself is still defined lazily and compiled to SQL by the backend.

import ibis
from ibis import _
 
print("Ibis version:", ibis.__version__)
 
con = ibis.duckdb.connect()
ibis.options.interactive = True

Loading Data into DuckDB

We load the Penguins dataset and explicitly register it inside the DuckDB catalog to ensure it is available for SQL execution. The following code snippet demonstrates this:

try:
    # Newer Ibis versions can fetch example data directly into a backend.
    base_expr = ibis.examples.penguins.fetch(backend=con)
except TypeError:
    base_expr = ibis.examples.penguins.fetch()

if "penguins" not in con.list_tables():
    try:
        con.create_table("penguins", base_expr, overwrite=True)
    except Exception:
        # Fall back to materializing locally if the expression
        # cannot be registered directly.
        con.create_table("penguins", base_expr.execute(), overwrite=True)

t = con.table("penguins")
print(t.schema())
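The try/except blocks above follow Python's EAFP idiom ("easier to ask forgiveness than permission"): attempt the newer API signature first and fall back gracefully when the installed Ibis version does not support it. A minimal stdlib sketch of the same pattern, using a hypothetical fetch function for illustration:

```python
def fetch(*, backend=None):
    # Hypothetical stand-in for an API whose newer versions accept `backend`.
    if backend is not None:
        # Simulate an older version that rejects the keyword.
        raise TypeError("unexpected keyword argument 'backend'")
    return "fetched without backend"

try:
    result = fetch(backend="duckdb")  # preferred, newer signature
except TypeError:
    result = fetch()                  # graceful fallback for older versions

print(result)
```

This keeps the tutorial runnable across Ibis releases without pinning an exact version.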

Defining the Feature Engineering Pipeline

We define a reusable feature engineering pipeline using pure Ibis expressions. This involves computing derived features, applying data cleaning, and using window functions and grouped aggregations. Here's the implementation:

def penguin_feature_pipeline(penguins):
    # Derived per-row features.
    base = penguins.mutate(
        bill_ratio=_.bill_length_mm / _.bill_depth_mm,
        is_male=(_.sex == "male").ifelse(1, 0),
    )

    # Drop rows with missing values in any column the features rely on.
    cleaned = base.filter(
        _.bill_length_mm.notnull()
        & _.bill_depth_mm.notnull()
        & _.body_mass_g.notnull()
        & _.flipper_length_mm.notnull()
        & _.species.notnull()
        & _.island.notnull()
        & _.year.notnull()
    )

    # Per-species window, and a trailing 3-year window per island.
    w_species = ibis.window(group_by=[cleaned.species])
    w_island_year = ibis.window(
        group_by=[cleaned.island],
        order_by=[cleaned.year],
        preceding=2,
        following=0,
    )

    feat = cleaned.mutate(
        species_avg_mass=cleaned.body_mass_g.mean().over(w_species),
        species_std_mass=cleaned.body_mass_g.std().over(w_species),
        # Z-score of body mass within each species.
        mass_z=(
            cleaned.body_mass_g
            - cleaned.body_mass_g.mean().over(w_species)
        ) / cleaned.body_mass_g.std().over(w_species),
        island_mass_rank=cleaned.body_mass_g.rank().over(
            ibis.window(group_by=[cleaned.island])
        ),
        rolling_3yr_island_avg_mass=cleaned.body_mass_g.mean().over(
            w_island_year
        ),
    )

    # Aggregate the row-level features per (species, island, year).
    return feat.group_by(["species", "island", "year"]).agg(
        n=feat.count(),
        avg_mass=feat.body_mass_g.mean(),
        avg_flipper=feat.flipper_length_mm.mean(),
        avg_bill_ratio=feat.bill_ratio.mean(),
        avg_mass_z=feat.mass_z.mean(),
        avg_rolling_3yr_mass=feat.rolling_3yr_island_avg_mass.mean(),
        pct_male=feat.is_male.mean(),
    ).order_by(["species", "island", "year"])
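The two windowed features are easy to sanity-check in plain Python: mass_z standardizes each body mass against its group's mean and (sample) standard deviation, and the rolling average covers the current row plus the two preceding rows (preceding=2, following=0). A toy-data check of those semantics, no Ibis required:

```python
# Plain-Python sanity check of the window semantics used in the pipeline.
from statistics import mean, stdev

masses = [3400, 3800, 4200, 4600]  # toy body-mass values for one group

# mass_z: standardize against the group mean and sample standard deviation
# (stdev is the sample std, matching SQL's STDDEV_SAMP).
mu, sigma = mean(masses), stdev(masses)
mass_z = [(m - mu) / sigma for m in masses]

# rolling_3yr_island_avg_mass: trailing window of the current row plus the
# two preceding rows, i.e. preceding=2, following=0.
rolling = [mean(masses[max(0, i - 2): i + 1]) for i in range(len(masses))]

print(mass_z)
print(rolling)   # [3400, 3600, 3800, 4200]
```

Note that the first rows of each trailing window average over fewer than three values, exactly as a SQL window with `2 PRECEDING` does at the start of a partition.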

Executing the Feature Pipeline

We invoke the feature pipeline and compile it into DuckDB SQL. Here's how:

features = penguin_feature_pipeline(t)
print(con.compile(features))

try:
    df = features.to_pandas()
except Exception:
    df = features.execute()

display(df.head())
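One quick way to validate the output: pct_male is the mean of the 0/1 is_male flag, which is exactly the fraction of male penguins in each (species, island, year) group. A toy stdlib version of that grouped aggregation (rows are made-up values for illustration):

```python
from collections import defaultdict

# (species, island, year, is_male) toy rows -- illustration only.
rows = [
    ("Adelie", "Torgersen", 2007, 1),
    ("Adelie", "Torgersen", 2007, 0),
    ("Adelie", "Torgersen", 2007, 1),
    ("Gentoo", "Biscoe", 2008, 0),
]

groups = defaultdict(list)
for species, island, year, is_male in rows:
    groups[(species, island, year)].append(is_male)

# Mean of a 0/1 flag == fraction of 1s in the group.
pct_male = {key: sum(flags) / len(flags) for key, flags in groups.items()}
print(pct_male)
```

This is the same trick the pipeline uses: encoding a boolean as 0/1 turns "fraction satisfying a condition" into a plain average that any SQL backend can compute.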

Materializing and Exporting Results

Finally, we materialize the engineered features, verify them, and export the results:

con.create_table("penguin_features", features, overwrite=True)

feat_tbl = con.table("penguin_features")

try:
    preview = feat_tbl.limit(10).to_pandas()
except Exception:
    preview = feat_tbl.limit(10).execute()

display(preview)

out_path = "/content/penguin_features.parquet"
con.raw_sql(f"COPY penguin_features TO '{out_path}' (FORMAT PARQUET);")
print(out_path)

Final Thoughts

In this tutorial, we constructed, compiled, and executed an advanced feature engineering workflow fully inside DuckDB using Ibis. This approach keeps computation close to the data, minimizes unnecessary data movement, and maintains a single, reusable Python codebase that scales from local experimentation to production databases.
