
Mastering Synthetic Data Generation with Synthetic Data Vault (SDV): A Complete Tutorial

This guide explains how to use the Synthetic Data Vault (SDV) library to create realistic synthetic tabular data, covering installation, metadata preparation, data generation, and quality evaluation.

Why Synthetic Data?

Real-world datasets are often expensive to collect, messy, and subject to strict privacy constraints. Synthetic data offers a practical alternative and is already used across AI applications:

  • Some large language models (LLMs) are trained partly on AI-generated text.
  • Fraud detection systems simulate rare edge cases.
  • Computer vision models pretrain using synthetic images.

Introduction to SDV

The Synthetic Data Vault (SDV) is an open-source Python library that generates realistic synthetic tabular data by learning the statistical patterns of a real dataset with machine learning techniques. This supports data sharing, testing, and model training with less exposure of the original records.

Installing SDV

Begin by installing the SDV library:

pip install sdv

Loading Your Dataset

Import the necessary modules and load your CSV dataset files from a local folder:

from sdv.io.local import CSVHandler
 
connector = CSVHandler()
FOLDER_NAME = '.'  # Assuming data is in the current directory
 
data = connector.read(folder_name=FOLDER_NAME)
salesDf = data['data']

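This reads each CSV file in the folder into a pandas DataFrame and returns them together as a dictionary. Each table is keyed by name, so data['data'] here assumes the main dataset is stored in a file named data.csv.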
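To make the shape of that return value concrete, here is a rough pandas-only sketch of reading a folder of CSV files into a dictionary of DataFrames. It assumes each table is keyed by its file name without the extension. The helper read_csv_folder is hypothetical and not part of SDV, and CSVHandler may apply its own parsing options.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_csv_folder(folder_name):
    """Hypothetical stand-in for a folder reader.

    Returns one DataFrame per CSV file, keyed by the file name
    without its extension.
    """
    folder = Path(folder_name)
    return {path.stem: pd.read_csv(path) for path in sorted(folder.glob("*.csv"))}


# Demo in a temporary folder so no real files are touched.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "data.csv").write_text("Date,Sales\n01-01-2024,100\n02-01-2024,150\n")
    tables = read_csv_folder(tmp)

table_names = list(tables)
sales = tables["data"]
print(table_names)
print(sales.shape)
```

Under that assumption, a folder containing several CSV files would yield one dictionary entry per file, and only the entry for the table of interest is passed on to the synthesizer.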

Preparing Metadata

SDV requires metadata that describes your dataset structure and column types. You can load existing metadata from a JSON file:

from sdv.metadata import Metadata
metadata = Metadata.load_from_json('metadata.json')

Metadata includes:

  • Table names and primary keys.
  • Column data types (categorical, numerical, datetime, etc.).
  • Optional formats, such as datetime patterns or regular expressions for IDs.
  • Table relationships for multi-table datasets.

Example metadata.json format:

{
  "METADATA_SPEC_VERSION": "V1",
  "tables": {
    "your_table_name": {
      "primary_key": "your_primary_key_column",
      "columns": {
        "your_primary_key_column": { "sdtype": "id", "regex_format": "T[0-9]{6}" },
        "date_column": { "sdtype": "datetime", "datetime_format": "%d-%m-%Y" },
        "category_column": { "sdtype": "categorical" },
        "numeric_column": { "sdtype": "numerical" }
      },
      "column_relationships": []
    }
  }
}
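Before relying on a hand-written metadata file, it can help to check that the declared formats actually match a few sample values. The sketch below builds the same placeholder structure as a Python dictionary, serializes it with the standard json module, and tests sample strings against the ID regex and the datetime pattern. All table names, column names, and sample values are hypothetical, and this checks only the formats, not whether SDV accepts the file.

```python
import json
import re
from datetime import datetime

# Placeholder names mirroring the example metadata.json above.
metadata_dict = {
    "METADATA_SPEC_VERSION": "V1",
    "tables": {
        "your_table_name": {
            "primary_key": "your_primary_key_column",
            "columns": {
                "your_primary_key_column": {"sdtype": "id", "regex_format": "T[0-9]{6}"},
                "date_column": {"sdtype": "datetime", "datetime_format": "%d-%m-%Y"},
                "category_column": {"sdtype": "categorical"},
                "numeric_column": {"sdtype": "numerical"},
            },
            "column_relationships": [],
        }
    },
}

# Writing this string to metadata.json produces a file in the layout shown above.
metadata_json = json.dumps(metadata_dict, indent=2)

columns = metadata_dict["tables"]["your_table_name"]["columns"]
id_pattern = columns["your_primary_key_column"]["regex_format"]
date_format = columns["date_column"]["datetime_format"]


def matches_id_format(value):
    """True if the whole string matches the declared ID regex."""
    return re.fullmatch(id_pattern, value) is not None


def matches_date_format(value):
    """True if the string parses with the declared datetime pattern."""
    try:
        datetime.strptime(value, date_format)
        return True
    except ValueError:
        return False


print(matches_id_format("T000123"), matches_id_format("T123"))
print(matches_date_format("31-12-2024"), matches_date_format("2024-12-31"))
```

Running the same two checks over the real ID and date columns is a cheap way to catch a mismatched pattern before training.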

Alternatively, you can automatically infer metadata:

metadata = Metadata.detect_from_dataframes(data)

Automatic detection can misclassify columns, so review the inferred types before training a model.
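One lightweight way to do that review is to compare each detected sdtype with the type you expect. The sketch below works on a plain dictionary in the same layout as the metadata.json example. My understanding is that metadata.to_dict() returns such a dictionary, but confirm this against your installed SDV version. The function find_sdtype_mismatches, the table name, and the detected values are hypothetical illustrations, not real detection output.

```python
def find_sdtype_mismatches(metadata_dict, table_name, expected_sdtypes):
    """Return {column: (detected, expected)} for columns whose sdtype differs.

    A column that is missing from the metadata is reported with a
    detected value of None.
    """
    columns = metadata_dict["tables"][table_name]["columns"]
    mismatches = {}
    for column_name, expected in expected_sdtypes.items():
        detected = columns.get(column_name, {}).get("sdtype")
        if detected != expected:
            mismatches[column_name] = (detected, expected)
    return mismatches


# Made-up detection result for illustration only.
detected_metadata = {
    "tables": {
        "data": {
            "columns": {
                "Date": {"sdtype": "categorical"},
                "Sales": {"sdtype": "numerical"},
            }
        }
    }
}
expected = {"Date": "datetime", "Sales": "numerical"}

mismatches = find_sdtype_mismatches(detected_metadata, "data", expected)
print(mismatches)
```

Any column that shows up in the result is a candidate for a manual correction in the metadata before fitting.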

Generating Synthetic Data

With metadata and data loaded, train a model and generate synthetic data:

from sdv.single_table import GaussianCopulaSynthesizer
 
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(data=salesDf)
synthetic_data = synthesizer.sample(num_rows=10000)

The num_rows parameter controls the size of the synthetic dataset.

Evaluating Synthetic Data Quality

SDV includes evaluation tools that compare the synthetic data against the original dataset:

from sdv.evaluation.single_table import evaluate_quality
 
quality_report = evaluate_quality(salesDf, synthetic_data, metadata)
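As background on what such a comparison measures: to my knowledge, the column-level part of the report is based on two SDMetrics measures, KSComplement for numerical columns and TVComplement for categorical ones. The sketch below re-implements both ideas from scratch with the standard library. It is an illustration of the underlying statistics under that assumption, not the library's code, and the function names and sample values are my own.

```python
from bisect import bisect_right
from collections import Counter


def tv_complement(real_values, synthetic_values):
    """1 minus the total variation distance between category frequencies."""
    real_freq = Counter(real_values)
    synth_freq = Counter(synthetic_values)
    n_real, n_synth = len(real_values), len(synthetic_values)
    categories = set(real_freq) | set(synth_freq)
    # Counter returns 0 for categories that appear in only one dataset.
    distance = 0.5 * sum(
        abs(real_freq[c] / n_real - synth_freq[c] / n_synth) for c in categories
    )
    return 1.0 - distance


def ks_complement(real_values, synthetic_values):
    """1 minus the two-sample Kolmogorov-Smirnov statistic.

    The statistic is the largest gap between the two empirical CDFs,
    which is always reached at one of the observed values.
    """
    real_sorted = sorted(real_values)
    synth_sorted = sorted(synthetic_values)
    max_gap = 0.0
    for x in set(real_sorted) | set(synth_sorted):
        real_cdf = bisect_right(real_sorted, x) / len(real_sorted)
        synth_cdf = bisect_right(synth_sorted, x) / len(synth_sorted)
        max_gap = max(max_gap, abs(real_cdf - synth_cdf))
    return 1.0 - max_gap


tv_score = tv_complement(["a", "a", "b", "b"], ["a", "b", "b", "b"])
ks_same = ks_complement([10, 20, 30, 40], [10, 20, 30, 40])
ks_disjoint = ks_complement([1, 2], [3, 4])
print(tv_score, ks_same, ks_disjoint)
```

Both scores range from 0 to 1, where 1 means the real and synthetic columns have identical distributions by that measure.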

You can also visualize column distributions:

from sdv.evaluation.single_table import get_column_plot
 
fig = get_column_plot(
    real_data=salesDf,
    synthetic_data=synthetic_data,
    column_name='Sales',
    metadata=metadata
)
fig.show()

The plot overlays the real and synthetic distributions of the 'Sales' column. In this example, the two distributions are similar.

Comparing Monthly Sales Trends

Use pandas and matplotlib to compare average monthly sales:

import pandas as pd
import matplotlib.pyplot as plt

# Parse the date strings (day-month-year) into datetime values.
salesDf['Date'] = pd.to_datetime(salesDf['Date'], format='%d-%m-%Y')
synthetic_data['Date'] = pd.to_datetime(synthetic_data['Date'], format='%d-%m-%Y')

# Derive a 'YYYY-MM' label for grouping; note this adds a column to both DataFrames.
salesDf['Month'] = salesDf['Date'].dt.to_period('M').astype(str)
synthetic_data['Month'] = synthetic_data['Date'].dt.to_period('M').astype(str)

# Average sales per month for each dataset.
actual_avg_monthly = salesDf.groupby('Month')['Sales'].mean().rename('Actual Average Sales')
synthetic_avg_monthly = synthetic_data.groupby('Month')['Sales'].mean().rename('Synthetic Average Sales')

# Align both series by month; months missing from one dataset are filled with 0.
avg_monthly_comparison = pd.concat([actual_avg_monthly, synthetic_avg_monthly], axis=1).fillna(0)

plt.figure(figsize=(10, 6))
plt.plot(avg_monthly_comparison.index, avg_monthly_comparison['Actual Average Sales'], label='Actual Average Sales', marker='o')
plt.plot(avg_monthly_comparison.index, avg_monthly_comparison['Synthetic Average Sales'], label='Synthetic Average Sales', marker='o')

plt.title('Average Monthly Sales Comparison: Actual vs Synthetic')
plt.xlabel('Month')
plt.ylabel('Average Sales')
plt.xticks(rotation=45)
plt.grid(True)
plt.legend()
plt.ylim(bottom=0)
plt.tight_layout()
plt.show()

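In this example, the synthetic averages closely follow the real monthly sales trend, with only small differences. Keep in mind that fillna(0) makes any month missing from one dataset appear as zero on the chart, which can look like a sharp drop.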
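A chart gives a visual impression, and a small table of per-month gaps can back it up with numbers. The sketch below aligns two month-indexed Series on the months they share, so no month is filled with zero, and computes absolute and percentage differences. The helper monthly_gap_table and the toy values are hypothetical and are not results from the tutorial's dataset.

```python
import pandas as pd


def monthly_gap_table(actual, synthetic):
    """Align two month-indexed Series on shared months and compute the gaps."""
    table = pd.concat(
        [actual.rename("actual"), synthetic.rename("synthetic")],
        axis=1,
        join="inner",  # keep only months present in both series
    )
    table["abs_diff"] = (table["synthetic"] - table["actual"]).abs()
    # Percentage gap relative to the real value; a real value of 0 yields inf or NaN.
    table["pct_diff"] = 100 * table["abs_diff"] / table["actual"].abs()
    return table


# Toy monthly averages for illustration only.
actual = pd.Series({"2024-01": 100.0, "2024-02": 200.0, "2024-03": 150.0})
synthetic = pd.Series({"2024-01": 110.0, "2024-02": 190.0})

gaps = monthly_gap_table(actual, synthetic)
mean_pct_gap = gaps["pct_diff"].mean()
print(gaps)
print(mean_pct_gap)
```

With the variables from the plotting code, the equivalent call would be monthly_gap_table(actual_avg_monthly, synthetic_avg_monthly).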

Summary

Using SDV, you can generate synthetic tabular data that mimics a real dataset and preserves key statistical properties, such as column distributions. This supports data analysis and machine learning workflows where privacy or data-sharing restrictions limit access to the original data.

Explore the full tutorial and source code on GitHub, and follow the project for updates and community resources.
