
Train High-Performance Supervised Models with Active Learning and Just a Few Labels

Active learning lets a model query the most informative unlabeled samples for labeling, enabling high accuracy with far fewer annotations; this tutorial simulates a 20-query annotation budget with scikit-learn.

Supervised models typically demand lots of labeled data, but manual annotation is costly and slow. Active learning offers a practical path: the model actively chooses which samples to label, dramatically reducing annotation effort while preserving — or even improving — performance.

Why labeled data is a bottleneck

Manually annotating thousands of samples is expensive and time-consuming. In many real projects you start with a large pool of unlabeled examples and only a small fraction of labeled ones. Rather than labeling randomly or in bulk, active learning focuses human effort where it matters most.

What is active learning and how it helps

Active learning is a strategy where the learner queries an oracle (typically a human annotator) for labels of the most informative samples. Instead of passively training on a fully labeled dataset, the model iteratively selects the examples it is most uncertain about, asks for their labels, and retrains. This interactive loop climbs the learning curve faster, reaching comparable accuracy with far fewer annotated samples.

Typical active learning workflow

  • Start with a small, labeled seed set and train a weak initial model.
  • Use that model to predict class probabilities and compute a confidence or uncertainty score over the unlabeled pool (see the sketch after this list).
  • Query labels for the most uncertain samples and add them to the labeled set.
  • Retrain the model and repeat until the annotation budget is exhausted or performance plateaus.
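
To make the uncertainty step concrete, here is a minimal sketch of three common uncertainty scores computed from a classifier's predicted probabilities. "Least confidence" is the strategy used in this tutorial; margin and entropy are common alternatives. The helper function names are illustrative, not part of any library.

import numpy as np

def least_confidence(proba):
    # 1 minus the probability of the most likely class; higher = more uncertain
    return 1.0 - proba.max(axis=1)

def margin(proba):
    # Gap between the top two class probabilities; negated so higher = more uncertain
    part = np.sort(proba, axis=1)
    return -(part[:, -1] - part[:, -2])

def entropy(proba):
    # Shannon entropy of the predicted distribution; higher = more uncertain
    return -np.sum(proba * np.log(proba + 1e-12), axis=1)

# Example: three binary predictions, from confident to maximally uncertain
proba = np.array([[0.95, 0.05], [0.60, 0.40], [0.50, 0.50]])
print(least_confidence(proba))  # [0.05 0.4  0.5 ] -> the last sample is queried first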

The example below demonstrates a practical simulation using sklearn, a logistic regression classifier, and the "least confidence" query strategy.

Setup: install and import required libraries

pip install numpy scikit-learn matplotlib

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

Experiment parameters

SEED = 42 # For reproducibility
N_SAMPLES = 1000 # Total number of data points
INITIAL_LABELED_PERCENTAGE = 0.10 # Start with only 10% of the pool labeled
NUM_QUERIES = 20 # Number of times we ask the "human" to label a confusing sample

In this simulation NUM_QUERIES = 20 represents an annotation budget: the model will request labels for 20 of the most uncertain samples, and we simulate the human oracle by revealing the true labels automatically.
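
In a real project the oracle is a person, not a lookup. As a hypothetical sketch, the annotation step inside the loop could prompt at the console instead of reading the true label (the variable names match the loop code further below):

# Hypothetical replacement for the simulated oracle: ask a human for the label
# of the sample the model selected, instead of looking it up in y_unlabeled_full.
label = int(input(f"Label for sample {query_index_full} (0 or 1): "))
y_query = np.array([label])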

Data generation and splitting for active learning

X, y = make_classification(
    n_samples=N_SAMPLES, n_features=10, n_informative=5, n_redundant=0,
    n_classes=2, n_clusters_per_class=1, flip_y=0.1, random_state=SEED
)
 
# 1. Split into 90% Pool (samples to be queried) and 10% Test (final evaluation)
X_pool, X_test, y_pool, y_test = train_test_split(
    X, y, test_size=0.10, random_state=SEED, stratify=y
)
 
# 2. Split the 90% Pool into Initial Labeled (10% of the pool) and Unlabeled (90% of the pool)
X_labeled_current, X_unlabeled_full, y_labeled_current, y_unlabeled_full = train_test_split(
    X_pool, y_pool, test_size=1.0 - INITIAL_LABELED_PERCENTAGE,
    random_state=SEED, stratify=y_pool
)
 
# A set to track indices in the unlabeled pool for efficient querying and removal
unlabeled_indices_set = set(range(X_unlabeled_full.shape[0]))
 
print(f"Initial Labeled Samples (STARTING N): {len(y_labeled_current)}")
print(f"Unlabeled Pool Samples: {len(unlabeled_indices_set)}")

Initial training and baseline evaluation

labeled_size_history = []
accuracy_history = []
 
# Train the baseline model on the small initial labeled set
baseline_model = LogisticRegression(random_state=SEED, max_iter=2000)
baseline_model.fit(X_labeled_current, y_labeled_current)
 
# Evaluate performance on the held-out test set
y_pred_init = baseline_model.predict(X_test)
accuracy_init = accuracy_score(y_test, y_pred_init)
 
# Record the baseline point for the plot (N=90, accuracy 0.8800 in the example run)
labeled_size_history.append(len(y_labeled_current))
accuracy_history.append(accuracy_init)
 
print(f"INITIAL BASELINE (N={labeled_size_history[0]}): Test Accuracy: {accuracy_history[0]:.4f}")

Active learning loop (query, label, retrain)

current_model = baseline_model # Start the loop with the baseline model
 
print(f"\nStarting Active Learning Loop ({NUM_QUERIES} Queries)...")
 
# -----------------------------------------------
# The Active Learning Loop (Query, Annotate, Retrain, Evaluate)
# Purpose: Run NUM_QUERIES iterations to demonstrate strategic labeling gains.
# -----------------------------------------------
for i in range(NUM_QUERIES):
    if not unlabeled_indices_set:
        print("Unlabeled pool is empty. Stopping.")
        break
    
    # --- A. QUERY STRATEGY: Find the Least Confident Sample ---
    # 1. Get probability predictions from the CURRENT model for every row of the
    #    original unlabeled array (rows already queried are filtered out in step 3)
    probabilities = current_model.predict_proba(X_unlabeled_full)
    max_probabilities = np.max(probabilities, axis=1)
 
    # 2. Calculate Uncertainty Score (1 - Max Confidence)
    uncertainty_scores = 1 - max_probabilities
 
    # 3. Identify the index of the sample with the MAXIMUM uncertainty score
    current_indices_list = list(unlabeled_indices_set)
    current_uncertainty = uncertainty_scores[current_indices_list]
    most_uncertain_idx_in_subset = np.argmax(current_uncertainty)
    query_index_full = current_indices_list[most_uncertain_idx_in_subset]
    query_uncertainty_score = uncertainty_scores[query_index_full]
 
    # --- B. HUMAN ANNOTATION SIMULATION ---
    # This is the single critical step where the human annotator intervenes.
    # We look up the true label (y_unlabeled_full) for the sample the model asked for.
    X_query = X_unlabeled_full[query_index_full].reshape(1, -1)
    y_query = np.array([y_unlabeled_full[query_index_full]])
    
    # Update the Labeled Set: Add the new annotated sample (N becomes N+1)
    X_labeled_current = np.vstack([X_labeled_current, X_query])
    y_labeled_current = np.hstack([y_labeled_current, y_query])
    # Remove the sample from the unlabeled pool
    unlabeled_indices_set.remove(query_index_full)
    
    # --- C. RETRAIN and EVALUATE ---
    # Train the NEW model on the larger, improved labeled set
    current_model = LogisticRegression(random_state=SEED, max_iter=2000)
    current_model.fit(X_labeled_current, y_labeled_current)
 
    # Evaluate the new model on the held-out test set
    y_pred = current_model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    
    # Record results for plotting
    labeled_size_history.append(len(y_labeled_current))
    accuracy_history.append(accuracy)
 
    # Output status
    print(f"\nQUERY {i+1}: Labeled Samples: {len(y_labeled_current)}")
    print(f"  > Test Accuracy: {accuracy:.4f}")
    print(f"  > Uncertainty Score: {query_uncertainty_score:.4f}")

Final result and interpretation

After the simulated 20 queries the model improves noticeably: in the example run, test accuracy rose from 0.8800 to 0.9100 after labeling only 20 carefully chosen samples (the labeled set grew from 90 to 110). The active learner acts as an efficient curator of the labeling budget: each annotation delivers more value than a randomly chosen one.
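
To check that claim on your own run, you can repeat the same budget with random queries and compare. The sketch below assumes the variables defined earlier (X_pool, y_pool, X_test, y_test and the constants) and re-creates the same initial 10% seed so both strategies start from identical labeled sets:

# Baseline for comparison: spend the same 20-query budget on RANDOM samples.
rng = np.random.default_rng(SEED)
X_lab, X_unlab, y_lab, y_unlab = train_test_split(
    X_pool, y_pool, test_size=1.0 - INITIAL_LABELED_PERCENTAGE,
    random_state=SEED, stratify=y_pool
)
pool_idx = list(range(X_unlab.shape[0]))
random_accuracy = []
for _ in range(NUM_QUERIES):
    pick = pool_idx.pop(rng.integers(len(pool_idx)))  # random, not uncertainty-based
    X_lab = np.vstack([X_lab, X_unlab[pick].reshape(1, -1)])
    y_lab = np.hstack([y_lab, [y_unlab[pick]]])
    model = LogisticRegression(random_state=SEED, max_iter=2000)
    model.fit(X_lab, y_lab)
    random_accuracy.append(accuracy_score(y_test, model.predict(X_test)))
print(f"Random sampling, final accuracy: {random_accuracy[-1]:.4f}")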

Plotting learning progress

plt.figure(figsize=(10, 6))
plt.plot(labeled_size_history, accuracy_history, marker='o', linestyle='-', color='#00796b', label='Active Learning (Least Confidence)')
plt.axhline(y=accuracy_history[-1], color='red', linestyle='--', alpha=0.5, label='Final Accuracy')
plt.title('Active Learning: Accuracy vs. Number of Labeled Samples')
plt.xlabel('Number of Labeled Samples')
plt.ylabel('Test Set Accuracy')
plt.grid(True, linestyle='--', alpha=0.7)
plt.legend()
plt.tight_layout()
plt.show()
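
If you also ran the random-sampling sketch above, you can overlay its curve for a direct comparison. Add this line just before plt.legend() in the block above; random_accuracy holds one point per query, so it pairs with labeled_size_history[1:]:

# Overlay the random-sampling baseline from the earlier sketch (insert before plt.legend())
plt.plot(labeled_size_history[1:], random_accuracy, marker='x', linestyle=':', color='gray', label='Random Sampling')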

Active learning is a practical technique when labeling budgets are limited: by asking for the right labels, you can build models that approach fully supervised performance with far fewer annotations.
