Train Powerful Supervised Models with Active Learning and Just a Few Labels

Supervised models usually need a large amount of labeled data, but manual labeling is expensive and slow. Active Learning lets the model choose which samples are worth labeling, which can substantially cut annotation costs while preserving or even improving quality.

The problem of scarce labeled data

Manually labeling thousands of examples takes a lot of time and resources. Real projects often have a large pool of unlabeled data and only a small labeled subset. Instead of labeling at random or in bulk, Active Learning directs human effort to where it has the greatest effect.

What Active Learning is and how it helps

Active Learning is a strategy in which the model being trained asks an oracle (a human annotator) for the labels of the most informative examples. The model iteratively picks the examples it is most uncertain about, requests their labels, and retrains. This interactive loop speeds up learning and reduces the number of annotations required.
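To make "most uncertain" concrete, here is a minimal sketch of the least-confidence score on a toy probability matrix. The helper name least_confidence and the three probability rows are hypothetical values chosen for illustration; they are not taken from the experiment in this article.

```python
import numpy as np

def least_confidence(probabilities):
    """Return 1 - (highest class probability) for each row."""
    probabilities = np.asarray(probabilities, dtype=float)
    return 1.0 - probabilities.max(axis=1)

# Hypothetical predicted class probabilities for three unlabeled samples
toy_probabilities = np.array([
    [0.95, 0.05],  # the model is confident
    [0.55, 0.45],  # close to the decision boundary
    [0.20, 0.80],  # fairly confident
])

scores = least_confidence(toy_probabilities)
query_position = int(np.argmax(scores))  # row the model would ask about

print(scores)
print(query_position)  # 1
```

The second row wins because its top probability (0.55) is the lowest of the three, so a label for it is expected to tell the model the most.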

A typical Active Learning workflow

Start with a small labeled sample and train a weak initial model.
Generate predictions and confidence scores on the unlabeled pool.
Request labels for the most uncertain examples and add them to the training set.
Retrain the model and repeat the loop until the budget runs out or quality stops improving.
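The last step names two stopping conditions: an exhausted budget and stagnating quality. One possible way to express them is sketched below. The function should_stop and its patience and min_delta parameters are assumptions introduced here for illustration, not part of the article's code, and in practice the accuracy history would ideally come from a separate validation split rather than the final test set.

```python
def should_stop(accuracy_history, queries_used, budget, patience=5, min_delta=0.005):
    """Return True when the labeling budget is spent or accuracy has stalled.

    Accuracy counts as stalled when the best value in the last `patience`
    rounds beats the best earlier value by less than `min_delta`.
    """
    if queries_used >= budget:
        return True
    if len(accuracy_history) <= patience:
        return False  # not enough history to judge stagnation yet
    best_before = max(accuracy_history[:-patience])
    best_recent = max(accuracy_history[-patience:])
    return (best_recent - best_before) < min_delta

# Budget exhausted
print(should_stop([0.80, 0.81], queries_used=20, budget=20))  # True
# Still improving, budget left
print(should_stop([0.80, 0.82, 0.84, 0.86, 0.88, 0.90],
                  queries_used=5, budget=20, patience=3))     # False
```

A check like this would sit at the top of each loop iteration, before the next query is issued.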

Below is an example approach in scikit-learn using logistic regression and the "least confidence" strategy.

Installing and importing the libraries

pip install numpy pandas scikit-learn matplotlib

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

Experiment parameters

SEED = 42 # For reproducibility
N_SAMPLES = 1000 # Total number of data points
INITIAL_LABELED_PERCENTAGE = 0.10 # Fraction of the pool that starts out labeled (10%)
NUM_QUERIES = 20 # Number of times we ask the "human" to label a confusing sample

In this simulation, NUM_QUERIES = 20 is the labeling budget: the model requests labels for the 20 most uncertain samples, one per iteration, and we simulate the human annotator by automatically revealing the true labels.
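The set sizes implied by these parameters can be checked with a little arithmetic. In the sketch below, TEST_FRACTION is a name introduced here for the 0.10 test split that the article's code hard-codes; the other constants mirror the parameters above.

```python
N_SAMPLES = 1000
TEST_FRACTION = 0.10               # assumed name for the hard-coded test split
INITIAL_LABELED_PERCENTAGE = 0.10
NUM_QUERIES = 20

n_test = round(N_SAMPLES * TEST_FRACTION)               # held-out test set
n_pool = N_SAMPLES - n_test                             # everything else
n_initial = round(n_pool * INITIAL_LABELED_PERCENTAGE)  # labeled at the start
n_unlabeled = n_pool - n_initial                        # available for queries
n_final = n_initial + NUM_QUERIES                       # labeled after the loop

print(n_test, n_pool, n_initial, n_unlabeled, n_final)  # 100 900 90 810 110
print(f"{n_final / n_pool:.1%} of the pool is labeled at the end")  # 12.2%
```

So the whole experiment uses labels for 110 of the 900 pool samples, and the 20 queries are drawn from 810 candidates.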

Generating the data and splitting it for Active Learning

X, y = make_classification(
    n_samples=N_SAMPLES, n_features=10, n_informative=5, n_redundant=0,
    n_classes=2, n_clusters_per_class=1, flip_y=0.1, random_state=SEED
)
 
# 1. Split into 90% Pool (samples to be queried) and 10% Test (final evaluation)
X_pool, X_test, y_pool, y_test = train_test_split(
    X, y, test_size=0.10, random_state=SEED, stratify=y
)
 
# 2. Split the 90% Pool into Initial Labeled (10% of the pool) and Unlabeled (90% of the pool)
X_labeled_current, X_unlabeled_full, y_labeled_current, y_unlabeled_full = train_test_split(
    X_pool, y_pool, test_size=1.0 - INITIAL_LABELED_PERCENTAGE,
    random_state=SEED, stratify=y_pool
)
 
# A set to track indices in the unlabeled pool for efficient querying and removal
unlabeled_indices_set = set(range(X_unlabeled_full.shape[0]))
 
print(f"Initial Labeled Samples (STARTING N): {len(y_labeled_current)}")
print(f"Unlabeled Pool Samples: {len(unlabeled_indices_set)}")

Initial training and baseline evaluation

labeled_size_history = []
accuracy_history = []
 
# Train the baseline model on the small initial labeled set
baseline_model = LogisticRegression(random_state=SEED, max_iter=2000)
baseline_model.fit(X_labeled_current, y_labeled_current)
 
# Evaluate performance on the held-out test set
y_pred_init = baseline_model.predict(X_test)
accuracy_init = accuracy_score(y_test, y_pred_init)
 
# Record the baseline point (x=90, y=0.8800)
labeled_size_history.append(len(y_labeled_current))
accuracy_history.append(accuracy_init)
 
print(f"INITIAL BASELINE (N={labeled_size_history[0]}): Test Accuracy: {accuracy_history[0]:.4f}")

The Active Learning loop: query, annotate, retrain

current_model = baseline_model # Start the loop with the baseline model
 
print(f"\nStarting Active Learning Loop ({NUM_QUERIES} Queries)...")
 
# -----------------------------------------------
# The Active Learning Loop (Query, Annotate, Retrain, Evaluate)
# Purpose: Run 20 iterations to demonstrate strategic labeling gains.
# -----------------------------------------------
for i in range(NUM_QUERIES):
    if not unlabeled_indices_set:
        print("Unlabeled pool is empty. Stopping.")
        break
    
    # --- A. QUERY STRATEGY: Find the Least Confident Sample ---
    # 1. Get probability predictions from the CURRENT model for all unlabeled samples
    probabilities = current_model.predict_proba(X_unlabeled_full)
    max_probabilities = np.max(probabilities, axis=1)
 
    # 2. Calculate Uncertainty Score (1 - Max Confidence)
    uncertainty_scores = 1 - max_probabilities
 
    # 3. Identify the index of the sample with the MAXIMUM uncertainty score
    current_indices_list = list(unlabeled_indices_set)
    current_uncertainty = uncertainty_scores[current_indices_list]
    most_uncertain_idx_in_subset = np.argmax(current_uncertainty)
    query_index_full = current_indices_list[most_uncertain_idx_in_subset]
    query_uncertainty_score = uncertainty_scores[query_index_full]
 
    # --- B. HUMAN ANNOTATION SIMULATION ---
    # This is the single critical step where the human annotator intervenes.
    # We look up the true label (y_unlabeled_full) for the sample the model asked for.
    X_query = X_unlabeled_full[query_index_full].reshape(1, -1)
    y_query = np.array([y_unlabeled_full[query_index_full]])
    
    # Update the Labeled Set: Add the new annotated sample (N becomes N+1)
    X_labeled_current = np.vstack([X_labeled_current, X_query])
    y_labeled_current = np.hstack([y_labeled_current, y_query])
    # Remove the sample from the unlabeled pool
    unlabeled_indices_set.remove(query_index_full)
    
    # --- C. RETRAIN and EVALUATE ---
    # Train the NEW model on the larger, improved labeled set
    current_model = LogisticRegression(random_state=SEED, max_iter=2000)
    current_model.fit(X_labeled_current, y_labeled_current)
 
    # Evaluate the new model on the held-out test set
    y_pred = current_model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    
    # Record results for plotting
    labeled_size_history.append(len(y_labeled_current))
    accuracy_history.append(accuracy)
 
    # Output status
    print(f"\nQUERY {i+1}: Labeled Samples: {len(y_labeled_current)}")
    print(f"  > Test Accuracy: {accuracy:.4f}")
    print(f"  > Uncertainty Score: {query_uncertainty_score:.4f}")

Final observation

In this example, after 20 queries the test accuracy rose from 0.8800 to 0.9100 with only 20 targeted labels added (90 → 110 training examples). This is consistent with the idea that a deliberate query strategy makes better use of a labeling budget than plain random labeling.
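One way to check the comparison with random labeling directly is to run both strategies from the same starting point. The following is a minimal sketch under that assumption: it regenerates the same synthetic data, and the run_strategy helper, the strategy names, and the boolean availability mask are illustrative choices made here rather than part of the original code. Results depend on the seed and on the library version, so no particular outcome is claimed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

SEED = 42
NUM_QUERIES = 20

X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, n_redundant=0,
    n_classes=2, n_clusters_per_class=1, flip_y=0.1, random_state=SEED
)
X_pool, X_test, y_pool, y_test = train_test_split(
    X, y, test_size=0.10, random_state=SEED, stratify=y
)
X_init, X_unlabeled, y_init, y_unlabeled = train_test_split(
    X_pool, y_pool, test_size=0.90, random_state=SEED, stratify=y_pool
)

def fit_and_score(X_lab, y_lab):
    """Fit a fresh model and return it with its test accuracy."""
    model = LogisticRegression(random_state=SEED, max_iter=2000).fit(X_lab, y_lab)
    return model, accuracy_score(y_test, model.predict(X_test))

def run_strategy(strategy, rng):
    """Run NUM_QUERIES single-sample queries; return accuracy after each fit."""
    X_lab, y_lab = X_init.copy(), y_init.copy()
    available = np.ones(len(y_unlabeled), dtype=bool)  # True = still unlabeled
    model, acc = fit_and_score(X_lab, y_lab)
    history = [acc]
    for _ in range(NUM_QUERIES):
        candidates = np.flatnonzero(available)
        if strategy == "least_confidence":
            proba = model.predict_proba(X_unlabeled[candidates])
            pick = candidates[np.argmax(1.0 - proba.max(axis=1))]
        else:  # "random": pick any remaining sample uniformly
            pick = rng.choice(candidates)
        X_lab = np.vstack([X_lab, X_unlabeled[pick]])
        y_lab = np.append(y_lab, y_unlabeled[pick])
        available[pick] = False
        model, acc = fit_and_score(X_lab, y_lab)
        history.append(acc)
    return history

rng = np.random.default_rng(SEED)
al_history = run_strategy("least_confidence", rng)
random_history = run_strategy("random", rng)

print(f"Least confidence: {al_history[0]:.4f} -> {al_history[-1]:.4f}")
print(f"Random sampling:  {random_history[0]:.4f} -> {random_history[-1]:.4f}")
```

A single run with 20 queries is noisy, so averaging over several seeds would give a fairer picture than one pair of curves.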

Plotting the results

plt.figure(figsize=(10, 6))
plt.plot(labeled_size_history, accuracy_history, marker='o', linestyle='-', color='#00796b', label='Active Learning (Least Confidence)')
# final_accuracy was never defined above; use the last recorded accuracy instead
plt.axhline(y=accuracy_history[-1], color='red', linestyle='--', alpha=0.5, label='Final Accuracy')
plt.title('Active Learning: Accuracy vs. Number of Labeled Samples')
plt.xlabel('Number of Labeled Samples')
plt.ylabel('Test Set Accuracy')
plt.grid(True, linestyle='--', alpha=0.7)
plt.legend()
plt.tight_layout()
plt.show()

Active Learning lets you build efficient labeling strategies: with a limited annotation budget, you get the most out of it by requesting labels only for the examples that actually improve the model.
