Yandex Launches Alchemist: A Precision Dataset to Boost Text-to-Image Model Quality

Yandex has launched Alchemist, a compact supervised fine-tuning dataset that significantly improves text-to-image model quality by selecting high-impact image-text pairs using a novel model-guided approach.

Challenges in Text-to-Image Generation

Despite notable advancements from models like DALL-E 3, Imagen 3, and Stable Diffusion 3, producing consistently high-quality text-to-image (T2I) outputs remains challenging. Large-scale pretraining provides broad knowledge but falls short in achieving superior aesthetic quality and precise alignment with prompts. Supervised fine-tuning (SFT) is crucial to enhance model performance, yet its success heavily depends on the quality of the fine-tuning dataset.

Limitations of Existing Datasets

Most publicly available datasets for SFT focus on narrow domains such as anime or specific art styles, or they employ simplistic heuristic filters on massive web-scraped data. Human curation is expensive, does not scale well, and often misses samples that could yield the most significant improvements. Additionally, many recent T2I models rely on proprietary datasets, limiting transparency and reproducibility.

Introducing Alchemist: A Model-Guided Dataset

Yandex has addressed these issues by releasing Alchemist, a compact, general-purpose SFT dataset containing 3,350 carefully selected image-text pairs. Unlike traditional datasets, Alchemist is curated using a novel methodology that leverages a pre-trained diffusion model as a sample quality estimator. This model-guided approach selects training samples that maximize generative performance without subjective human labels or basic aesthetic scoring.

Dataset Construction and Filtering Pipeline

Alchemist's construction starts from approximately 10 billion web images and follows a multi-stage filtering process:

  • Initial Filtering: Removal of NSFW content and low-resolution images below 1024×1024 pixels.

  • Coarse Quality Filtering: Classifiers trained on standard image quality datasets (like KonIQ-10k and PIPAL) exclude images with compression artifacts, motion blur, watermarks, and other defects.

  • Deduplication and IQA-Based Pruning: SIFT-like feature clustering removes near-duplicate images; the remaining candidates are then scored with the TOPIQ model, and only the highest-quality ones are retained.

  • Diffusion-Based Selection: A key innovation uses cross-attention activations from a pre-trained diffusion model to rank images. The scoring prioritizes images with visual complexity, aesthetic appeal, and stylistic richness, selecting the most impactful samples for fine-tuning.

  • Caption Rewriting: Final images are re-captioned by a fine-tuned vision-language model to generate prompt-style descriptions, enhancing alignment and usability in fine-tuning.

Ablation studies revealed that increasing dataset size beyond 3,350 samples (e.g., to 7,000 or 19,000) reduces fine-tuned model quality, emphasizing the importance of data quality over quantity.
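The filtering funnel described above can be sketched as a sequence of hard filters followed by a final ranking step. This is a minimal illustrative sketch, not Yandex's actual pipeline: the `Sample` fields, thresholds, and scorer names are assumptions standing in for real NSFW/defect classifiers, TOPIQ scoring, and the diffusion-based estimator.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    width: int
    height: int
    nsfw_prob: float        # output of a hypothetical NSFW classifier
    defect_prob: float      # compression artifacts, blur, watermarks, ...
    iqa_score: float        # e.g. a TOPIQ-style quality score in [0, 1]
    diffusion_score: float  # cross-attention-based quality estimate

def passes_initial_filter(s: Sample) -> bool:
    # Stage 1: drop NSFW content and images below 1024x1024.
    return s.nsfw_prob < 0.1 and min(s.width, s.height) >= 1024

def passes_coarse_quality(s: Sample) -> bool:
    # Stage 2: classifier-based defect filtering.
    return s.defect_prob < 0.2

def select(samples, target_size=3350):
    # Stages 1-3 act as hard filters; the final stage ranks survivors
    # by the diffusion-based score and keeps only the top `target_size`.
    survivors = [
        s for s in samples
        if passes_initial_filter(s)
        and passes_coarse_quality(s)
        and s.iqa_score > 0.5  # stage 3: IQA-based pruning
    ]
    survivors.sort(key=lambda s: s.diffusion_score, reverse=True)
    return survivors[:target_size]
```

The key design point mirrored here is that the diffusion-based score is used for ranking rather than thresholding, which is what lets the pipeline cut 10 billion candidates down to a fixed, small budget of the highest-impact samples.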

Evaluation Across Multiple Stable Diffusion Models

Alchemist was tested on five Stable Diffusion variants (SD1.5, SD2.1, SDXL, SD3.5 Medium, and SD3.5 Large), comparing fine-tuning on Alchemist, a size-matched LAION-Aesthetics v2 subset, and baseline models.

  • Human Evaluation: Experts rated text-image relevance, aesthetic quality, image complexity, and fidelity. Models fine-tuned with Alchemist showed significant gains in aesthetics and complexity (12–20% improvement), outperforming both baselines and LAION-Aesthetics-tuned models, while maintaining stable prompt alignment.

  • Automated Metrics: Scores such as FD-DINOv2, CLIP Score, ImageReward, and HPS-v2 consistently favored Alchemist-tuned models over counterparts.

  • Dataset Size Impact: Larger Alchemist variants led to decreased performance, confirming the value of strict filtering and high sample quality.
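Of the automated metrics above, CLIP Score is the simplest to illustrate: it is the cosine similarity between an image embedding and a text embedding, conventionally scaled by 100. The toy sketch below uses stand-in NumPy vectors in place of real CLIP encoder outputs; an actual evaluation would embed images and prompts with a CLIP model first.

```python
import numpy as np

def clip_score(image_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Cosine similarity between two embeddings, scaled by 100.

    The inputs stand in for the outputs of CLIP's image and text
    encoders; higher values indicate better image-text alignment.
    """
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_emb = text_emb / np.linalg.norm(text_emb)
    return float(100.0 * image_emb @ text_emb)
```

For example, a perfectly aligned pair (identical embeddings) scores 100, while orthogonal embeddings score 0.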

Practical Applications and Impact

Yandex has already employed Alchemist to train its proprietary text-to-image model, YandexART v2.5, and plans to use it for future updates. Alchemist sets a new benchmark for SFT datasets by focusing on selective, high-quality samples and offering an open, transparent resource for the research community.

This approach highlights a replicable path to improving T2I model outputs, especially in perceptual qualities like aesthetics and image complexity, while acknowledging the modest fidelity trade-offs observed in the newer models.

For more details, explore the paper and the Alchemist dataset available on Hugging Face.
