Train a ChatGPT-Style Model in ~4 Hours for ~$100 with Karpathy's nanochat

What nanochat is

Andrej Karpathy open-sourced nanochat, a compact, dependency-light repository that implements a full ChatGPT-style stack end-to-end. The codebase covers tokenizer training, base pretraining, mid-training on conversational/multiple-choice/tool-use data, Supervised Finetuning (SFT), optional reinforcement learning on GSM8K, evaluation, and serving with both a CLI and a simple ChatGPT-like web UI.

One-script speedrun and cost

The repo includes a single-script “speedrun” that runs the entire loop: tokenization, base pretraining, mid-training, SFT, optional RL, evaluation, and serving. The recommended hardware is an 8×H100 node; at roughly $24/hour for the node, the run completes in about 4 hours and costs roughly $100 of compute. A post-run report.md summarizes metrics such as CORE, ARC-E/C, MMLU, GSM8K, HumanEval, and ChatCORE.
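As a quick sanity check, the quoted price follows directly from the node rate and the reported wall-clock time. This is a back-of-the-envelope sketch using the article's figures, not numbers read from the repo:

```python
# Back-of-the-envelope cost check for the speedrun (figures from the article,
# not from the repo itself): one 8xH100 node at ~$24/hour for ~4 hours.
node_rate_usd_per_hour = 24.0   # quoted rental rate for the 8xH100 node
wall_clock_hours = 3 + 51 / 60  # reported wall-clock of ~3h51m

estimated_cost = node_rate_usd_per_hour * wall_clock_hours
print(f"estimated compute cost: ${estimated_cost:.0f}")  # ~ $92, i.e. "about $100"
```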

Tokenizer and data pipeline

nanochat uses a custom Rust BPE tokenizer (built via Maturin) with a 65,536-token vocabulary. Tokenizer training is done on FineWeb-EDU shards that the repo repackages and shuffles for easy access. The walkthrough reports a compression ratio of about 4.8 characters per token and compares the tokenizer’s behavior against the GPT-2 and GPT-4 tokenizers.
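To make the compression figure concrete, here is a minimal sketch of how one could measure characters per token for a trained BPE tokenizer. It uses the Hugging Face tokenizers package and placeholder file names as stand-ins; nanochat's own Rust/Maturin tokenizer exposes its own interface:

```python
# Sketch: measuring characters-per-token compression for a trained BPE tokenizer.
# Uses the Hugging Face `tokenizers` package as a stand-in for nanochat's Rust
# tokenizer; both file paths below are placeholders.
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")  # hypothetical path to a trained tokenizer

sample = open("fineweb_edu_sample.txt", encoding="utf-8").read()  # placeholder eval text
ids = tok.encode(sample).ids

print(f"vocab size:      {tok.get_vocab_size()}")        # 65,536 in nanochat
print(f"chars per token: {len(sample) / len(ids):.2f}")  # walkthrough reports ~4.8
```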

An evaluation bundle (downloaded to ~/.cache/nanochat/eval_bundle) provides a curated set of 22 autocompletion-style datasets used for CORE-style evaluation, including HellaSwag, ARC, and BoolQ.
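As a rough illustration of what a CORE-style aggregate looks like, the sketch below centers each task's accuracy against a random-guessing baseline and averages across tasks. This captures only the general idea, with made-up numbers; the exact CORE definition and task list live in the eval bundle and its evaluation code:

```python
# Rough sketch of a CORE-style aggregate: center each task's accuracy against
# its random-guessing baseline, then average across tasks. Illustration only;
# the precise CORE formula is defined by the eval bundle, not here.
task_results = {
    # task name: (model accuracy, random-guess baseline) -- illustrative numbers
    "hellaswag": (0.35, 0.25),
    "arc_easy":  (0.55, 0.25),
    "boolq":     (0.62, 0.50),
}

def centered(acc: float, baseline: float) -> float:
    """Rescale accuracy so random guessing maps to 0 and a perfect score to 1."""
    return (acc - baseline) / (1.0 - baseline)

core_like = sum(centered(a, b) for a, b in task_results.values()) / len(task_results)
print(f"CORE-like score over {len(task_results)} tasks: {core_like:.4f}")
```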

Model, scaling and the speedrun target

The speedrun config trains a depth-20 Transformer (≈560M parameters, with 1280 hidden channels and 10 attention heads of dimension 128) on roughly 11.2B tokens, in line with Chinchilla-style scaling (params × ~20 tokens). The implementation uses Muon for matmul parameters and AdamW for embeddings/unembeddings. Loss is reported in bits per byte (bpb) to stay tokenizer-invariant. Karpathy estimates the speedrun model’s total training compute at roughly 4e19 FLOPs.
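Both the token budget and the FLOPs figure can be sanity-checked with standard rules of thumb (Chinchilla's ~20 tokens per parameter and the 6·N·D training-FLOPs estimate), using the article's numbers:

```python
# Sanity-checking the speedrun numbers with standard rules of thumb
# (the article's figures, not values read from the repo config).
n_params = 560e6          # ~560M parameters for the depth-20 model
chinchilla_ratio = 20     # ~20 training tokens per parameter

tokens = n_params * chinchilla_ratio          # ~1.12e10 -> the quoted ~11.2B tokens
train_flops = 6 * n_params * tokens           # standard 6*N*D training-FLOPs estimate

print(f"training tokens: {tokens:.3e}")       # ~1.120e+10
print(f"training FLOPs:  {train_flops:.2e}")  # ~3.76e+19, i.e. roughly the quoted ~4e19
```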

Mid-training, SFT and tool use

After base pretraining the pipeline performs mid-training to adapt the model to conversational formats (SmolTalk), to teach multiple-choice behavior (using 100K MMLU auxiliary-train questions), and to seed tool use via explicit <|python_start|>…<|python_end|> blocks. A small slice of GSM8K is included to bootstrap calculator-style behavior. The default mid-training mixture in the speedrun is SmolTalk (460K rows), MMLU aux-train (100K), and GSM8K main (8K), totaling 568K rows.
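The sketch below shows the mixture proportions and what a rendered tool-use example might look like with the special tokens mentioned above. The row counts come from the walkthrough; the conversation rendering is a simplified illustration, not nanochat's exact chat schema:

```python
# Illustrative mid-training mixture and a rendered tool-use example.
# Row counts are from the walkthrough; the rendering below is a simplified
# sketch, not nanochat's actual chat format.
mixture = {
    "smoltalk": 460_000,        # conversational data
    "mmlu_aux_train": 100_000,  # multiple-choice behavior
    "gsm8k_main": 8_000,        # seeds calculator-style tool use
}
assert sum(mixture.values()) == 568_000

# One GSM8K-style assistant turn in which the model is taught to call the
# Python interpreter via the special tokens and then use the tool's output.
assistant_turn = (
    "The farmer has 3 pens with 12 chickens each. "
    "<|python_start|>print(3 * 12)<|python_end|> "
    "So the farmer has 36 chickens."
)
```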

SFT fine-tunes the model on higher-quality conversations, formatting rows the way they appear at inference time (each conversation padded rather than concatenated) to reduce train/test mismatch. Example post-SFT metrics for the speedrun tier include ARC-Easy 0.3876, ARC-Challenge 0.2807, MMLU 0.3151, GSM8K 0.0455, HumanEval 0.0854, and ChatCORE 0.0884.
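A minimal sketch of the padded, non-concatenated batching idea: each conversation stays its own row, padded to the longest row in the batch, with the padding masked out of the loss. The token IDs and pad id here are placeholders, not nanochat's:

```python
# Sketch: pad each conversation to the longest row in the batch instead of
# concatenating/packing rows, and mask padding out of the loss.
# Token ids and PAD_ID are placeholders.
import torch

PAD_ID = 0
conversations = [  # already-tokenized conversations of different lengths
    [5, 17, 42, 9],
    [7, 3],
    [11, 12, 13, 14, 15],
]

max_len = max(len(c) for c in conversations)
input_ids = torch.full((len(conversations), max_len), PAD_ID, dtype=torch.long)
loss_mask = torch.zeros_like(input_ids, dtype=torch.bool)

for i, conv in enumerate(conversations):
    input_ids[i, : len(conv)] = torch.tensor(conv)
    loss_mask[i, : len(conv)] = True  # only real tokens contribute to the loss
```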

Tool use is implemented end-to-end: a custom Engine manages the KV cache, handles prefill/decode inference, and provides a simple sandboxed Python interpreter used by both training and evaluation flows.
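The prefill/decode split behind such an engine is easy to sketch: prefill runs the whole prompt once and fills the KV cache, then decode feeds one token at a time while reusing it. The model interface below is hypothetical, not nanochat's Engine API:

```python
# Sketch of prefill/decode generation with a KV cache. The model interface
# used here (forward(tokens, kv_cache) -> (logits, kv_cache)) is hypothetical;
# nanochat's Engine exposes its own API.
def generate(model, prompt_tokens, max_new_tokens, sample_fn):
    # Prefill: run the full prompt once, populating the KV cache.
    logits, kv_cache = model.forward(prompt_tokens, kv_cache=None)
    next_token = sample_fn(logits[-1])

    out = list(prompt_tokens) + [next_token]
    # Decode: feed one new token at a time, reusing cached keys/values.
    for _ in range(max_new_tokens - 1):
        logits, kv_cache = model.forward([next_token], kv_cache=kv_cache)
        next_token = sample_fn(logits[-1])
        out.append(next_token)
    return out
```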

Optional RL on GSM8K (simplified GRPO)

An optional final stage applies reinforcement learning on GSM8K with a simplified GRPO routine. The walkthrough lists deliberate omissions relative to canonical PPO-style RLHF: no trust region via a reference model, no KL penalties, on-policy updates without PPO ratios/clipping, token-level GAPO-style normalization, and a mean-shift advantage. In practice the routine behaves similarly to REINFORCE while keeping group-relative advantage calculations. Scripts such as scripts.chat_rl and scripts.chat_eval -i rl -a GSM8K demonstrate the loop.
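A hedged sketch of the group-relative, mean-shift advantage idea described above: sample several completions per prompt, score them, subtract the group-mean reward, and weight a plain log-prob loss by that advantage with token-level normalization. This illustrates the recipe (no reference model, no KL penalty, no PPO ratio/clipping), but it is not nanochat's scripts.chat_rl code:

```python
# Sketch of a simplified GRPO/REINFORCE-style update on GSM8K-like data:
# group-relative, mean-shift advantages and a plain log-prob policy loss.
# Illustration only, not nanochat's scripts.chat_rl implementation.
import torch

def grpo_like_loss(logprobs, rewards):
    """
    logprobs: (G, T) per-token log-probs of G sampled completions for one
              prompt (padding positions already zeroed out).
    rewards:  (G,) scalar reward per completion, e.g. 1.0 if the GSM8K
              answer is correct, else 0.0.
    """
    advantages = rewards - rewards.mean()           # mean shift only
    per_token = -(advantages[:, None] * logprobs)   # REINFORCE-style weighting
    return per_token.sum() / logprobs.numel()       # token-level normalization

# Toy usage with made-up numbers:
logprobs = torch.randn(4, 8)                  # 4 completions, 8 tokens each
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])  # 2 of 4 completions correct
loss = grpo_like_loss(logprobs, rewards)
```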

Evaluation snapshot and scaling tiers

A sample report.md for the ~$100 / ~4-hour speedrun shows CORE 0.2219 after base pretraining; after mid-training and SFT, ARC-Easy reaches ~0.3876, ARC-Challenge ~0.2807, MMLU ~0.3151, GSM8K ~0.0455, HumanEval ~0.0854, and ChatCORE ~0.0884. Wall-clock time for that run was about 3h51m.

The README sketches larger scaling targets: a ~$300 tier (d=26, ~12 hours) expected to slightly surpass GPT-2 on CORE, and a ~$1,000 tier (~41.6 hours) aimed at materially better coherence and basic reasoning/coding ability. Earlier, longer experimental runs (d=30 for ~24 hours) showed stronger scores on MMLU, ARC-Easy, and GSM8K.

Why this matters

nanochat sits in a practical middle ground: a single, clean, ~8k LOC repository that makes a full, reproducible ChatGPT-style pipeline hackable and runnable on a single multi-GPU node. It provides a transparent training path from tokenizer to web UI, example metric reports, and clear scaling options for those who want to explore end-to-end LLM training without a large distributed setup.