Build a Mini‑GPT with Tinygrad: Hands‑On Transformer Internals from Scratch
A step-by-step Tinygrad guide that builds tensor operations, attention, transformer blocks, and a mini-GPT from scratch, with full code and experiments on training and kernel fusion.
Getting started with Tinygrad
This tutorial walks through implementing the core building blocks of modern transformers using Tinygrad. It stays hands-on throughout: we work directly with tensors, autograd, attention, and transformer block internals, progressively building every component from low-level tensor ops to a working mini-GPT model.
Part 1 — Tensor operations and autograd
We begin by setting up Tinygrad and exercising basic tensor operations and backpropagation. The short code example below demonstrates matrix multiplication, elementwise ops and gradient flow through a small computation graph.
import subprocess, sys, os
print("Installing dependencies...")
subprocess.check_call(["apt-get", "install", "-qq", "clang"], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "git+https://github.com/tinygrad/tinygrad.git"])
import numpy as np
from tinygrad import Tensor, nn, Device
from tinygrad.nn import optim
import time
print(f" Using device: {Device.DEFAULT}")
print("=" * 60)
print("\n PART 1: Tensor Operations & Autograd")
print("-" * 60)
x = Tensor([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)
y = Tensor([[2.0, 0.0], [1.0, 2.0]], requires_grad=True)
z = (x @ y).sum() + (x ** 2).mean()
z.backward()
print(f"x:\n{x.numpy()}")
print(f"y:\n{y.numpy()}")
print(f"z (scalar): {z.numpy()}")
print(f"∂z/∂x:\n{x.grad.numpy()}")
print(f"∂z/∂y:\n{y.grad.numpy()}")Running this reveals how Tinygrad constructs a computation graph and propagates gradients through matrix operations. Examining the printed gradients gives intuition about backpropagation mechanics.
Part 2 — Building custom layers
Next we implement multi‑head attention and a transformer block from raw tensor operations. These classes show the projections for q/k/v, attention score computation, softmax, feedforward layers and a simple layer normalization implementation.
print("\n\n PART 2: Building Custom Layers")
print("-" * 60)
class MultiHeadAttention:
def __init__(self, dim, num_heads):
self.num_heads = num_heads
self.dim = dim
self.head_dim = dim // num_heads
self.qkv = Tensor.glorot_uniform(dim, 3 * dim)
self.out = Tensor.glorot_uniform(dim, dim)
def __call__(self, x):
B, T, C = x.shape[0], x.shape[1], x.shape[2]
qkv = x.reshape(B * T, C).dot(self.qkv).reshape(B, T, 3, self.num_heads, self.head_dim)
        q, k, v = [qkv[:, :, i].transpose(1, 2) for i in range(3)]  # each (B, num_heads, T, head_dim)
        scale = self.head_dim ** -0.5
        attn = (q @ k.transpose(-2, -1)) * scale  # attention scores over time: (B, num_heads, T, T)
        attn = attn.softmax(axis=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, T, C)  # merge heads back into (B, T, C)
        return out.reshape(B * T, C).dot(self.out).reshape(B, T, C)
class TransformerBlock:
def __init__(self, dim, num_heads):
self.attn = MultiHeadAttention(dim, num_heads)
self.ff1 = Tensor.glorot_uniform(dim, 4 * dim)
self.ff2 = Tensor.glorot_uniform(4 * dim, dim)
self.ln1_w = Tensor.ones(dim)
self.ln2_w = Tensor.ones(dim)
def __call__(self, x):
x = x + self.attn(self._layernorm(x, self.ln1_w))
ff = x.reshape(-1, x.shape[-1])
ff = ff.dot(self.ff1).gelu().dot(self.ff2)
x = x + ff.reshape(x.shape)
return self._layernorm(x, self.ln2_w)
def _layernorm(self, x, w):
mean = x.mean(axis=-1, keepdim=True)
var = ((x - mean) ** 2).mean(axis=-1, keepdim=True)
        return w * (x - mean) / (var + 1e-5).sqrt()
Implementing these pieces from scratch clarifies the role of each subcomponent in a transformer layer and how shapes move through the projections, attention, and feedforward paths.
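To confirm those shapes, you can push a random batch through a single block and check that the residual path preserves the input shape. This quick check is an addition on top of the tutorial code and assumes dim is divisible by num_heads.
B, T, C = 2, 8, 64                      # batch, sequence length, model dim
block = TransformerBlock(dim=C, num_heads=4)
dummy = Tensor.randn(B, T, C)
out = block(dummy)
print(out.shape)                        # expected: (2, 8, 64)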
Part 3 — Mini‑GPT architecture
We then assemble a compact MiniGPT by combining token and positional embeddings, stacking transformer blocks and projecting to vocabulary logits.
print("\n PART 3: Mini-GPT Architecture")
print("-" * 60)
class MiniGPT:
def __init__(self, vocab_size=256, dim=128, num_heads=4, num_layers=2, max_len=32):
self.vocab_size = vocab_size
self.dim = dim
self.tok_emb = Tensor.glorot_uniform(vocab_size, dim)
self.pos_emb = Tensor.glorot_uniform(max_len, dim)
self.blocks = [TransformerBlock(dim, num_heads) for _ in range(num_layers)]
self.ln_f = Tensor.ones(dim)
self.head = Tensor.glorot_uniform(dim, vocab_size)
def __call__(self, idx):
B, T = idx.shape[0], idx.shape[1]
tok_emb = self.tok_emb[idx.flatten()].reshape(B, T, self.dim)
pos_emb = self.pos_emb[:T].reshape(1, T, self.dim)
x = tok_emb + pos_emb
for block in self.blocks:
x = block(x)
mean = x.mean(axis=-1, keepdim=True)
var = ((x - mean) ** 2).mean(axis=-1, keepdim=True)
x = self.ln_f * (x - mean) / (var + 1e-5).sqrt()
return x.reshape(B * T, self.dim).dot(self.head).reshape(B, T, self.vocab_size)
def get_params(self):
params = [self.tok_emb, self.pos_emb, self.ln_f, self.head]
for block in self.blocks:
params.extend([block.attn.qkv, block.attn.out, block.ff1, block.ff2, block.ln1_w, block.ln2_w])
return params
model = MiniGPT(vocab_size=256, dim=64, num_heads=4, num_layers=2, max_len=16)
params = model.get_params()
total_params = sum(p.numel() for p in params)
print(f"Model initialized with {total_params:,} parameters")This composition highlights how a small transformer is built from embedding layers, positional encodings, stacked blocks and a final linear head.
Part 4 — Training loop
The training loop below is a minimal example of fitting the MiniGPT on synthetic data, where the target at each position is the previous token in the sequence. It uses the Adam optimizer and sparse categorical cross-entropy loss.
print("\n\n PART 4: Training Loop")
print("-" * 60)
def gen_data(batch_size, seq_len):
x = np.random.randint(0, 256, (batch_size, seq_len))
y = np.roll(x, 1, axis=1)
y[:, 0] = x[:, 0]
return Tensor(x, dtype='int32'), Tensor(y, dtype='int32')
optimizer = optim.Adam(params, lr=0.001)
losses = []
print("Training to predict previous token in sequence...")
with Tensor.train():
for step in range(20):
start = time.time()
x_batch, y_batch = gen_data(batch_size=16, seq_len=16)
logits = model(x_batch)
B, T, V = logits.shape[0], logits.shape[1], logits.shape[2]
loss = logits.reshape(B * T, V).sparse_categorical_crossentropy(y_batch.reshape(B * T))
optimizer.zero_grad()
loss.backward()
optimizer.step()
losses.append(loss.numpy())
elapsed = time.time() - start
if step % 5 == 0:
print(f"Step {step:3d} | Loss: {loss.numpy():.4f} | Time: {elapsed*1000:.1f}ms")Watching loss values step down during training demonstrates that the model learns the synthetic task and verifies end to end gradients through the custom layers.
Part 5 — Lazy evaluation and kernel fusion
Tinygrad supports lazy execution and kernel fusion. The snippet below creates a computation graph that only executes when realized, allowing operations to be fused for improved performance.
print("\n\n PART 5: Lazy Evaluation & Kernel Fusion")
print("-" * 60)
N = 512
a = Tensor.randn(N, N)
b = Tensor.randn(N, N)
print("Creating computation: (A @ B.T + A).sum()")
lazy_result = (a @ b.T + a).sum()
print("→ No computation done yet (lazy evaluation)")
print("Calling .realize() to execute...")
start = time.time()
realized = lazy_result.realize()
elapsed = time.time() - start
print(f"✓ Computed in {elapsed*1000:.2f}ms")
print(f"Result: {realized.numpy():.4f}")
print("\nNote: Operations were fused into optimized kernels!")By timing realize calls you can observe how fused kernels reduce overhead and speed up large computations.
Part 6 — Custom operations
You can also define custom activations and verify that gradients flow through them just like built-in ops.
print("\n\n PART 6: Custom Operations")
print("-" * 60)
def custom_activation(x):
return x * x.sigmoid()
x = Tensor([[-2.0, -1.0, 0.0, 1.0, 2.0]], requires_grad=True)
y = custom_activation(x)
loss = y.sum()
loss.backward()
print(f"Input: {x.numpy()}")
print(f"Swish(x): {y.numpy()}")
print(f"Gradient: {x.grad.numpy()}")
print("\n\n" + "=" * 60)
print(" Tutorial Complete!")
print("=" * 60)
print("""
Key Concepts Covered:
1. Tensor operations with automatic differentiation
2. Custom neural network layers (Attention, Transformer)
3. Building a mini-GPT language model from scratch
4. Training loop with Adam optimizer
5. Lazy evaluation and kernel fusion
6. Custom activation functions
""
)Working through all parts gives you a transparent view of how modern transformer models operate beneath high level frameworks. The complete codebase referenced in the tutorial provides runnable notebooks and full examples to reproduce experiments and extend them on your own.