
Build a Mini‑GPT with Tinygrad: Hands‑On Transformer Internals from Scratch

Step-by-step Tinygrad guide that builds tensors, attention, transformer blocks and a mini-GPT from scratch, with full code and experiments on training and kernel fusion.

Getting started with Tinygrad

This tutorial walks through implementing the core building blocks of modern transformers using Tinygrad. We stay hands-on with tensors, autograd, attention, and transformer block internals, progressively building every component from low-level tensor ops up to a working mini-GPT model.

Part 1 — Tensor operations and autograd

We begin by setting up Tinygrad and exercising basic tensor operations and backpropagation. The short code example below demonstrates matrix multiplication, elementwise ops and gradient flow through a small computation graph.

# Install clang (used by Tinygrad's CPU backend to compile generated kernels) and Tinygrad from source
import subprocess, sys, os
print("Installing dependencies...")
subprocess.check_call(["apt-get", "install", "-qq", "clang"], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "git+https://github.com/tinygrad/tinygrad.git"])
 
 
import numpy as np
from tinygrad import Tensor, nn, Device
from tinygrad.nn import optim
import time
 
 
print(f" Using device: {Device.DEFAULT}")
print("=" * 60)
 
 
print("\n PART 1: Tensor Operations & Autograd")
print("-" * 60)
 
 
x = Tensor([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)
y = Tensor([[2.0, 0.0], [1.0, 2.0]], requires_grad=True)
 
 
z = (x @ y).sum() + (x ** 2).mean()
z.backward()
 
 
print(f"x:\n{x.numpy()}")
print(f"y:\n{y.numpy()}")
print(f"z (scalar): {z.numpy()}")
print(f"∂z/∂x:\n{x.grad.numpy()}")
print(f"∂z/∂y:\n{y.grad.numpy()}")

Running this reveals how Tinygrad constructs a computation graph and propagates gradients through matrix operations. Examining the printed gradients gives intuition about backpropagation mechanics.
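
For this particular graph the gradients also have a simple closed form: since z = (x @ y).sum() + (x ** 2).mean(), we get ∂z/∂x = ones @ y.T + x/2 (the mean over four elements contributes 2x/4) and ∂z/∂y = x.T @ ones, where ones is the 2x2 all-ones matrix. A small NumPy cross-check of the values Tinygrad prints:

# Cross-check the autograd results with hand-derived gradients in NumPy
x_np = np.array([[1.0, 2.0], [3.0, 4.0]])
y_np = np.array([[2.0, 0.0], [1.0, 2.0]])
ones = np.ones((2, 2))

dz_dx = ones @ y_np.T + x_np / 2.0   # grad of (x @ y).sum() plus grad of (x ** 2).mean()
dz_dy = x_np.T @ ones                # grad of (x @ y).sum() with respect to y
print("dz/dx (NumPy):\n", dz_dx)     # should match x.grad.numpy()
print("dz/dy (NumPy):\n", dz_dy)     # should match y.grad.numpy()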

Part 2 — Building custom layers

Next we implement multi‑head attention and a transformer block from raw tensor operations. These classes show the projections for q/k/v, attention score computation, softmax, feedforward layers and a simple layer normalization implementation.

print("\n\n PART 2: Building Custom Layers")
print("-" * 60)
 
 
class MultiHeadAttention:
   def __init__(self, dim, num_heads):
       self.num_heads = num_heads
       self.dim = dim
       self.head_dim = dim // num_heads
       self.qkv = Tensor.glorot_uniform(dim, 3 * dim)
       self.out = Tensor.glorot_uniform(dim, dim)
  
   def __call__(self, x):
       B, T, C = x.shape[0], x.shape[1], x.shape[2]
       qkv = x.reshape(B * T, C).dot(self.qkv).reshape(B, T, 3, self.num_heads, self.head_dim)
       # split into q, k, v and move the head axis forward: (B, num_heads, T, head_dim)
       q = qkv[:, :, 0].transpose(1, 2)
       k = qkv[:, :, 1].transpose(1, 2)
       v = qkv[:, :, 2].transpose(1, 2)
       scale = (self.head_dim ** -0.5)
       # attention scores over the sequence dimension: (B, num_heads, T, T)
       attn = (q @ k.transpose(-2, -1)) * scale
       attn = attn.softmax(axis=-1)
       # merge heads back into (B, T, C)
       out = (attn @ v).transpose(1, 2).reshape(B, T, C)
       return out.reshape(B * T, C).dot(self.out).reshape(B, T, C)
 
 
class TransformerBlock:
   def __init__(self, dim, num_heads):
       self.attn = MultiHeadAttention(dim, num_heads)
       self.ff1 = Tensor.glorot_uniform(dim, 4 * dim)
       self.ff2 = Tensor.glorot_uniform(4 * dim, dim)
       self.ln1_w = Tensor.ones(dim)
       self.ln2_w = Tensor.ones(dim)
  
   def __call__(self, x):
       x = x + self.attn(self._layernorm(x, self.ln1_w))
       ff = x.reshape(-1, x.shape[-1])
       ff = ff.dot(self.ff1).gelu().dot(self.ff2)
       x = x + ff.reshape(x.shape)
       return self._layernorm(x, self.ln2_w)
  
   def _layernorm(self, x, w):
       mean = x.mean(axis=-1, keepdim=True)
       var = ((x - mean) ** 2).mean(axis=-1, keepdim=True)
       return w * (x - mean) / (var + 1e-5).sqrt()

Implementing these pieces from scratch clarifies the roles of each subcomponent in a transformer layer and how shapes move through projections, attention and feedforward paths.
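
As a quick sanity check before moving on, you can push a random batch through one block and confirm that the residual paths preserve the input shape. This is a minimal sketch using the classes defined above:

# Shape check: one transformer block on a random (batch, seq_len, dim) input
blk = TransformerBlock(dim=64, num_heads=4)
dummy = Tensor.randn(2, 8, 64)   # (batch=2, seq_len=8, dim=64)
out = blk(dummy)
print(out.shape)                 # expected: (2, 8, 64), unchanged by the residual paths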

Part 3 — Mini‑GPT architecture

We then assemble a compact MiniGPT by combining token and positional embeddings, stacking transformer blocks and projecting to vocabulary logits.

print("\n PART 3: Mini-GPT Architecture")
print("-" * 60)
 
 
class MiniGPT:
   def __init__(self, vocab_size=256, dim=128, num_heads=4, num_layers=2, max_len=32):
       self.vocab_size = vocab_size
       self.dim = dim
       self.tok_emb = Tensor.glorot_uniform(vocab_size, dim)
       self.pos_emb = Tensor.glorot_uniform(max_len, dim)
       self.blocks = [TransformerBlock(dim, num_heads) for _ in range(num_layers)]
       self.ln_f = Tensor.ones(dim)
       self.head = Tensor.glorot_uniform(dim, vocab_size)
  
   def __call__(self, idx):
       B, T = idx.shape[0], idx.shape[1]
       tok_emb = self.tok_emb[idx.flatten()].reshape(B, T, self.dim)
       pos_emb = self.pos_emb[:T].reshape(1, T, self.dim)
       x = tok_emb + pos_emb
       for block in self.blocks:
           x = block(x)
       mean = x.mean(axis=-1, keepdim=True)
       var = ((x - mean) ** 2).mean(axis=-1, keepdim=True)
       x = self.ln_f * (x - mean) / (var + 1e-5).sqrt()
       return x.reshape(B * T, self.dim).dot(self.head).reshape(B, T, self.vocab_size)
  
   def get_params(self):
       params = [self.tok_emb, self.pos_emb, self.ln_f, self.head]
       for block in self.blocks:
           params.extend([block.attn.qkv, block.attn.out, block.ff1, block.ff2, block.ln1_w, block.ln2_w])
       return params
 
 
model = MiniGPT(vocab_size=256, dim=64, num_heads=4, num_layers=2, max_len=16)
params = model.get_params()
total_params = sum(p.numel() for p in params)
print(f"Model initialized with {total_params:,} parameters")

This composition highlights how a small transformer is built from embedding layers, positional encodings, stacked blocks and a final linear head.
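
Before training, it is worth confirming the wiring with an untrained forward pass on random token IDs. A small sketch using the model instance created above:

# Forward-pass check with random token IDs in [0, vocab_size)
idx = Tensor(np.random.randint(0, 256, (2, 16)), dtype='int32')   # (batch=2, seq_len=16)
logits = model(idx)
print(logits.shape)   # expected: (2, 16, 256), one vocabulary distribution per position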

Part 4 — Training loop

The training loop is a minimal example of fitting the MiniGPT on synthetic data where the target at each position is the previous token. It uses the Adam optimizer and sparse categorical cross-entropy loss.

print("\n\n PART 4: Training Loop")
print("-" * 60)
 
 
def gen_data(batch_size, seq_len):
   x = np.random.randint(0, 256, (batch_size, seq_len))
   y = np.roll(x, 1, axis=1)
   y[:, 0] = x[:, 0]
   return Tensor(x, dtype='int32'), Tensor(y, dtype='int32')
 
 
optimizer = optim.Adam(params, lr=0.001)
losses = []
 
 
print("Training to predict previous token in sequence...")
with Tensor.train():
   for step in range(20):
       start = time.time()
       x_batch, y_batch = gen_data(batch_size=16, seq_len=16)
       logits = model(x_batch)
       B, T, V = logits.shape[0], logits.shape[1], logits.shape[2]
       loss = logits.reshape(B * T, V).sparse_categorical_crossentropy(y_batch.reshape(B * T))
       optimizer.zero_grad()
       loss.backward()
       optimizer.step()
       losses.append(loss.numpy())
       elapsed = time.time() - start
       if step % 5 == 0:
           print(f"Step {step:3d} | Loss: {loss.numpy():.4f} | Time: {elapsed*1000:.1f}ms")

Watching loss values step down during training demonstrates that the model learns the synthetic task and verifies end to end gradients through the custom layers.
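
With the loss trending down, a quick greedy check makes the learned mapping visible. This sketch reuses gen_data and the trained model from above; note that after only 20 steps on a 256-token vocabulary the predictions may match the targets only partially:

# Greedy check: the model should (roughly) reproduce the previous token at each position
x_test, y_test = gen_data(batch_size=1, seq_len=16)
preds = model(x_test).argmax(axis=-1).numpy()   # (1, 16) predicted token IDs
print("inputs :", x_test.numpy()[0][:8])
print("targets:", y_test.numpy()[0][:8])
print("preds  :", preds[0][:8])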

Part 5 — Lazy evaluation and kernel fusion

Tinygrad supports lazy execution and kernel fusion. The snippet below creates a computation graph that only executes when realized, allowing operations to be fused for improved performance.

print("\n\n PART 5: Lazy Evaluation & Kernel Fusion")
print("-" * 60)
 
 
N = 512
a = Tensor.randn(N, N)
b = Tensor.randn(N, N)
 
 
print("Creating computation: (A @ B.T + A).sum()")
lazy_result = (a @ b.T + a).sum()
print("→ No computation done yet (lazy evaluation)")
 
 
print("Calling .realize() to execute...")
start = time.time()
realized = lazy_result.realize()
elapsed = time.time() - start
 
 
print(f"✓ Computed in {elapsed*1000:.2f}ms")
print(f"Result: {realized.numpy():.4f}")
print("\nNote: Operations were fused into optimized kernels!")

By timing realize calls you can observe how fused kernels reduce overhead and speed up large computations.
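
One rough way to see the benefit is to realize each intermediate separately and compare against letting Tinygrad schedule the whole lazy graph at once. This is a sketch reusing the tensors a and b from above; exact timings vary by backend and the gap can be small for a single pass:

# Rough comparison: realizing intermediates separately vs. one fused lazy graph
start = time.time()
step1 = (a @ b.T).realize()            # matmul kernel runs on its own
step2 = (step1 + a).realize()          # separate elementwise kernel
unfused = step2.sum().realize()
print(f"step-by-step: {(time.time() - start) * 1000:.2f}ms")

start = time.time()
fused = (a @ b.T + a).sum().realize()  # whole graph scheduled at once, ops fused where possible
print(f"fused graph : {(time.time() - start) * 1000:.2f}ms")

Running the script with the DEBUG=2 environment variable set should also print the kernels Tinygrad generates, which makes the fusion visible directly.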

Part 6 — Custom operations

You can also define custom activations and verify that gradients flow through them just like built-in ops.

print("\n\n PART 6: Custom Operations")
print("-" * 60)
 
 
def custom_activation(x):
   return x * x.sigmoid()
 
 
x = Tensor([[-2.0, -1.0, 0.0, 1.0, 2.0]], requires_grad=True)
y = custom_activation(x)
loss = y.sum()
loss.backward()
 
 
print(f"Input:    {x.numpy()}")
print(f"Swish(x): {y.numpy()}")
print(f"Gradient: {x.grad.numpy()}")
 
 
print("\n\n" + "=" * 60)
print(" Tutorial Complete!")
print("=" * 60)
print("""
Key Concepts Covered:
1. Tensor operations with automatic differentiation
2. Custom neural network layers (Attention, Transformer)
3. Building a mini-GPT language model from scratch
4. Training loop with Adam optimizer
5. Lazy evaluation and kernel fusion
6. Custom activation functions
""
)
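
As a final cross-check on the custom Swish activation, its analytic derivative is sigmoid(x) + x * sigmoid(x) * (1 - sigmoid(x)). A minimal NumPy sketch that should reproduce the gradient printed in Part 6:

# NumPy cross-check of the Swish gradient from Part 6
x_np = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
sig = 1.0 / (1.0 + np.exp(-x_np))
analytic = sig + x_np * sig * (1.0 - sig)   # d/dx [x * sigmoid(x)]
print("analytic gradient:", analytic)        # should match x.grad.numpy() above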

Working through all parts gives you a transparent view of how modern transformer models operate beneath high-level frameworks. The complete codebase referenced in the tutorial provides runnable notebooks and full examples so you can reproduce the experiments and extend them on your own.
