MIT LEGO: Auto-Compiling High-Performance Spatial Accelerators for AI

Hardware generation without templates

LEGO departs from traditional flows that either analyze dataflows without generating hardware or instantiate fixed-topology RTL templates. Instead, it accepts a high-level description of tensor programs and emits synthesizable RTL for spatial accelerators, enabling architectures tailored to each workload or fused set of workloads without manual template engineering.

Affine, relation-centric input IR

At the front end, LEGO models tensor computations as loop nests with three index classes: temporal (for-loops), spatial (parallel functional units, FUs) and computation (the iteration domain). Two affine relations express how data is mapped and spatialized: f_{I→D} maps computation indices to tensor coordinates, and f_{TS→I} maps temporal/spatial indices to computation indices. This affine-only representation turns reuse detection and address generation into linear-algebra problems and removes expensive modulo/division reasoning from the core analysis. Control is decoupled via a control vector c that encodes signal propagation and delay, enabling shared control across functional units and reducing control-logic overhead.
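
To make this concrete, the sketch below writes both relations for a 1-D convolution as plain integer matrices and composes them; the kernel choice, variable names and matrices are illustrative, not LEGO's API.

    import numpy as np

    # Computation indices I = (i, j): output position i, filter tap j.
    # f_{I->D} for the input tensor: data coordinate d = i + j (affine, no mod/div).
    F_I_to_D = np.array([[1, 1]])      # 1x2 matrix acting on (i, j)

    # f_{TS->I}: the temporal index t drives i, the spatial index s drives j.
    F_TS_to_I = np.array([[1, 0],      # i = t
                          [0, 1]])     # j = s

    # Composing the two affine maps gives the address each FU touches per cycle.
    F_TS_to_D = F_I_to_D @ F_TS_to_I   # d = t + s

    # Reuse detection becomes linear algebra: (t+1, s) and (t, s+1) hit the
    # same input element because their difference vector lies in the null space.
    delta = np.array([1, -1])
    assert (F_TS_to_D @ delta).item() == 0

Because both maps stay affine, finding reuse reduces to null-space checks like the one above, which is what keeps modulo and division out of the core analysis.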

Front end: FU graph and memory co-design

LEGO synthesizes an FU-level architecture by solving reuse equations over the affine relations to discover direct and delayed (FIFO) connections between functional units. It prunes candidate edges using minimum-spanning arborescences (Chu–Liu/Edmonds), where each edge cost models the required FIFO depth, and applies a BFS-based heuristic to rewrite interconnects when multiple dataflows must coexist, prioritizing chain reuse to cut muxes and duplicated data nodes.
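
As an illustration of the pruning step, candidate reuse edges can be handed to an off-the-shelf Chu–Liu/Edmonds implementation with FIFO depth as the edge weight; the toy graph and costs below are invented for the example.

    import networkx as nx

    # Candidate reuse edges between FUs, weighted by the FIFO depth a delayed
    # connection would need; falling back to the buffer is the expensive option.
    G = nx.DiGraph()
    G.add_weighted_edges_from([
        ("mem", "fu0", 4), ("mem", "fu1", 4), ("mem", "fu2", 4),
        ("fu0", "fu1", 1),  # neighbour forwarding needs only a 1-deep FIFO
        ("fu1", "fu2", 1),
        ("fu0", "fu2", 2),
    ])

    # Chu-Liu/Edmonds keeps exactly one incoming edge per FU while minimizing
    # total FIFO cost: here the chain mem -> fu0 -> fu1 -> fu2 survives.
    arb = nx.minimum_spanning_arborescence(G)
    print(sorted(arb.edges(data="weight")))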

On-chip memory is banked automatically: LEGO computes the required bank count from concurrent access patterns and index deltas, optionally shrinking it by dividing through by the GCD of those deltas. Data-distribution switches route between banks and FUs, while direct FU-to-FU reuse is left to the interconnect synthesis.
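
A minimal sketch of the banking arithmetic, assuming the bank count is derived from the spread of concurrent address deltas and shrunk by their GCD (an illustrative heuristic rather than LEGO's exact formula):

    from functools import reduce
    from math import gcd

    def bank_count(deltas):
        # Dividing every delta by the common GCD strips the stride that all
        # concurrent accesses share, so fewer banks avoid the same conflicts.
        g = reduce(gcd, deltas)
        scaled = [d // g for d in deltas]
        return max(scaled) - min(scaled) + 1

    # Four FUs reading base, base+2, base+4, base+6 in the same cycle: the raw
    # address span suggests 7 banks, but the GCD reduction brings it to 4.
    print(bank_count([0, 2, 4, 6]))  # -> 4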

When multiple spatial dataflows are fused, LEGO merges their interconnects into a single Architecture Description Graph (ADG) and carefully plans the merge to avoid naïve mux-heavy designs, yielding measurable energy savings compared to simple fusion strategies.
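
The cost a merge is trying to avoid can be seen in miniature by counting mux inputs per destination port, since each extra distinct source feeding a port adds a mux input; the two toy dataflows below are hypothetical.

    # Two dataflows' interconnects as edge sets (source FU -> destination FU).
    flow_a = {("mem", "fu0"), ("fu0", "fu1"), ("fu1", "fu2")}
    flow_b = {("mem", "fu0"), ("fu0", "fu1"), ("fu0", "fu2")}

    def mux_inputs(edges):
        # A port needs one mux input per distinct source that can drive it.
        fan_in = {}
        for src, dst in edges:
            fan_in.setdefault(dst, set()).add(src)
        return {dst: len(srcs) for dst, srcs in fan_in.items()}

    merged = flow_a | flow_b
    print(dict(sorted(mux_inputs(merged).items())))
    # -> {'fu0': 1, 'fu1': 1, 'fu2': 2}: only fu2 pays for a 2-input mux,
    # because every other edge is shared between the two dataflows.

Maximizing edge overlap before taking the union is what keeps a planned merge away from the mux-heavy result a naive union of dissimilar interconnects would produce.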

Back end: compile and optimize to RTL

The ADG is lowered to a Detailed Architecture Graph (DAG) of primitives such as FIFOs, muxes, adders and address generators. LEGO then applies several LP and graph-transform passes to minimize datapath overhead; one representative pass is sketched below.
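
The sketch casts FIFO delay matching as a small LP over node arrival times, solved with HiGHS through SciPy's linprog; the graph, latencies and formulation are a plausible simplification, not LEGO's published pass.

    import numpy as np
    from scipy.optimize import linprog

    # Each edge (u, v) has a fixed pipeline latency; a FIFO of depth
    # r_v - r_u - lat(u, v) balances the edge, so minimizing total depth
    # is linear in the node arrival times r.
    nodes = ["a", "b", "c", "d"]
    edges = [("a", "b", 1), ("a", "c", 3), ("b", "d", 2), ("c", "d", 1)]
    idx = {n: i for i, n in enumerate(nodes)}

    # Objective: sum over edges of (r_v - r_u), i.e. (in_deg - out_deg) . r;
    # the constant -sum(lat) is dropped.
    c = np.zeros(len(nodes))
    for u, v, _ in edges:
        c[idx[v]] += 1
        c[idx[u]] -= 1

    # One constraint per edge: r_u - r_v <= -lat, i.e. r_v >= r_u + lat.
    A_ub, b_ub = [], []
    for u, v, lat in edges:
        row = np.zeros(len(nodes))
        row[idx[u]], row[idx[v]] = 1.0, -1.0
        A_ub.append(row)
        b_ub.append(-lat)

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    depths = {(u, v): res.x[idx[v]] - res.x[idx[u]] - lat for u, v, lat in edges}
    print(depths)  # the shorter a->b->d path absorbs one unit of FIFO depth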

These datapath-focused optimizations produce substantial resource savings: LEGO reports roughly 35% area reduction versus naïve generation and around 28% energy savings from combined front-end and back-end passes.

Evaluation and outcomes

LEGO is implemented in C++ with HiGHS as the LP solver and emits SpinalHDL that is converted to Verilog. The evaluation covers kernel-level workloads and end-to-end models including AlexNet, MobileNetV2, ResNet-50, EfficientNetV2, BERT, GPT-2, DDPM/Stable Diffusion and LLaMA-7B. A single LEGO-generated accelerator instance is reused across models with a mapper selecting per-layer tiling and dataflow.

Against Gemmini under matched resources (256 MACs, a 256 KB on-chip buffer and a 128-bit bus at 16 GB/s), LEGO achieves a 3.2× average speedup and a 2.4× improvement in energy efficiency. Key drivers are an accurate performance model guiding mapping decisions and dynamic spatial-dataflow switching via the generated interconnects. On larger configurations (e.g., 1024 FUs), LEGO sustains high utilization on generative workloads such as DDPM/Stable Diffusion, while very large models such as LLaMA-7B remain bandwidth-limited.

Who benefits

Researchers gain a mathematically grounded path from loop-nest specifications to spatial hardware with LP-backed optimizations. Practitioners get a hardware-as-code flow that supports arbitrary and fused dataflows without template redesign. Product teams can lower the barrier to task-tuned, power-efficient edge silicon where the accelerator adapts to the model, improving energy and throughput across diverse workloads.