StreamTensor: Streaming LLM Intermediates from PyTorch to FPGA Dataflows
What StreamTensor aims to do
StreamTensor is a compiler that transforms PyTorch LLM graphs (GPT-2, Llama, Qwen, Gemma) into stream-scheduled dataflow accelerators targeting AMD’s Alveo U55C FPGA. Instead of treating inference as batched kernels that move intermediate tiles back and forth to DRAM, StreamTensor pipes tiles through on-chip FIFOs and converter kernels, minimizing off-chip round-trips and reducing latency and energy use.
Core concepts and flow
Models enter the toolchain via Torch-MLIR, are lowered to MLIR Linalg, and then translated into a dataflow IR. Nodes in that IR become hardware kernels with explicit stream interfaces and associated host/runtime glue. The system automates kernel emission, DMA insertion, and runtime wiring without manual RTL assembly.
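To make the entry point concrete, here is a minimal sketch of exporting a small PyTorch module to the Linalg dialect with Torch-MLIR. The exact API depends on the torch-mlir release installed (newer versions use an FX-based export path); the `torch_mlir.compile` call and the toy model below are illustrative, not StreamTensor's own driver code.

```python
# Minimal sketch: lower a PyTorch module to MLIR Linalg via Torch-MLIR.
# Illustrative only -- older torch-mlir releases expose torch_mlir.compile;
# newer ones use torch_mlir.fx.export_and_import instead.
import torch
import torch_mlir

class TinyMLP(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(256, 1024)
        self.fc2 = torch.nn.Linear(1024, 256)

    def forward(self, x):
        return self.fc2(torch.nn.functional.gelu(self.fc1(x)))

module = torch_mlir.compile(
    TinyMLP().eval(),
    torch.randn(1, 256),
    output_type="linalg-on-tensors",  # Linalg-on-tensors IR for downstream passes
)
print(module)  # MLIR text with linalg.matmul / linalg.generic ops
```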
The compiler introduces an iterative tensor type, or itensor, that encodes iteration order, tiling, and layout. By making stream order explicit, itensors enable safe inter-kernel streaming and drive the generation of minimal buffer and layout converters only where producer and consumer formats differ.
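As a rough mental model, an itensor can be thought of as a tensor plus a description of how its tiles arrive on a stream; a converter kernel is needed only when the producer's and consumer's descriptions disagree. The field names below are invented for exposition and do not mirror StreamTensor's actual MLIR type:

```python
# Illustrative model of the itensor idea: a tensor type that also records how
# its elements are streamed (tile shape and tile-loop order). These fields are
# invented for exposition and are not StreamTensor's actual IR.
from __future__ import annotations
from dataclasses import dataclass

@dataclass(frozen=True)
class ITensor:
    shape: tuple[int, ...]       # logical tensor shape
    tile: tuple[int, ...]        # tile size per dimension
    loop_order: tuple[int, ...]  # order in which tile loops are traversed

def converter_needed(producer: ITensor, consumer: ITensor) -> bool:
    """Two kernels can be connected by a raw FIFO only if the producer emits
    tiles in exactly the shape and order the consumer expects; otherwise a
    buffer/layout converter must be inserted on that edge."""
    assert producer.shape == consumer.shape, "logical shapes must match"
    return producer.tile != consumer.tile or producer.loop_order != consumer.loop_order

# Example: a matmul emits 64x64 tiles row-major, the next kernel reads them
# column-major, so this edge needs a small reordering buffer.
a = ITensor(shape=(1024, 1024), tile=(64, 64), loop_order=(0, 1))
b = ITensor(shape=(1024, 1024), tile=(64, 64), loop_order=(1, 0))
print(converter_needed(a, b))  # True -> insert a layout converter
```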
Hierarchical design-space exploration
StreamTensor searches three interleaved design spaces:
- Linalg-level transformations: tiling, unrolling, vectorization, and permutation
- Fusion choices under memory and resource constraints
- Resource allocation and stream-width selection
This hierarchical DSE optimizes sustained throughput subject to bandwidth and on-chip memory limits.
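A hedged sketch of what that hierarchy looks like in practice: an outer level over Linalg transform choices, a middle level over fusion groupings, and an inner level assigning stream widths, each pruned against a resource budget. The candidate sets, budgets, and cost model below are invented placeholders standing in for the compiler's real analytical models.

```python
# Toy hierarchical DSE: nested enumeration with resource pruning.
# Candidate sets, budgets, and the cost model are illustrative only.
TILE_CANDIDATES = [(32, 32), (64, 64), (128, 64)]
FUSION_CANDIDATES = [("matmul", "gelu"), ("matmul+gelu",)]  # separate vs. fused kernels
STREAM_WIDTHS = [64, 128, 256]                              # bits per cycle
BRAM_BUDGET = 2016                                          # 18 Kb blocks, U55C-class

def cost(tile, groups, width):
    """Toy model returning (throughput proxy, BRAM blocks used)."""
    kernels = len(groups)
    bram = kernels * (tile[0] * tile[1] * width) // (18 * 1024)
    throughput = width / (kernels * (tile[0] + tile[1]))
    return throughput, bram

best = None
for tile in TILE_CANDIDATES:              # level 1: Linalg-level transformations
    for groups in FUSION_CANDIDATES:      # level 2: fusion choices
        for width in STREAM_WIDTHS:       # level 3: resource / stream-width selection
            tput, bram = cost(tile, groups, width)
            if bram > BRAM_BUDGET:
                continue                  # prune: exceeds on-chip memory budget
            if best is None or tput > best[0]:
                best = (tput, tile, groups, width)

print("best configuration:", best)
```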
FIFO sizing and deadlock avoidance
Rather than relying on heuristics, StreamTensor formulates FIFO sizing as a linear program. The LP computes buffer sizes that avoid stalls and deadlocks while minimizing BRAM/URAM usage. That formal approach is what guarantees correctness when kernels stream tiles to one another on-chip.
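To see the shape of such a formulation, consider a toy fork-join graph where one branch has a much deeper pipeline than the other: buffering along the shallow branch must absorb the latency gap, or the fork stalls and the graph can deadlock. The sketch below encodes that single constraint as a linear program with scipy; the latencies, widths, and constraint are invented for illustration and are far simpler than the paper's actual formulation.

```python
# Toy LP for FIFO sizing on a fork-join dataflow graph. Simplified sketch:
# latencies, widths, and the single path constraint are illustrative only.
import numpy as np
from scipy.optimize import linprog

# FIFO edges and their data widths in bits; depths are the LP variables.
fifos = [("fork->a", 32), ("fork->b", 32), ("a->join", 32), ("b->join", 32)]
widths = np.array([w for _, w in fifos], dtype=float)

lat_branch_a = 40    # assumed pipeline latency (cycles) of the shallow branch
lat_branch_b = 300   # assumed pipeline latency of the deep branch

# Deadlock avoidance (toy form): buffering along the shallow branch must cover
# the latency gap so the fork never blocks while the join waits on branch b.
#   depth(fork->a) + depth(a->join) >= lat_branch_b - lat_branch_a
# linprog expects A_ub @ x <= b_ub, so the inequality is negated.
A_ub = np.array([[-1.0, 0.0, -1.0, 0.0]])
b_ub = np.array([-(lat_branch_b - lat_branch_a)])

# Objective: minimize on-chip memory, i.e. sum(width_e * depth_e).
res = linprog(c=widths, A_ub=A_ub, b_ub=b_ub,
              bounds=[(2, None)] * len(fifos),  # every FIFO needs at least 2 slots
              method="highs")

for (name, _), depth in zip(fifos, res.x):
    print(f"{name}: depth >= {int(np.ceil(depth))}")
```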
Performance highlights
On LLM decoding workloads the team reports significant gains: geometric-mean latency of 0.64× relative to a GPU baseline on GPT-2, and latency as low as 0.76× relative to prior FPGA LLM accelerators. Energy efficiency reaches up to 1.99× that of an A100 on certain models. These results are for decoding workloads and depend on the model and configuration.
Platform context and scope
The work targets the Alveo U55C platform (16 GB HBM2, 460 GB/s, PCIe Gen3×16 or dual Gen4×8, dual QSFP28). The streaming dataflow design leverages HBM bandwidth and on-chip BRAM/URAM to keep intermediate tiles moving on-chip and to limit DRAM DMAs to cases where they are strictly necessary.
Why this matters
StreamTensor demonstrates that a compiler-driven streaming approach can replace many DRAM round-trips in LLM decoding, yielding lower latency and higher energy efficiency on specialized FPGA hardware. The key enablers are the itensor typing abstraction, automated converter synthesis, hierarchical DSE, and LP-based FIFO sizing that together make inter-kernel streaming provably safe and efficient.
Where to read more
The authors provide a paper with the technical details and an accompanying GitHub repository with tutorials, code, and notebooks. The results focus on decoding workloads and a single FPGA platform, but the ideas around itensors and formal FIFO sizing apply broadly to compiler-driven dataflow accelerators.