How CUDA, ROCm, Triton and TensorRT Shape GPU AI Performance: Compiler Paths and Tuning Tips
Deep-learning throughput depends on how effectively a compiler stack maps high-level tensor programs to GPU execution: thread and block schedules, memory movement, and instruction selection for matrix pipelines. Below I compare four dominant stacks from a compiler and tuning perspective and highlight which optimizations matter most in practice.
What drives performance on modern GPUs
Across vendors the same levers determine performance:
- Operator scheduling and fusion: reducing kernel launches and round-trips to HBM lets producer->consumer chains reuse registers and shared memory. Runtime fusion engines in libraries are a key example for attention and convolution blocks.
- Tiling and data layout: tiles must match native fragment sizes for Tensor Cores or vendor-specific matrix pipelines; avoid shared-memory bank conflicts and partition camping. Warp-level GEMM tiling is canonical guidance here.
- Precision and quantization: FP16/BF16/FP8 for training and mixed-precision inference; INT8/INT4 via calibration or QAT for inference. Builder tools automate calibration and kernel selection at these precisions (a minimal mixed-precision sketch follows this list).
- Graph capture and runtime specialization: capturing graphs amortizes launch overheads and enables dynamic fusion of common subgraphs such as attention.
- Autotuning: searching tile sizes, unroll factors, and pipelining depths per architecture and SKU is often decisive. Tooling that exposes autotune hooks or builder-time tactic selection speeds practical tuning.
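To make the precision lever concrete, here is a minimal mixed-precision training sketch, assuming PyTorch on a CUDA device; the two-layer model and shapes are illustrative. Autocast keeps matmuls in half precision on the matrix pipelines while the gradient scaler guards FP16 gradients against underflow.

```python
import torch

# Illustrative model and data; shapes are arbitrary.
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 4096)
).cuda()
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()  # loss scaling: needed for FP16, usually not for BF16

x = torch.randn(64, 4096, device="cuda")
target = torch.randn(64, 4096, device="cuda")

for _ in range(10):
    opt.zero_grad(set_to_none=True)
    # Matmuls run in FP16 on the matrix units; reductions stay in FP32.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = torch.nn.functional.mse_loss(model(x), target)
    scaler.scale(loss).backward()  # scale the loss so FP16 gradients stay representable
    scaler.step(opt)
    scaler.update()
```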
CUDA: nvcc/ptxas, cuDNN, CUTLASS, and CUDA Graphs
Compiler path. CUDA code compiles through nvcc into PTX, then ptxas lowers PTX to SASS, the arch-specific machine code. Controlling optimization requires passing flags to both host and device phases; for kernels the critical knobs come through -Xptxas. Remember that host-only flags like -O3 do not automatically affect device-side lowering.
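As one concrete way to route flags to the device phase, the sketch below builds a trivial kernel through PyTorch's inline extension loader and passes -Xptxas options via extra_cuda_cflags; -O3,-v asks ptxas for full optimization plus a register and spill report. The kernel, extension name, and flag choices are placeholders, and a CUDA toolchain must be installed.

```python
import torch
from torch.utils.cpp_extension import load_inline

# The definition lives in the .cu translation unit; load_inline prepends
# <torch/extension.h>, <cuda.h>, and <cuda_runtime.h> to cuda_sources.
cuda_src = r"""
__global__ void scale_kernel(const float* x, float* y, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = s * x[i];
}

torch::Tensor scale(torch::Tensor x, double s) {
    auto y = torch::empty_like(x);
    int n = static_cast<int>(x.numel());
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale_kernel<<<blocks, threads>>>(x.data_ptr<float>(), y.data_ptr<float>(),
                                      static_cast<float>(s), n);
    return y;
}
"""

# The C++ side only needs the declaration so bindings can be generated.
cpp_src = "torch::Tensor scale(torch::Tensor x, double s);"

ext = load_inline(
    name="scale_ext",
    cpp_sources=cpp_src,
    cuda_sources=cuda_src,
    functions=["scale"],
    extra_cuda_cflags=["-Xptxas", "-O3,-v"],  # device-side: optimize and print register usage
    verbose=True,                             # show the nvcc/ptxas invocations and output
)

x = torch.randn(1 << 20, device="cuda")
print(ext.scale(x, 2.0).sum())
```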
Kernel generation and libraries. CUTLASS provides parametric templates for GEMM and conv that implement warp-level tiling, Tensor Core MMA pipelines, and shared-memory iterators designed for conflict-free access. CUTLASS is a canonical reference for writing high-performance kernels and for understanding how to map tiles to hardware fragments. cuDNN 9 ships runtime fusion engines with native CUDA Graph integration and performance tuning for newer compute capabilities, materially reducing dispatch overhead for Transformer workloads.
Performance implications. Replacing unfused operator sequences with cuDNN attention fusion typically cuts kernel launches and global memory traffic; combined with CUDA Graphs this reduces CPU bottlenecks for short-sequence inference. On Hopper and Blackwell, aligning tile shapes to WGMMA and WMMA native fragment sizes is decisive: mis-sized tiles waste tensor-core throughput. Use CUDA when you need maximal control over instruction selection, occupancy, and shared-memory choreography or when you must extend kernels beyond library coverage while remaining on NVIDIA hardware.
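Below is a minimal sketch of graph capture at the framework level, assuming PyTorch on a CUDA device; the same idea applies to cuDNN's native graph integration at the library level. Inputs must live in fixed buffers that are reused across replays.

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024), torch.nn.ReLU(), torch.nn.Linear(1024, 1024)
).cuda().half().eval()

static_input = torch.randn(8, 1024, device="cuda", dtype=torch.float16)

# Warm up on a side stream so capture sees steady-state allocations.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture one forward pass; replays skip per-kernel launch overhead.
g = torch.cuda.CUDAGraph()
with torch.no_grad(), torch.cuda.graph(g):
    static_output = model(static_input)

# Steady state: copy fresh data into the captured buffer, then replay.
static_input.copy_(torch.randn_like(static_input))
g.replay()
print(static_output.float().norm().item())
```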
ROCm: HIP and the LLVM toolchain, rocBLAS/MIOpen, and the 6.x improvements
Compiler path. ROCm compiles HIP C++ through Clang and LLVM's AMDGPU backend down to the GPU ISA. Recent 6.x releases have focused on performance and framework coverage; the release notes track component-level optimizations and expanded hardware and OS support.
Libraries and kernels. rocBLAS and MIOpen implement GEMM and conv primitives with arch-aware tiling and algorithm selection, similar in spirit to cuBLAS and cuDNN. ROCm has been improving Triton enablement on AMD GPUs so Python-level kernel authoring can lower through LLVM to AMD backends.
Performance implications. On AMD hardware, matching LDS (shared memory) bank widths and vectorized global loads to matrix tile shapes is as pivotal as shared-memory bank alignment on NVIDIA. Compiler-assisted fusion in frameworks plus library autotuning in rocBLAS and MIOpen often closes a large fraction of the gap to handwritten kernels, depending on architecture and driver. Use ROCm when you need native optimization on AMD accelerators and want HIP portability from CUDA-style kernels.
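A small sketch of what that portability looks like at the framework level, assuming a PyTorch build for either stack: ROCm wheels report a HIP version and expose HIP through the familiar "cuda" device namespace, so the same tensor code dispatches to rocBLAS (or hipBLASLt) on AMD and cuBLAS on NVIDIA.

```python
import torch

# ROCm builds report a HIP version and reuse the "cuda" device type,
# so device-agnostic code needs no changes.
if torch.version.hip is not None:
    print("ROCm/HIP build:", torch.version.hip)
elif torch.version.cuda is not None:
    print("CUDA build:", torch.version.cuda)

x = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
w = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
y = x @ w  # lowered to the vendor GEMM library on either stack
print(y.shape, y.dtype)
```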
Triton: a Python-embedded DSL and compiler for custom kernels
Compiler path. Triton kernels are written in Python and lowered through Triton's IR to LLVM; the compiler handles vectorization, memory coalescing, and register allocation while exposing block-size and program-id controls to the author. This automates many error-prone CUDA-level optimizations while still letting authors choose block-level tiling.
Optimizations and autotuning. Triton exposes autotuning over tile sizes, num_warps, and pipelining stages (num_stages). It supports masked loads and stores so boundary tiles need no scalar tail code, shared-memory staging, and software pipelining to overlap global loads with compute.
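A minimal autotuned kernel, sketched assuming a recent Triton release; the kernel itself is a toy scaled copy, but it shows the autotune decorator searching tile size, num_warps, and num_stages, plus boundary masking on loads and stores.

```python
import torch
import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({"BLOCK": 256}, num_warps=2, num_stages=2),
        triton.Config({"BLOCK": 1024}, num_warps=4, num_stages=3),
        triton.Config({"BLOCK": 4096}, num_warps=8, num_stages=4),
    ],
    key=["n_elements"],  # re-run the search when this argument changes
)
@triton.jit
def scaled_copy_kernel(x_ptr, y_ptr, scale, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements              # boundary masking: no scalar tail loop
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(y_ptr + offs, x * scale, mask=mask)

def scaled_copy(x: torch.Tensor, scale: float) -> torch.Tensor:
    y = torch.empty_like(x)
    n = x.numel()
    # The winning config's BLOCK is injected into the launch grid via `meta`.
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK"]),)
    scaled_copy_kernel[grid](x, y, scale, n)
    return y
```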
Performance implications. Triton excels for fused, shape-specialized kernels that fall outside library coverage, such as bespoke attention variants or fused normalization-activation-matmul chains. On modern NVIDIA architectures vendor collaborations have reduced the penalty versus CUTLASS-style kernels for common GEMMs. Choose Triton when you want near-CUDA performance for custom fused ops without writing SASS or WMMA code, and you value Python-first iteration.
TensorRT and TensorRT-LLM: builder-time graph optimization for inference
Compiler path. TensorRT ingests ONNX or framework graphs and emits a hardware-specific engine. The builder phase performs layer and tensor fusion, precision calibration for INT8 or FP8/FP16, and kernel tactic selection. TensorRT-LLM extends these capabilities with LLM-targeted runtime optimizations.
Optimizations. Builder-time work includes constant folding, concat-slice canonicalization, conv-bias-activation fusion, and attention fusion. Precision workflows cover post-training calibration and SmoothQuant or QAT integrations. Runtime features include paged KV caches, in-flight batching, and scheduling for multi-stream and multi-GPU deployments.
Performance implications. The biggest wins come from end-to-end INT8 or FP8 when supported, removing framework overhead by exporting a single engine, and aggressive attention fusion. TensorRT produces per-architecture engine plans that avoid generic kernels at runtime, yielding substantial inference throughput and latency improvements in production.
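A hedged sketch of the builder path using the TensorRT Python API (exact enums vary slightly across 8.x/9.x/10.x); "model.onnx", "model.plan", and the workspace size are placeholders. INT8 additionally needs a calibrator or a pre-quantized Q/DQ graph, which is omitted here.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)

# Explicit-batch network definition (the default in newer releases).
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:  # placeholder path
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parse failed")

config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1 GiB scratch
config.set_flag(trt.BuilderFlag.FP16)   # let the builder pick FP16 tactics where profitable
# config.set_flag(trt.BuilderFlag.INT8) # plus calibration or a Q/DQ graph for INT8

# The builder performs fusion, precision assignment, and kernel tactic selection here.
engine_bytes = builder.build_serialized_network(network, config)
with open("model.plan", "wb") as f:
    f.write(engine_bytes)
```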
Practical guidance: choosing and tuning the stack
- Training and experimental kernels: prefer CUDA + CUTLASS on NVIDIA or ROCm + rocBLAS/MIOpen on AMD; use Triton when you need fused custom ops.
- Production inference on NVIDIA: use TensorRT or TensorRT-LLM to capture global graph-level gains and quantization benefits.
- Exploit architecture-native instructions: map tiles to WGMMA/WMMA on Hopper/Blackwell and align LDS usage and vector widths on AMD.
- Fuse first, then quantize: reduce memory traffic with fusion, then apply quantization to increase math density. TensorRT builder-time fusions plus INT8/FP8 often deliver multiplicative gains.
- Use graph execution for short sequences: CUDA Graphs integrated with cuDNN attention fusions amortize launch overheads in autoregressive inference.
- Treat compiler flags as first-class: for CUDA remember device-side flags such as -Xptxas -O3,-v and -Xptxas -O0 for debugging; host-only -O3 is not sufficient for device optimization.
For deeper dives and release notes, see the NVIDIA, AMD, and Triton documentation and the corresponding library repositories.