How CUDA, ROCm, Triton and TensorRT Shape GPU AI Performance: Compiler Paths and Tuning Tips

Deep-learning throughput depends on how effectively a compiler stack maps high-level tensor programs to GPU execution: thread and block schedules, memory movement, and instruction selection for matrix pipelines. Below I compare four dominant stacks from a compiler and tuning perspective and highlight which optimizations matter most in practice.

What drives performance on modern GPUs

Across vendors the same levers determine performance: how tile shapes map onto the native matrix-pipeline fragment sizes, whether shared memory (LDS on AMD) is accessed without bank conflicts, whether global loads are coalesced and vectorized, how much global-memory traffic operator fusion removes, how much launch and dispatch overhead the host incurs, and which reduced precisions the hardware can exploit. The sections below look at how each stack exposes these levers.

CUDA: nvcc/ptxas, cuDNN, CUTLASS, and CUDA Graphs

Compiler path. CUDA code compiles through nvcc into PTX, then ptxas lowers PTX to SASS, the arch-specific machine code. Controlling optimization requires passing flags to both host and device phases; for kernels the critical knobs come through -Xptxas. Remember that host-only flags like -O3 do not automatically affect device-side lowering.
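As a concrete illustration of the two-phase flag handling, the sketch below JIT-compiles a toy CUDA kernel through PyTorch's torch.utils.cpp_extension.load_inline, passing host flags via extra_cflags and device-phase flags via extra_cuda_cflags so that -Xptxas actually reaches ptxas. The kernel and flag choices are illustrative only; a working nvcc toolchain and a CUDA-enabled PyTorch build are assumed.

```python
# Sketch: forwarding device-phase flags (-Xptxas) alongside host flags when
# JIT-compiling a CUDA kernel through PyTorch's cpp_extension helper.
# Assumes a working nvcc toolchain and a CUDA-enabled PyTorch build.
import torch
from torch.utils.cpp_extension import load_inline

cuda_src = r"""
#include <torch/extension.h>

__global__ void scale_kernel(const float* x, float* y, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i];
}

torch::Tensor scale(torch::Tensor x, double a) {
    auto y = torch::empty_like(x);
    int n = x.numel();
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale_kernel<<<blocks, threads>>>(x.data_ptr<float>(), y.data_ptr<float>(),
                                      static_cast<float>(a), n);
    return y;
}
"""

mod = load_inline(
    name="scale_ext",
    cpp_sources="torch::Tensor scale(torch::Tensor x, double a);",
    cuda_sources=cuda_src,
    functions=["scale"],
    extra_cflags=["-O3"],                     # host-side compilation only
    extra_cuda_cflags=["-O3", "-Xptxas=-v"],  # device phase: ptxas -v reports registers/shared memory
)

x = torch.randn(1 << 20, device="cuda")
y = mod.scale(x, 2.0)
```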

Kernel generation and libraries. CUTLASS provides parametric templates for GEMM and conv that implement warp-level tiling, Tensor Core MMA pipelines, and shared-memory iterators designed for conflict-free access. CUTLASS is a canonical reference for writing high-performance kernels and for understanding how to map tiles to hardware fragments. cuDNN 9 added runtime fusion engines, native CUDA Graph integration for those engines, and tuned updates for newer compute capabilities, materially reducing dispatch overheads for Transformer workloads.
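One way to exercise cuDNN's fused attention path from a framework, assuming a recent CUDA build of PyTorch that exposes the cuDNN backend through torch.nn.attention, is to restrict the allowed backends around scaled_dot_product_attention. Treat this as a sketch; backend availability depends on the installed PyTorch and cuDNN versions.

```python
# Sketch: asking PyTorch to route attention through the cuDNN fused backend,
# with flash attention as a fallback. Assumes a recent CUDA build of PyTorch
# that exposes SDPBackend.CUDNN_ATTENTION; older builds will pick other kernels.
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

q = torch.randn(8, 16, 1024, 64, device="cuda", dtype=torch.float16)  # (batch, heads, seq, head_dim)
k = torch.randn_like(q)
v = torch.randn_like(q)

with sdpa_kernel([SDPBackend.CUDNN_ATTENTION, SDPBackend.FLASH_ATTENTION]):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```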

Performance implications. Replacing unfused operator sequences with cuDNN attention fusion typically cuts kernel launches and global memory traffic; combined with CUDA Graphs this reduces CPU bottlenecks for short-sequence inference. On Hopper and Blackwell, aligning tile shapes to WGMMA and WMMA native fragment sizes is decisive: mis-sized tiles waste tensor-core throughput. Use CUDA when you need maximal control over instruction selection, occupancy, and shared-memory choreography or when you must extend kernels beyond library coverage while remaining on NVIDIA hardware.
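A minimal sketch of the CUDA Graphs point, using PyTorch's capture-and-replay API around the fused attention call; shapes must stay static across replays, and new data is fed by copying into the captured tensors.

```python
# Sketch: capturing a short inference step into a CUDA graph so that replay
# issues one graph launch instead of many individual kernel launches.
# Static input shapes are assumed.
import torch
import torch.nn.functional as F

q = torch.randn(8, 16, 128, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Warm up on a side stream so lazy initialization is not captured into the graph.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        F.scaled_dot_product_attention(q, k, v, is_causal=True)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Later iterations: refresh inputs in place, then replay the whole graph.
q.copy_(torch.randn_like(q))
g.replay()  # 'out' now holds results for the updated inputs
```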

ROCm: HIP and the LLVM toolchain, rocBLAS/MIOpen, and the 6.x improvements

Compiler path. ROCm compiles HIP C++ through Clang/LLVM's AMDGPU backend down to the native AMD ISA. Recent 6.x releases focused on performance and framework coverage; the release notes track component-level optimizations and expanded hardware and OS support.

Libraries and kernels. rocBLAS and MIOpen implement GEMM and conv primitives with arch-aware tiling and algorithm selection, similar in spirit to cuBLAS and cuDNN. ROCm has been improving Triton enablement on AMD GPUs so Python-level kernel authoring can lower through LLVM to AMD backends.
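For orientation, a ROCm build of PyTorch exposes AMD GPUs through the familiar torch.cuda namespace with HIP underneath, so an unmodified matmul dispatches to rocBLAS rather than cuBLAS. The sketch below assumes such a build with a visible AMD accelerator.

```python
# Sketch: the same PyTorch code path on a ROCm build, where torch.cuda is
# backed by HIP and dense matmuls dispatch to rocBLAS instead of cuBLAS.
# Assumes a ROCm build of PyTorch with a visible AMD GPU.
import torch

print("HIP runtime:", torch.version.hip)         # None on CUDA builds, a version string on ROCm
print("Device:", torch.cuda.get_device_name(0))  # the AMD accelerator ROCm exposes

a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
c = a @ b  # lowered to a rocBLAS GEMM on ROCm, a cuBLAS GEMM on CUDA
```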

Performance implications. On AMD hardware, aligning matrix tile shapes with LDS (local data share, AMD's shared memory) bank widths and with vectorized global loads is as pivotal as avoiding shared-memory bank conflicts is on NVIDIA. Compiler-assisted fusion in frameworks plus library autotuning in rocBLAS and MIOpen often closes a large fraction of the gap to handwritten kernels, depending on architecture and driver. Use ROCm when you need native optimization on AMD accelerators and want HIP portability from CUDA-style kernels.

Triton: a Python-embedded DSL and compiler for custom kernels

Compiler path. Triton is a Python-embedded DSL that lowers via LLVM and handles vectorization, memory coalescing, and register allocation while exposing block-size and program-id controls. Triton automates many error-prone CUDA-level optimizations while letting authors choose block-level tiling.
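A minimal Triton kernel showing the controls just described: a program id selecting a tile, a constexpr block size, and a boundary mask in place of a scalar fallback. This is the canonical vector-add shape of a Triton program, not a tuned kernel.

```python
# Minimal Triton kernel: each program instance handles one BLOCK_SIZE tile,
# with a mask guarding the ragged tail instead of a scalar fallback.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                       # which tile this instance owns
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                       # static masking at the boundary
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

x = torch.randn(1_000_003, device="cuda")
y = torch.randn_like(x)
torch.testing.assert_close(add(x, y), x + y)
```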

Optimizations and autotuning. Triton exposes autotuning over tile sizes, num_warps, and pipelining stages. It supports static masking to avoid scalar fallbacks at boundaries, shared-memory staging, and software pipelining to overlap global loads and compute.
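A sketch of those autotuning hooks: the decorator below sweeps tile size, num_warps, and num_stages, benchmarks each config on first launch, and caches the winner per value of the key arguments. The specific configs are illustrative.

```python
# Sketch: Triton autotuning over tile size, warp count, and pipeline stages.
# Triton benchmarks each config on first call and caches the winner per `key`.
import torch
import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({"BLOCK_SIZE": 256},  num_warps=2, num_stages=2),
        triton.Config({"BLOCK_SIZE": 1024}, num_warps=4, num_stages=3),
        triton.Config({"BLOCK_SIZE": 4096}, num_warps=8, num_stages=4),
    ],
    key=["n_elements"],
)
@triton.jit
def scale_kernel(x_ptr, out_ptr, alpha, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, alpha * x, mask=mask)

x = torch.randn(1 << 22, device="cuda")
out = torch.empty_like(x)
grid = lambda meta: (triton.cdiv(x.numel(), meta["BLOCK_SIZE"]),)
scale_kernel[grid](x, out, 2.0, x.numel())  # BLOCK_SIZE is supplied by the autotuner
```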

Performance implications. Triton excels for fused, shape-specialized kernels that fall outside library coverage, such as bespoke attention variants or fused normalization-activation-matmul chains. On modern NVIDIA architectures vendor collaborations have reduced the penalty versus CUTLASS-style kernels for common GEMMs. Choose Triton when you want near-CUDA performance for custom fused ops without writing SASS or WMMA code, and you value Python-first iteration.

TensorRT and TensorRT-LLM: builder-time graph optimization for inference

Compiler path. TensorRT ingests ONNX or framework graphs and emits a hardware-specific engine. The builder phase performs layer and tensor fusion, precision selection with calibration where needed (FP16, FP8, INT8), and kernel tactic selection. TensorRT-LLM extends these capabilities with LLM-targeted runtime optimizations.
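A hedged sketch of that builder flow in the TensorRT Python API: parse an ONNX graph, allow FP16 tactics, and serialize an engine plan. Exact flags and network-creation arguments vary across TensorRT versions, and model.onnx is a placeholder path.

```python
# Sketch: TensorRT builder phase -- parse ONNX, pick precisions, emit a serialized
# engine plan specialized for the local GPU. API details differ across TRT versions.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:          # placeholder path
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parse failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)        # allow FP16 tactics where profitable
# INT8/FP8 additionally require calibration data or quantized (Q/DQ) ONNX inputs.

engine_bytes = builder.build_serialized_network(network, config)
with open("model.plan", "wb") as f:
    f.write(engine_bytes)
```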

Optimizations. Builder-time work includes constant folding, concat-slice canonicalization, conv-bias-activation fusion, and attention fusion. Precision workflows cover post-training calibration as well as SmoothQuant and quantization-aware training (QAT) integrations. Runtime features include paged KV caches, in-flight batching, and scheduling for multi-stream and multi-GPU deployments.

Performance implications. The biggest wins come from end-to-end INT8 or FP8 when supported, removing framework overhead by exporting a single engine, and aggressive attention fusion. TensorRT produces per-architecture engine plans that avoid generic kernels at runtime, yielding substantial inference throughput and latency improvements in production.

Practical guidance: choosing and tuning the stack

Pick the stack by where you need control and where you deploy: CUDA with CUTLASS and cuDNN when you want maximal control over instruction selection and shared-memory choreography on NVIDIA hardware; ROCm and HIP for native optimization on AMD accelerators; Triton when a custom fused kernel outside library coverage must be written and iterated on quickly from Python; TensorRT or TensorRT-LLM when the goal is packaged, low-latency inference. Whatever the stack, the same tuning checklist applies: align tile shapes to the native matrix-pipeline fragment sizes, fuse to cut kernel launches and global-memory traffic, use graphs or prebuilt engines to remove host-side dispatch overhead, drop to FP16, FP8, or INT8 where accuracy allows, and re-autotune whenever the architecture or driver changes.

References are available from NVIDIA, AMD, Triton, and library repositories for further deep dives and release notes.