GPZ: GPU-First Lossy Compression That Solves Particle-Data Bottlenecks
GPZ is a GPU-optimized, error-bounded lossy compressor for particle data that delivers dramatically higher throughput and better compression ratios than existing tools while preserving scientific fidelity.
Why particle (point-cloud) data is hard to compress
Particle datasets represent systems as collections of discrete points rather than structured grids. That lack of regular spatial structure means low spatial and temporal coherence and very little redundancy — precisely what traditional lossless and many generic lossy compressors depend on. As a result, common approaches like downsampling sacrifice fidelity and reproducibility, while mesh-focused compressors deliver poor ratios and low GPU throughput when applied to particle data.
Real examples show the scale of the problem: a single cosmological snapshot from Summit was 70 TB, and national-scale terrain point clouds easily exceed hundreds of terabytes. Handling, storing, and analyzing these volumes without creating GPU performance bottlenecks is a major challenge for fields such as cosmology, geology, molecular dynamics, and 3D imaging.
GPZ pipeline: four GPU-optimized stages
GPZ is designed end-to-end for modern GPUs and particle-specific data characteristics. Its compression pipeline runs in parallel on the device and consists of four primary stages:
- Spatial quantization: Floating-point particle positions are mapped to integer segment IDs and offsets with respect to user-specified error bounds. The implementation favors FP32 arithmetic for peak GPU throughput and chooses segment sizes to maximize occupancy.
- Spatial sorting: Within each block (mapped to a CUDA warp) particles are sorted by segment ID using warp-level primitives to avoid heavy synchronization. Block-level sorting strikes a balance between compression ratio and shared-memory pressure.
- Lossless encoding: A parallel combination of run-length and delta encoding removes redundancy from sorted segment IDs and quantized offsets. Bit-plane coding further eliminates zero bits, with careful attention to GPU-friendly memory access.
- Compacting: A three-step device-level assembly compresses blocks into contiguous output buffers with minimal synchronization, achieving memory throughput close to hardware limits (authors report up to 809 GB/s on an RTX 4090).
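To make the first three stages concrete, here is a minimal NumPy sketch of error-bounded quantization, segment sorting, and run-length + delta encoding. The function names and the `seg_size=1024` constant are illustrative assumptions, not GPZ's actual kernels; the real implementation runs these steps as warp-parallel CUDA code.

```python
import numpy as np

def quantize(positions, eb, seg_size=1024):
    """Stage 1 (sketch): map float positions to integer segment IDs and
    offsets. `eb` is the absolute error bound: positions snap to a grid
    of spacing 2*eb, so reconstruction error stays within eb."""
    q = np.round(positions / (2.0 * eb)).astype(np.int64)  # grid index
    seg_ids = q // seg_size           # coarse segment ID
    offsets = q - seg_ids * seg_size  # fine offset within the segment
    return seg_ids, offsets

def sort_block(seg_ids, offsets):
    """Stage 2 (sketch): sort a block's particles by segment ID. GPZ
    does this per warp with warp-level primitives; argsort stands in."""
    order = np.argsort(seg_ids, kind="stable")
    return seg_ids[order], offsets[order]

def rle_delta_encode(sorted_seg_ids):
    """Stage 3 (sketch): run-length encode the sorted segment IDs and
    delta-encode the run values, so repeated and adjacent IDs shrink."""
    change = np.flatnonzero(np.diff(sorted_seg_ids)) + 1
    starts = np.concatenate(([0], change))
    run_vals = sorted_seg_ids[starts]
    run_lens = np.diff(np.concatenate((starts, [len(sorted_seg_ids)])))
    deltas = np.diff(run_vals, prepend=0)  # small values compress well
    return deltas, run_lens

# Round-trip check on synthetic particles: the error bound holds.
rng = np.random.default_rng(0)
pos = rng.uniform(-50.0, 50.0, size=4096).astype(np.float32)
eb = 1e-3  # user-specified absolute error bound
seg, off = quantize(pos, eb)
recon = (seg * 1024 + off) * (2.0 * eb)
assert np.all(np.abs(pos - recon) <= 1.01 * eb)  # tiny FP32 slack
seg_s, off_s = sort_block(seg, off)
deltas, run_lens = rle_delta_encode(seg_s)
assert run_lens.sum() == len(pos)
```

The key property the sketch preserves from the paper is that quantization alone bounds the reconstruction error, so everything downstream (sorting, encoding, compacting) can be fully lossless.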
Decompression mirrors these steps in reverse: extract, decode, and reconstruct positions while keeping the user-provided error bounds, enabling high-fidelity post-hoc analysis.
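The bit-plane idea from the encoding stage, and its inverse used during decompression, can be sketched as follows. This is a simplified model under my own assumptions (per-plane any-nonzero test, 16-bit width); GPZ's actual bit-plane coder is organized for coalesced GPU memory access.

```python
import numpy as np

def bitplane_encode(vals, width=16):
    """Sketch of bit-plane coding: split non-negative integers into
    per-bit planes and keep only planes that contain a set bit, so
    leading zero bits across the whole block cost nothing."""
    planes = [((vals >> b) & 1).astype(np.uint8) for b in range(width)]
    return [(b, p) for b, p in enumerate(planes) if p.any()]

def bitplane_decode(kept_planes, n):
    """Inverse step during decompression: OR the stored planes back
    into place to recover the original integers exactly."""
    out = np.zeros(n, dtype=np.int64)
    for b, p in kept_planes:
        out |= p.astype(np.int64) << b
    return out

vals = np.array([3, 1, 2, 0], dtype=np.int64)
kept = bitplane_encode(vals)            # only planes 0 and 1 survive
assert np.array_equal(bitplane_decode(kept, len(vals)), vals)
```

Because small quantized offsets leave most high bit planes all-zero, dropping those planes is where much of the lossless gain comes from.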
Hardware-aware optimizations
GPZ stands out because it aggressively tunes for GPU microarchitecture. Key techniques include:
- Memory coalescing and 4-byte alignment to maximize DRAM bandwidth and avoid strided penalties.
- Conservative register and shared-memory use to maintain high occupancy; FP32 precision is preferred where it doesn't violate error constraints.
- One-warp-per-block mapping, explicit use of CUDA intrinsics (e.g., FMA), and loop unrolling to extract compute performance.
- Replacing expensive division and modulo operations with precomputed reciprocals and bitwise masks where appropriate.
These choices let GPZ exploit GPU parallelism and memory systems in ways that generic compressors typically do not.
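The last bullet, strength reduction, is easy to illustrate. In this hedged sketch the constants are hypothetical: a power-of-two segment size lets integer division and modulo become a shift and a mask, and a precomputed reciprocal turns the per-particle divide into a multiply that pairs naturally with FMA on the GPU.

```python
# Hypothetical constants for illustration (not GPZ's actual values).
SEG_SIZE = 1024          # power of two by design
SEG_SHIFT = 10           # log2(SEG_SIZE)
SEG_MASK = SEG_SIZE - 1

def split_fast(q):
    """q // SEG_SIZE and q % SEG_SIZE without any divide (q >= 0)."""
    return q >> SEG_SHIFT, q & SEG_MASK

def quantize_fast(x, eb):
    """Division by the quantum replaced with a precomputed reciprocal:
    one multiply per particle instead of one divide."""
    inv_quantum = 1.0 / (2.0 * eb)  # computed once, reused per particle
    return round(x * inv_quantum)

assert split_fast(5000) == (4, 904)       # 5000 = 4*1024 + 904
assert quantize_fast(3.2, 1e-3) == 1600   # 3.2 / 0.002 = 1600
```

On a GPU, the shift/mask forms avoid the multi-cycle integer divide unit, and the reciprocal multiply keeps the quantization stage on the fast FP32 path.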
Benchmarks across real-world datasets and GPUs
The authors evaluated GPZ on six particle datasets from cosmology, geology, plasma physics, and molecular dynamics across three GPU classes: consumer (RTX 4090), data-center (H100 SXM), and edge (Nvidia L4). Baselines included cuSZp2, PFPL, FZ-GPU, cuSZ, and cuSZ-i — tools often designed for structured scientific meshes.
GPZ remained robust even when other tools failed or degraded sharply on datasets larger than a few gigabytes. Reported highlights:
- Throughput: up to 8x faster than the next-best compressor, with average compression throughputs reported as 169 GB/s (L4), 598 GB/s (RTX 4090), and 616 GB/s (H100). Decompression rates were even higher.
- Compression ratio: up to 600% higher than the next-best tool in difficult settings, and typically the best across experiments. In the few cases where another method edged ahead on ratio, GPZ still held a 3x–6x throughput advantage.
- Data quality: rate–distortion curves showed higher PSNR at lower bitrates, and visual comparisons (including 10x magnified views) found GPZ reconstructions nearly indistinguishable from originals while competitors produced visible artifacts.
What this means for scientific workflows
GPZ raises the practical ceiling for in-situ and post-hoc compression of particle data on GPUs. For practitioners it delivers:
- Error-bounded, high-throughput compression that preserves scientific fidelity for downstream analysis and visualization.
- Practical performance on both consumer and HPC GPUs, enabling workflows that would otherwise require massive storage or aggressive lossy reductions.
- A template for designing device-focused compression algorithms: prioritize hardware-aware primitives, minimize synchronization, and trade precision where safe to sustain peak throughput.
As datasets continue to grow, GPU-first compressors like GPZ are likely to become a core component of large-scale simulation and point-cloud processing stacks.
Source: https://arxiv.org/abs/2508.10305
Feel free to check out the paper and the project GitHub for code, tutorials, and notebooks.