

GPU Throughput From First Principles

Occupancy, memory, and math, treated as one system.

2025 · 3 min read · GPU · Performance · Systems

Start with the roofline. Paper first, numbers second. Estimate peak FLOPs and effective bandwidth for your device, then sketch the slanted line and the flat one. Now place your kernels where they live: memory bound along the left, compute bound up on the shelf. This is not decoration. It is a promise about where time will go when you begin to change things.
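To make that placement concrete, here is a minimal host-side sketch. The peak figures are illustrative stand-ins for a modern data-center part, and the example kernel is SAXPY-like (2 FLOPs per 12 bytes moved); substitute your own device's numbers before drawing conclusions.

```cuda
// Back-of-envelope roofline placement (host-side, compiles under nvcc).
// Peak figures below are placeholders; use your device's specs.
#include <cstdio>

int main() {
    const double peak_flops = 19.5e12;  // hypothetical peak, FLOP/s
    const double peak_bw    = 1.55e12;  // hypothetical effective bandwidth, B/s
    const double ridge      = peak_flops / peak_bw;  // FLOP/byte where lines meet

    // SAXPY-like pass: 2 FLOPs per 12 bytes (read x, read y, write y).
    const double intensity = 2.0 / 12.0;             // FLOP/byte
    const double bound = (intensity < ridge)
        ? intensity * peak_bw                        // slanted line: memory bound
        : peak_flops;                                // flat shelf: compute bound

    printf("ridge point: %.2f FLOP/byte\n", ridge);
    printf("kernel: %.3f FLOP/byte -> bound %.2e FLOP/s (%s)\n",
           intensity, bound, intensity < ridge ? "memory" : "compute");
    return 0;
}
```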

Memory is the common bottleneck. Coalesce reads so that a warp pulls from contiguous addresses. The hardware is kind when lanes move together. Prefer structure of arrays to array of structures, most of the time, because fields you touch together should sit together in memory. Keep hot data in shared memory, but treat that space like a crowded kitchen. Name each dish, clean between courses, and avoid bank conflicts the way a chef avoids burns. Where it helps, fuse small kernels, but only when the reduction in DRAM trips outweighs the cost in registers and occupancy.
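As a sketch of the layout point, the two kernels below perform the same update against the two layouts; `Particle`, `step_aos`, and `step_soa` are hypothetical names for illustration.

```cuda
// Coalescing sketch: the same update written against AoS and SoA layouts.
#include <cuda_runtime.h>

struct Particle { float x, y, z, vx, vy, vz; };  // array of structures

// AoS: lane i reads p[i].x, so consecutive lanes touch addresses 24 bytes
// apart and a warp's loads span several memory transactions.
__global__ void step_aos(Particle* p, float dt, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i].x += p[i].vx * dt;
}

// SoA: lane i reads pos_x[i], so a warp touches 32 consecutive floats
// and the loads coalesce into a minimal number of transactions.
__global__ void step_soa(float* pos_x, const float* vel_x, float dt, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) pos_x[i] += vel_x[i] * dt;
}
```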

Occupancy is not a religion, it is a tool. Aim for enough resident warps to hide latency. Past that point, adding more can starve each warp of registers and cache. Use the compiler's report as a map, not an oracle. If a kernel spills registers to local memory, change the shape of work: trim live ranges, precompute a value once per block, or split the kernel at a seam where the data is naturally staged.
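The runtime can report residency directly, so you need not guess. A minimal sketch, assuming a stand-in `my_kernel`, queries `cudaOccupancyMaxActiveBlocksPerMultiprocessor` for a candidate block size:

```cuda
// Reading occupancy from the runtime rather than guessing.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void my_kernel(float* out, int n) {  // stand-in kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] *= 2.0f;
}

int main() {
    int block = 256, max_blocks = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_blocks, my_kernel,
                                                  block, /*dynamic smem*/ 0);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    double occupancy = double(max_blocks * block) /
                       prop.maxThreadsPerMultiProcessor;
    printf("block=%d: %d resident blocks/SM, %.0f%% occupancy\n",
           block, max_blocks, occupancy * 100.0);
    return 0;
}
```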

Threads and blocks define how your problem meets the machine. Keep blocks large enough to occupy at least a warp or two, but shaped to match the pattern of your memory. When reading tiles, map axes so that the fastest moving index in your loop sits with the fastest moving address in memory. Watch for striding that turns a simple walk into a series of expensive leaps.
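A sketch of that mapping for a row-major array, with a hypothetical `scale2d` kernel: `threadIdx.x`, the fastest moving thread index, walks the contiguous dimension.

```cuda
// Index mapping for a row-major 2D array: x spans columns (contiguous),
// y spans rows (strided by `cols`). Names and shapes are illustrative.
#include <cuda_runtime.h>

__global__ void scale2d(float* a, int rows, int cols, float s) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // contiguous in memory
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // strided by `cols`
    if (row < rows && col < cols)
        a[row * cols + col] *= s;  // a warp reads 32 adjacent floats
}

// Launch shape matches the layout:
// dim3 block(32, 8);
// dim3 grid((cols + 31) / 32, (rows + 7) / 8);
// scale2d<<<grid, block>>>(d_a, rows, cols, 2.0f);
```

Swapping the roles of `x` and `y` here is exactly the striding mistake the paragraph warns about: each lane would leap `cols` floats per step.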

Atomics can protect correctness, and they can slow a program to a crawl when contention climbs. Reduce where you can. Summarize within a warp using shuffle intrinsics, then within a block using shared memory, then across blocks with a small number of atomics. When a global counter must be updated by many threads, consider a tree of buffers rather than a single line at a single door.
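A minimal sketch of that hierarchy, assuming a hypothetical `block_sum` kernel and a block size that is a multiple of 32:

```cuda
// Hierarchical sum: shuffle within the warp, shared memory within the
// block, one atomic per block. *total must be zeroed before launch.
#include <cuda_runtime.h>

__global__ void block_sum(const float* in, float* total, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;

    // Warp level: fold in a neighbor's value, log2(32) steps.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);

    // Block level: lane 0 of each warp parks its partial in shared memory.
    __shared__ float warp_partials[32];  // enough for 1024 threads
    int lane = threadIdx.x % 32, warp = threadIdx.x / 32;
    if (lane == 0) warp_partials[warp] = v;
    __syncthreads();

    // First warp reduces the partials; then one atomic per block.
    if (warp == 0) {
        int nwarps = blockDim.x / 32;
        v = (lane < nwarps) ? warp_partials[lane] : 0.0f;
        for (int offset = 16; offset > 0; offset >>= 1)
            v += __shfl_down_sync(0xffffffff, v, offset);
        if (lane == 0) atomicAdd(total, v);
    }
}
```

With 256-thread blocks, contention drops from one atomic per element to one per 256 elements.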

Control flow is another quiet source of waste. Divergence within a warp means lanes take turns rather than walk together. Sometimes the data demands it. Often a small restructure removes the conditional from the inner loop, or replaces it with predication, letting all lanes execute the same instructions and mask the results.
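A small sketch of the shape, with two illustrative ReLU kernels. In practice the compiler often predicates branches this short on its own, so treat this as the pattern, and measure before restructuring anything larger.

```cuda
// Divergence versus predication, side by side. Both kernels illustrative.
#include <cuda_runtime.h>

__global__ void relu_branchy(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (x[i] < 0.0f)          // divergent when a warp straddles zero
            x[i] = 0.0f;
    }
}

__global__ void relu_predicated(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] = fmaxf(x[i], 0.0f); // all lanes execute; result selected
}
```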

Precision is a lever with costs. Half precision can double arithmetic throughput on the right hardware and halve memory traffic, but only when the error bars remain narrow. Keep accumulators at higher precision where sums grow long. Measure the end effect, not only the kernel's speed, because a fast wrong answer will teach you the wrong lessons.
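A sketch of that discipline, assuming a hypothetical `dot_fp16` and a device with native half support: half-precision inputs to halve the traffic, a float accumulator for the long sum.

```cuda
// Mixed precision: fp16 inputs, fp32 accumulation.
// *result must be zeroed before launch; the final atomicAdd is coarse
// (see the reduction sketch above for the finer-grained version).
#include <cuda_fp16.h>
#include <cuda_runtime.h>

__global__ void dot_fp16(const __half* a, const __half* b,
                         float* result, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    float acc = 0.0f;  // accumulate in fp32, not fp16
    for (int k = i; k < n; k += stride)
        acc += __half2float(a[k]) * __half2float(b[k]);

    atomicAdd(result, acc);
}
```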

Finally, measure with a clock you trust. Warm the device. Run enough iterations to rise above jitter. Use Nsight or rocprof to see where stalls live. Change one thing at a time. Write down the result in a lab book the whole team can read. Performance work is not a sprint toward a single clever trick. It is a patient conversation with the machine, a steady narrowing of the gap between what the silicon can do and what your code invites it to do today.
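A minimal timing harness along those lines, using CUDA events; the kernel, sizes, and iteration count are placeholders for whatever you are measuring.

```cuda
// Warm up, run many iterations, synchronize before reading the clock.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel(float* x, int n) {  // stand-in for the kernel under test
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main() {
    const int n = 1 << 24, iters = 100;
    float* d; cudaMalloc(&d, n * sizeof(float));

    kernel<<<(n + 255) / 256, 256>>>(d, n);   // warm the device
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int it = 0; it < iters; ++it)        // rise above jitter
        kernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%.3f ms per launch\n", ms / iters);
    cudaFree(d);
    return 0;
}
```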

Case notes

Roofline analysis gives teams a shared visual language to reason about whether to chase memory or math first.

Memory coalescing and avoiding bank conflicts are perennial wins in CUDA programming; NVIDIA's best practices consolidate the patterns and common pitfalls.
