Most production GPU stories begin with a shiny profiler screenshot and end with a pager going off at night. The graph said "100% utilization," the dashboard said "we're fine," and yet the latency budget kept missing its marks in the wild. That contradiction is not a failure of tools; it is a misreading of what those tools actually measure and how traffic behaves when humans point workloads at it.
The first lie is the average. Aggregated utilization can sit at 90–100% while the user experience drifts because the distribution under the average keeps changing. Bursty arrivals, tail-heavy distributions, and queue coupling across kernels can turn healthy means into painful p95s. When the queue spikes, your batcher overfills, the kernel runs longer than the tight loop you profiled, and downstream stages wait just long enough to miss their own windows.
The second lie is the wrong box. Teams profile a single kernel on a developer workstation, then extrapolate to a multi-tenant production box with NUMA effects, PCIe contention, mixed precision strategies, and neighbor jobs fighting for time slices. Your "100%" on a toy workload doesn't include the real DMA, host-to-device, and device-to-host path that customers actually trace.
The fix is not hero kernels; it's an honest pipeline map and a bias toward measurements that match the user promise. Start with one statement: "We promise N tokens per second at p95 under M concurrent streams." Now build the measurement harness to that promise. If you don't have a real harness, the queue will keep lying and the kernel graphs will keep flattering you.
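A minimal sketch of such a harness in Python, here asserting a p95 latency budget under a fixed number of concurrent streams; swap in tokens per second or whatever your promise actually names. The budget, the stream count, and the `send_request` stub are placeholders, and a real harness would replay recorded traffic instead of a sleep:

```python
import statistics
import threading
import time

# Sketch of a promise harness, not a full load generator. `send_request` is a
# placeholder for whatever issues one real inference call against your stack;
# the budget and counts below are invented.
P95_BUDGET_S = 0.250          # "we promise p95 <= 250 ms"
CONCURRENT_STREAMS = 8        # "under M concurrent streams"
REQUESTS_PER_STREAM = 50

def run_stream(send_request, latencies, lock):
    for _ in range(REQUESTS_PER_STREAM):
        start = time.perf_counter()
        send_request()                      # one real request, real payload
        elapsed = time.perf_counter() - start
        with lock:
            latencies.append(elapsed)

def measure_p95(send_request):
    latencies, lock = [], threading.Lock()
    threads = [
        threading.Thread(target=run_stream, args=(send_request, latencies, lock))
        for _ in range(CONCURRENT_STREAMS)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # quantiles(n=20)[18] is the 95th percentile cut point
    return statistics.quantiles(latencies, n=20)[18]

if __name__ == "__main__":
    p95 = measure_p95(send_request=lambda: time.sleep(0.01))  # stub workload
    assert p95 <= P95_BUDGET_S, f"p95 {p95:.3f}s blew the {P95_BUDGET_S}s budget"
```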
What to measure, specifically:
- Arrival processes. Capture inter-arrival times and variance for real tenants, not a synthetic Poisson. Regulations and human scheduling create visible daily and weekly spikes that shift effective throughput.
- Queue depth and time in queue per stage. A simple histogram per stage will reveal whether your bottleneck is compute, memory movement, or the innocent little pre-processor that keeps stalling on the filesystem.
- Occupancy and achieved occupancy. Don't stop at theoretical occupancy; track how many warps are actually resident for the specific register and shared memory usage you picked. That one struct you kept "for clarity" can push you over a register file boundary.
- HtoD/DtoH overlaps. If your timeline shows serialized transfers and kernels, you're losing free speed. Multiple CUDA streams or equivalent can hide transfer time under compute if your access patterns are sane.
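On that last point, the mechanics are less mysterious than the timeline makes them look. A minimal sketch of copy/compute overlap, assuming PyTorch on a CUDA device; `model` and `host_batches` are hypothetical stand-ins, and a production input pipeline would add bounded queues and error handling:

```python
import torch

# Sketch: hide host-to-device copies under compute using pinned memory,
# non_blocking copies, and a dedicated copy stream (assumes PyTorch + CUDA).
def run_overlapped(model, host_batches):
    copy_stream = torch.cuda.Stream()
    compute_stream = torch.cuda.current_stream()
    outputs, pending = [], None

    for host_batch in host_batches:
        pinned = host_batch.pin_memory()                  # page-locked staging copy
        with torch.cuda.stream(copy_stream):
            on_device = pinned.to("cuda", non_blocking=True)

        if pending is not None:
            prev_pinned, prev_on_device = pending
            prev_on_device.record_stream(compute_stream)  # allocator: also used on this stream
            outputs.append(model(prev_on_device))         # overlaps with the copy above

        compute_stream.wait_stream(copy_stream)           # later compute waits for this copy
        pending = (pinned, on_device)                     # keep references alive until consumed

    if pending is not None:
        outputs.append(model(pending[1]))
    return outputs
```

Trace it under Nsight Systems and the copy lane should slide under the compute lane instead of alternating with it.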
Two production anecdotes that changed my mind:
- We chased a "slow model" for a week before realizing that the batcher's micro-timeout, added "just to be safe," was tuned for a development traffic pattern. In production, the timeout aligned with a predictable lull, which meant we were shipping near-empty batches right before the surge. Removing the safety timer and switching to a dynamic target batch size (with a ceiling) brought p95 back into budget. The model was never the villain; the queueing policy was.
- A beloved preprocessing step, byte-pair encoding on the CPU with a legacy implementation, looked cheap in isolation and then dominated end-to-end time under concurrency. The GPU sat politely at 60% while CPU threads fought over caches and the allocator went sightseeing. Moving tokenization on-device and collapsing two passes into one reclaimed more throughput than the next two "kernel optimizations" combined.
Concretely, here is a sequencing plan that has worked across several teams:
- Put the user promise in code: a small smoke test that asserts p95 for a known workload on a known shape, run nightly. If it fails, the build is red. No debates.
- Build a thin, honest dashboard: arrivals, queue depth, p50/p95 latency per stage, achieved occupancy, and transfer overlap percentage. Five charts, no gradients, just numbers you can read in the war room. A per-stage recorder like the sketch below this plan is enough to feed it.
- Attack memory movement first. Kernel micro-optimizations are fun, but it is astonishing how often pinned memory, coalesced access, and moving a tiny pre/post step onto the device wins the week. Only then touch the innards.
- Make batching adaptive. Fixed batch sizes are a lie in the presence of bursty workloads. Implement target-size with max-size and a percent-full admission policy that adapts to arrivals; a sketch of that policy also follows this plan.
- Establish a "performance change review" that is blameless and quick. Engineers shouldn't fear changing a kernel because they might miss an edge case; there should be a clear rollback and a culture that treats misses as information, not guilt.
This last point matters more than people admit. Teams that ship fast performance fixes do so because they don't light each other on fire when a fix backfires. Yes, measure hard, but also let people catch their breath and try again. Your throughput will thank you as much as your humans.
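Two of those steps are small enough to sketch. For the dashboard, a per-stage recorder in plain Python is plenty to start with; the stage names and the snapshot shape are placeholders for whatever your charts expect:

```python
import statistics
import threading
import time
from collections import defaultdict, deque
from contextlib import contextmanager

# Sketch of the "thin, honest dashboard" backend: per-stage time samples with
# p50/p95, plus a queue-depth gauge. Wire snapshot() into whatever charting
# you already have; nothing here is specific to any metrics product.
class StageMetrics:
    def __init__(self, window=10_000):
        self._lock = threading.Lock()
        self._samples = defaultdict(lambda: deque(maxlen=window))  # stage -> seconds
        self._queue_depth = defaultdict(int)

    @contextmanager
    def timed(self, stage):
        start = time.perf_counter()
        try:
            yield
        finally:
            with self._lock:
                self._samples[stage].append(time.perf_counter() - start)

    def set_queue_depth(self, stage, depth):
        with self._lock:
            self._queue_depth[stage] = depth

    def snapshot(self):
        with self._lock:
            out = {}
            for stage, samples in self._samples.items():
                data = list(samples)
                if len(data) >= 20:  # need enough samples for a meaningful p95
                    out[stage] = {
                        "p50_s": statistics.median(data),
                        "p95_s": statistics.quantiles(data, n=20)[18],
                        "queue_depth": self._queue_depth[stage],
                    }
            return out

metrics = StageMetrics()

# Usage: wrap each pipeline stage.
with metrics.timed("tokenize"):
    time.sleep(0.002)  # stand-in for real work
```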
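And for adaptive batching, the admission policy fits on a screen. A sketch assuming requests arrive on a thread-safe queue; every constant here is invented and needs tuning against your own arrival traces:

```python
import queue
import time

# Sketch of a target-size / max-size / percent-full admission policy: dispatch
# when the batch is full enough, and never hold the oldest request past a hard
# deadline. All constants are placeholders.
TARGET_BATCH = 16      # dispatch eagerly at this size
MAX_BATCH = 64         # hard ceiling, never exceeded
FULL_ENOUGH = 0.5      # fraction of TARGET_BATCH that justifies an early dispatch
MAX_WAIT_S = 0.005     # soft wait budget for the oldest request

def next_batch(request_queue):
    batch = [request_queue.get()]                    # block until one request exists
    soft_deadline = time.monotonic() + MAX_WAIT_S
    hard_deadline = soft_deadline + MAX_WAIT_S       # absolute cap on waiting
    while len(batch) < MAX_BATCH:
        now = time.monotonic()
        if len(batch) >= TARGET_BATCH:
            break                                    # full enough, go now
        if now >= hard_deadline:
            break                                    # never hold requests forever
        if now >= soft_deadline and len(batch) >= FULL_ENOUGH * TARGET_BATCH:
            break                                    # "percent full" admission
        try:
            batch.append(request_queue.get(timeout=hard_deadline - now))
        except queue.Empty:
            break
    return batch
```

The property that matters is that dispatch reacts to the queue you actually have, not the traffic pattern you had in development.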
Case notes
NVIDIA's best practices for occupancy and memory coalescing remain the most bang-for-buck reading for practitioners. Pair that with your own Nsight Systems traces under real traffic.
Queueing models that incorporate burstiness (e.g., M/G/1 with heavy-tailed service times) map closer to reality in consumer apps than the tidy all-exponential fantasies. Treat your batcher like a first-class service.
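If you want a quick feel for why the service-time tail matters, the Pollaczek–Khinchine formula for an M/G/1 queue already tells the story: mean wait grows with the second moment of service time, so two services with the same mean but different tails queue very differently. A toy calculation with invented numbers:

```python
# Pollaczek-Khinchine mean-wait formula for M/G/1:
#   Wq = lam * E[S^2] / (2 * (1 - rho)),  with rho = lam * E[S]
# Both services below have the same 10 ms mean; only the second moment differs.
lam = 80.0                      # arrivals per second (invented)
mean_s = 0.010                  # 10 ms mean service time
rho = lam * mean_s              # utilization = 0.8 in both cases

es2_exponential = 2 * mean_s ** 2              # exponential: E[S^2] = 2 * mean^2
es2_heavy = mean_s ** 2 + 10 * mean_s ** 2     # hypothetical heavy tail: 10x the variance

def pk_mean_wait(lam, es2, rho):
    return lam * es2 / (2 * (1 - rho))

print(f"exponential service:  mean wait {pk_mean_wait(lam, es2_exponential, rho) * 1e3:.1f} ms")
print(f"heavy-tailed service: mean wait {pk_mean_wait(lam, es2_heavy, rho) * 1e3:.1f} ms")
```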