We find that GPU profiling is the only reliable way to know where time goes. Without it, optimization is guesswork. The problem is that NVIDIA's profiling ecosystem has multiple overlapping tools, and choosing the wrong one for the question you're asking wastes time and produces misleading conclusions. This article covers the three tools that matter for most GPU workloads — Nsight Systems, Nsight Compute, and the older nvprof — and gives a practical workflow for moving from "my GPU workload is slow" to "here is the specific bottleneck and what to do about it."

## The Three Tools and What They Answer

| Tool | Question Answered | Granularity | Overhead |
|------|-------------------|-------------|----------|
| Nsight Systems | Where does time go across the full system? | Timeline, API calls, CPU/GPU overlap | Low |
| Nsight Compute | Why is a specific kernel slow? | Per-kernel hardware metrics | High |
| nvprof (deprecated) | How long does each kernel take? | Kernel-level | Medium |

The decision tree is simple: start with Nsight Systems for system-level understanding, then drill into specific kernels with Nsight Compute. Never start with Nsight Compute — it imposes significant overhead and you won't know which kernels deserve attention.

## Nsight Systems: The Starting Point

Nsight Systems captures a timeline of everything: CPU threads, CUDA API calls, kernel executions, memory transfers, and NVTX annotations. It answers questions like:

- Is the GPU idle while the CPU prepares data?
- Is memory transfer overlapping with compute?
- Is there a sequence of small kernels where launch overhead dominates?
- Which kernels consume the most wall time?

Basic command-line usage:

```bash
nsys profile --trace=cuda,nvtx,osrt \
    --output=report \
    python train.py
```

Open the `.nsys-rep` file in the Nsight Systems GUI. The timeline view immediately shows CPU/GPU overlap (or lack of it). Look for:

- GPU idle gaps between kernels — often indicate CPU-side data preparation or Python overhead
- PCIe transfers (HtoD/DtoH) that block kernel execution
- A long tail of tiny kernels — operator fusion or batching may help

In our experience, the most common finding at this stage is not a slow kernel — it's excessive CPU-GPU synchronization or synchronous memory transfers that could be pipelined.

## Nsight Compute: Kernel-Level Diagnosis

Once Nsight Systems identifies which kernels dominate runtime, Nsight Compute provides per-kernel hardware metrics: memory throughput, compute throughput, occupancy, warp stall reasons, and instruction-level data.

Basic command-line usage:

```bash
ncu --set full \
    --target-processes all \
    --output report \
    python train.py
```

The `--set full` flag collects all metric groups but is expensive — for large workloads, use `--kernel-name` to target a specific kernel.

### Reading the Roofline

Nsight Compute's Roofline chart plots a kernel's achieved compute throughput (FLOP/s) against its arithmetic intensity (FLOPs per byte of memory traffic), relative to the hardware's compute and memory-bandwidth ceilings. The position tells you the bottleneck type:

- Below the memory-bandwidth ceiling, left of the ridge point: memory-bound — optimize memory access patterns, reduce HBM round-trips
- Below the compute ceiling, right of the ridge point: compute-bound — improve arithmetic intensity, reduce redundant work
- Far below both ceilings: launch-bound or occupancy-limited — check block/grid configuration

The ridge point is peak compute divided by peak bandwidth: on an A100, roughly 19.5 TFLOP/s of FP32 divided by 2 TB/s gives about 10 FLOPs per byte, so a kernel performing fewer than roughly 10 FLOPs per byte of HBM traffic is memory-bound on that hardware.

### Common Bottleneck Patterns

- **Warp stall on memory (LG Throttle):** Global memory loads are stalling the pipeline. Check memory access coalescing: consecutive threads should access consecutive addresses (a sketch of coalesced versus strided access follows this list).
- **Low occupancy:** Too many registers per thread or too much shared memory per block prevents the scheduler from keeping enough warps in flight to hide latency. Use `--ptxas-options=-v` during compilation to see register counts.
- **High L2 miss rate:** Data is not reusing the L2 cache. Tiling or blocking may improve locality (the tiled-transpose sketch below shows the idea).
- **Unbalanced SM utilization:** Some SMs finish much earlier than others. Usually caused by irregular work distribution — consider load balancing across blocks.
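To make the coalescing advice concrete, here is a minimal CUDA sketch, with hypothetical kernel names, contrasting a coalesced access pattern with a strided one. Profiled side by side in Nsight Compute, the strided version typically shows far lower memory throughput and prominent LG Throttle stalls.

```cuda
// Coalesced: consecutive threads read consecutive addresses, so each
// warp's 32 loads combine into a few wide memory transactions.
__global__ void copy_coalesced(const float* __restrict__ in,
                               float* __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads touch addresses `stride` elements apart,
// so each load in a warp can land in a different cache line. Expect
// Nsight Compute to report low memory throughput and LG Throttle stalls.
__global__ void copy_strided(const float* __restrict__ in,
                             float* __restrict__ out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}
```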
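For the L2/locality pattern, the standard illustration is staging data through shared memory. Below is a sketch of a tiled matrix transpose; the kernel name, 32x32 tile size, and row-major layout are illustrative assumptions, not details from any profiled workload.

```cuda
#define TILE 32

// Tiled transpose: each block stages a TILE x TILE tile through shared
// memory so that both the global read and the global write are coalesced,
// and the tile is reused from on-chip storage instead of re-fetched from HBM.
// Launch with dim3 block(TILE, TILE) and a grid covering the input matrix.
__global__ void transpose_tiled(const float* __restrict__ in,
                                float* __restrict__ out,
                                int width, int height) {
    __shared__ float tile[TILE][TILE + 1];  // +1 padding avoids bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];

    __syncthreads();

    // Swap block indices so the write side is also coalesced.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];
}
```

Without the padding column, the column-wise reads in the second phase would serialize on shared-memory bank conflicts, which Nsight Compute surfaces as shared-memory stall reasons.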
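The checklist below assumes NVTX annotations are already in the code. A minimal sketch using the NVTX C API that ships with the CUDA toolkit; the range names and function are hypothetical placeholders for your own phases:

```cuda
#include <nvtx3/nvToolsExt.h>  // NVTX 3 header (older toolkits: <nvToolsExt.h>)

void train_step() {
    nvtxRangePushA("data_prep");   // shows up as a labeled span on the timeline
    // ... host-side batch preparation ...
    nvtxRangePop();

    nvtxRangePushA("forward");
    // ... kernel launches for the forward pass ...
    nvtxRangePop();
}
```

With `--trace=nvtx` enabled (as in the `nsys` command earlier), these ranges appear as named regions in the Nsight Systems timeline, making GPU idle gaps easy to attribute to a specific phase.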
## Profiling Workflow Checklist

1. **Build with debug symbols:** `nvcc -lineinfo` enables source correlation in Nsight Compute.
2. **Add NVTX annotations** to your code to label regions in the Nsight Systems timeline (see the sketch above).
3. **Run Nsight Systems first** — identify the top three kernels by wall time.
4. **Check CPU/GPU overlap** — is the GPU ever idle waiting for the CPU?
5. **Run Nsight Compute on the top kernels only** — avoid profiling the entire workload.
6. **Check the Roofline** — determine whether each kernel is memory-bound or compute-bound.
7. **Check warp stall reasons** — these directly indicate which hardware resource is the constraint.
8. **Verify changes with a second profile** — never assume an optimization helped without measurement.

## Interpreting Memory Throughput Numbers

Nsight Compute reports memory throughput as a percentage of peak. A kernel achieving 60% of peak HBM bandwidth on an A100 (2 TB/s peak, so roughly 1,200 GB/s) is doing well — in practice, 70–80% is achievable only for very regular access patterns on fully coalesced kernels. Compute-bound kernels typically show 5–20% memory utilization, which is expected. For reference:

| GPU | Peak HBM Bandwidth | Typical Achievable |
|-----|--------------------|--------------------|
| NVIDIA A100 80GB | 2,000 GB/s | 1,400–1,600 GB/s |
| NVIDIA H100 SXM | 3,350 GB/s | 2,400–2,800 GB/s |
| NVIDIA RTX 4090 | 1,008 GB/s | 700–850 GB/s |

These are rough practical ceilings — exact numbers depend on access patterns and kernel characteristics.

## Connecting Profiling to Optimization Strategy

Profiling data should directly dictate the optimization path. Memory-bound kernels need better access patterns, fusion, or caching. Compute-bound kernels need algorithmic restructuring or conversion to lower-precision arithmetic. Launch-bound workloads need larger batch sizes or kernel consolidation. The hub article *How to Profile GPU Kernels to Find the Real Bottleneck* covers the full decision framework, including when profiling data suggests architectural changes rather than kernel-level fixes.

## Looking Ahead

Use Nsight Systems first, always. It gives a system-level timeline with low overhead and quickly reveals whether the bottleneck lies in kernels, memory transfers, or CPU overhead. Then use Nsight Compute to diagnose specific slow kernels using the Roofline model and warp stall analysis. Profile before and after every optimization: unverified optimizations frequently degrade performance on hardware different from the development machine.
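As a lightweight complement to a full re-profile, CUDA events can sanity-check a before/after comparison without Nsight overhead. A self-contained sketch; the kernel, sizes, and repeat count are placeholders:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the kernel under test.
__global__ void my_kernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 24;
    float* d;
    cudaMalloc(&d, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    my_kernel<<<(n + 255) / 256, 256>>>(d, n);  // warm-up, excluded from timing

    cudaEventRecord(start);
    for (int i = 0; i < 100; ++i)               // average over repeats to reduce noise
        my_kernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("avg kernel time: %.3f ms\n", ms / 100.0f);

    cudaFree(d);
    return 0;
}
```

This measures wall time only; it confirms whether a change helped on one machine, but the Nsight profiles remain the source of truth for why.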