Not every GPU runs CUDA. CUDA is NVIDIA's proprietary platform, and it only works on NVIDIA hardware that meets specific architectural requirements. Understanding what makes a GPU "CUDA-capable", and which capability tier it belongs to, directly affects which features you can use, which libraries are supported, and what performance you can realistically expect.

## What Makes a GPU CUDA-Capable

A CUDA-capable GPU is any NVIDIA GPU with compute capability 1.0 or higher. Practically speaking, anything NVIDIA has sold since approximately 2006 qualifies. But compute capability 1.x hardware is so old it is irrelevant today: modern CUDA development targets compute capability 7.0 (Volta) at minimum, and most production AI workloads require 8.0 (Ampere) or higher for full tensor core access.

The compute capability version is a two-number designation (major.minor) that encodes the architectural feature set available on a given GPU:

| Compute Capability | Architecture | Key Features Added |
|---|---|---|
| 7.0 | Volta (V100) | Tensor Cores (1st gen), independent thread scheduling |
| 7.5 | Turing (T4, RTX 20xx) | INT8/INT4 tensor cores, ray tracing RT cores |
| 8.0 | Ampere (A100) | BF16 tensor cores, TF32, async memory copy, MIG |
| 8.6 | Ampere (A10, A30, RTX 30xx) | Ampere consumer/datacenter variant |
| 8.9 | Ada Lovelace (RTX 40xx, L4, L40) | FP8 tensor cores, 4th gen tensor cores |
| 9.0 | Hopper (H100) | Native FP8, NVLink 4.0, Transformer Engine |

The compute capability determines which CUDA intrinsics, PTX instructions, and hardware-accelerated features are available. Libraries like cuDNN and TensorRT query compute capability at initialization and load code paths specific to the detected hardware.

## The CUDA Software Stack

CUDA is not just a runtime; it is a full software stack sitting between your application code and the GPU hardware:

```
Application (Python, C++, Fortran)
        ↓
Framework layer (PyTorch, TensorFlow, JAX)
        ↓
CUDA Libraries (cuDNN, cuBLAS, cuFFT, NCCL)
        ↓
CUDA Runtime (cudart) + CUDA Driver API
        ↓
NVIDIA Driver (kernel module)
        ↓
GPU Hardware (SM, HBM, NVLink)
```

Each layer adds abstraction. Most engineers working in PyTorch or TensorFlow never touch the CUDA Runtime directly; the framework manages kernel launches, memory allocation, and stream synchronization on their behalf. Direct CUDA programming (writing `__global__` kernels in C++) is only necessary when the framework's operator coverage doesn't meet your requirements, or when you need precise control over memory layout and execution scheduling.

## Streaming Multiprocessors: The Core Architectural Unit

The SM (Streaming Multiprocessor) is the fundamental compute unit of a CUDA GPU. All CUDA thread blocks execute on SMs: one block runs on exactly one SM at a time, and one SM can run multiple blocks simultaneously if resources allow.

Each SM contains:

- CUDA cores (FP32 and INT32 execution units)
- Tensor Cores (for matrix multiply-accumulate operations)
- Register file (64K 32-bit registers on Ampere)
- L1 cache / shared memory (a configurable split on modern architectures)
- Warp schedulers (4 per SM on Ampere)

The number of SMs per GPU varies significantly across product tiers:

| GPU | SMs | CUDA Cores | Tensor Cores |
|---|---|---|---|
| NVIDIA A10 | 72 | 9,216 | 288 |
| NVIDIA A100 80GB | 108 | 6,912 | 432 |
| NVIDIA H100 SXM | 132 | 16,896 | 528 |
| NVIDIA RTX 4090 | 128 | 16,384 | 512 |

Note that CUDA core count alone is misleading: the A100 pairs a high SM count with fewer FP32 cores per SM, a balance optimized for HPC workloads, while the H100 increases both SM count and per-SM throughput. For AI training, tensor core count and HBM bandwidth matter more than CUDA core count. A minimal kernel sketch follows below.
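To make the block-to-SM mapping concrete, here is a minimal sketch of a hand-written CUDA kernel, roughly the kind of `__global__` kernel mentioned above. The `saxpy` name, the one-million-element size, and the 256-thread block size are illustrative choices rather than anything the platform mandates; the point is that a launch divides work into blocks, and the hardware scheduler assigns each block to a single SM.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread handles one element. Blocks are scheduled onto SMs
// independently, which is why block count should exceed SM count.
__global__ void saxpy(float a, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;  // 1M elements (illustrative size)
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));  // unified memory for brevity
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    int threadsPerBlock = 256;  // a common, occupancy-friendly choice
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    saxpy<<<blocks, threadsPerBlock>>>(2.0f, x, y, n);  // one block runs on one SM at a time
    cudaDeviceSynchronize();  // wait for the kernel to finish

    printf("y[0] = %.1f\n", y[0]);  // expect 4.0
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

With these numbers the launch produces 4,096 blocks; on a 108-SM A100, the scheduler keeps every SM occupied and feeds in new blocks as earlier ones retire, which is the latency-hiding behavior described next.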
## What CUDA Enables That CPU Code Cannot Match

The value proposition of CUDA hardware is parallelism at a scale that CPU architectures cannot match for data-parallel workloads:

- **Memory bandwidth:** An H100 SXM provides 3.35 TB/s of HBM bandwidth. A dual-socket CPU system with DDR5 provides roughly 500 GB/s. For memory-bound kernels, that is more than a 6x bandwidth advantage before considering compute.
- **Throughput for dense linear algebra:** Matrix multiplication at FP16/BF16 precision on H100 tensor cores reaches approximately 2,000 TFLOPS (with sparsity). A high-end CPU manages roughly 5–10 TFLOPS. This is the gap that makes GPU training practical for large models.
- **Concurrent execution:** The warp scheduler hides memory latency by switching between warps while one waits on a memory load. A CPU core stalls, or leans on out-of-order execution for the same purpose, with far fewer parallel threads in flight.

## Checking CUDA Capability Programmatically

```python
import torch

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"Device {i}: {props.name}")
        print(f"  Compute capability: {props.major}.{props.minor}")
        print(f"  Total memory: {props.total_memory / 1e9:.1f} GB")
        print(f"  SM count: {props.multi_processor_count}")
```

The compute capability determines whether your code can use features like FP8 (requires 8.9+), BF16 tensor ops (requires 8.0+), or asynchronous memory copies (`cp.async`, requires 8.0+). A minimal dispatch sketch built on these thresholds appears at the end of this section.

## How Should You Choose a GPU for CUDA Workloads?

Compute capability should be your first filter:

- **Minimum for modern AI workloads: 7.0 (Volta).** Tensor core access, required by most cuDNN operations.
- **Recommended baseline: 8.0 (Ampere).** BF16, TF32, async copy, and MIG support.
- **For large model inference or training: 9.0 (Hopper).** Native FP8, NVLink 4.0, and the Transformer Engine.

The broader API selection question, including when OpenCL or SYCL is more appropriate, is covered in CUDA vs OpenCL vs SYCL: Choosing a GPU Compute API.

## Putting It Together

A CUDA GPU is any NVIDIA GPU with a compute capability designation. The capability version determines feature availability: tensor cores, precision support, memory management features. The SM is the fundamental execution unit, and understanding SM-level resource limits (registers, shared memory, warp slots) is essential for writing kernels that actually saturate the hardware. For most AI and HPC work, compute capability 8.0+ is the practical minimum for accessing the features that make modern GPU compute competitive.
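As a closing sketch of the initialization-time dispatch pattern mentioned earlier (the kind of check libraries like cuDNN perform before selecting code paths), the following example queries compute capability through the CUDA Runtime API and gates feature paths on it. The `use_bf16` and `use_fp8` flags are hypothetical names for illustration, not a real library interface; the capability thresholds mirror the table above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    for (int i = 0; i < deviceCount; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);

        // Encode major.minor as one comparable integer, e.g. 8.6 -> 86.
        int cc = prop.major * 10 + prop.minor;

        // Illustrative feature gates mirroring the capability table above;
        // not a real library API.
        bool use_bf16 = cc >= 80;  // BF16 tensor cores: Ampere (8.0) and newer
        bool use_fp8  = cc >= 89;  // FP8 tensor cores: Ada (8.9) and Hopper (9.0)

        printf("Device %d: %s (CC %d.%d, %d SMs)\n",
               i, prop.name, prop.major, prop.minor, prop.multiProcessorCount);
        printf("  BF16 path: %s, FP8 path: %s\n",
               use_bf16 ? "yes" : "no", use_fp8 ? "yes" : "no");
    }
    return 0;
}
```

On an A100 this sketch would report the BF16 path enabled and the FP8 path disabled, matching the 8.0 capability tier; on an H100 both paths would be enabled.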