## What does CUDA actually stand for?

CUDA stands for Compute Unified Device Architecture. NVIDIA introduced it in 2006 as a parallel computing platform and programming model that allows developers to use NVIDIA GPUs for general-purpose computing — not just graphics rendering. The “Unified” in CUDA refers to the unification of the GPU’s programmable shader processors into a single pool of general-purpose compute units (CUDA cores).

Before CUDA, GPU programming required mapping computational problems onto graphics operations (vertex shaders, pixel shaders). CUDA removed this constraint by providing a C-like programming interface that treats the GPU as a massively parallel processor.

## Is CUDA only for NVIDIA?

Yes. CUDA is proprietary to NVIDIA hardware. Code written using CUDA’s programming model, libraries (cuBLAS, cuDNN, cuFFT), and runtime API runs exclusively on NVIDIA GPUs. This hardware lock-in is the most significant practical consequence of CUDA’s dominance in AI: the vast majority of AI frameworks (PyTorch, TensorFlow, JAX) are optimised primarily for CUDA, which means they run best — and sometimes only — on NVIDIA hardware.

The alternatives:

| Platform | Vendor | GPU Support | AI Framework Maturity |
| --- | --- | --- | --- |
| CUDA | NVIDIA | NVIDIA only | Dominant — all major frameworks |
| ROCm | AMD | AMD only | Improving — PyTorch support, partial TensorFlow |
| oneAPI/SYCL | Intel | Intel, with cross-vendor ambitions | Early — limited framework integration |
| Metal | Apple | Apple Silicon | Limited — MLX framework only |
| OpenCL | Khronos Group | Cross-vendor | Legacy — minimal AI framework support |

For a detailed comparison of CUDA against OpenCL and emerging alternatives, our analysis of GPU programming platforms covers the performance and portability tradeoffs.

## Why does CUDA dominate AI?

CUDA’s dominance is not purely technical — it is an ecosystem effect. NVIDIA invested heavily in AI-specific libraries (cuDNN for neural network primitives, TensorRT for inference optimisation, NCCL for multi-GPU communication) years before competitors. These libraries are deeply integrated into AI frameworks: when PyTorch executes a convolution operation, it calls cuDNN, which calls CUDA, which runs on NVIDIA hardware. Replacing any layer of this stack requires replacing the entire stack.

The practical implication for AI practitioners: if your work uses PyTorch or TensorFlow, you are using CUDA whether or not you write CUDA code directly. The frameworks abstract the GPU programming interface, but the underlying compute path is CUDA. This abstraction is why most AI engineers never write CUDA code but are still dependent on NVIDIA hardware.

Our assessment: CUDA lock-in is a real constraint that increases hardware costs (NVIDIA GPUs command premium pricing) and limits vendor choice. For most AI teams, the software ecosystem advantages outweigh this cost. For teams with the engineering capacity to invest in platform portability, ROCm on AMD hardware offers a viable — if less mature — alternative at lower hardware cost.

## What does a CUDA programmer actually write?

CUDA programming involves writing “kernel” functions that execute on the GPU in parallel across thousands of threads. A simple CUDA kernel looks like C code with annotations that specify how work is distributed across GPU cores. The CUDA toolkit includes the nvcc compiler, which turns these kernels into GPU-executable code.
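To make that concrete, here is a minimal sketch of the classic vector-addition kernel. The names and parameters (`vector_add`, `n`, the 256-thread block size) are illustrative choices for this example, not anything mandated by CUDA:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// __global__ marks a kernel: a function that runs on the GPU,
// launched across many threads in parallel.
__global__ void vector_add(const float *a, const float *b, float *c, int n) {
    // Each thread derives its own global index from the block and
    // thread coordinates supplied by the CUDA runtime.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {              // guard: the grid may overshoot n
        c[i] = a[i] + b[i];   // one addition per thread
    }
}

int main() {
    const int n = 1 << 20;    // one million elements (illustrative)
    size_t bytes = n * sizeof(float);

    // Managed memory is visible to both CPU and GPU.
    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // Launch enough 256-thread blocks to cover all n elements.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vector_add<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();  // wait for the GPU to finish

    printf("c[0] = %f\n", c[0]);  // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

The `__global__` qualifier and the `<<<blocks, threads>>>` launch syntax are the annotations referred to above: the function body is ordinary C, while the launch configuration specifies how the work is distributed across GPU cores. Compiling with `nvcc` produces a binary that runs this kernel across roughly a million threads.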
Most AI practitioners do not write CUDA kernels directly. Framework developers (PyTorch, TensorFlow), library developers (cuDNN, FlashAttention), and hardware vendors write the CUDA kernels that AI workloads execute. End users interact with Python APIs that trigger kernel execution transparently. Understanding what CUDA is and how it works helps AI engineers diagnose performance issues and understand hardware constraints — even when they never write a line of CUDA code.

## How has CUDA evolved since its introduction?

CUDA has progressed through multiple compute capability versions, each adding features that expand what GPU programs can do. The evolution reflects NVIDIA’s strategy of tying hardware capabilities to software features:

- **Compute Capability 3.5 (Kepler, 2012):** Introduced dynamic parallelism — GPU kernels can launch other kernels without CPU involvement. This enabled recursive algorithms and adaptive workloads on the GPU.
- **Compute Capability 7.x (Volta, 2017):** Introduced Tensor Cores — specialised matrix multiplication units that accelerate deep learning operations by 4–8× compared to standard CUDA cores.
- **Compute Capability 8.x (Ampere, 2020):** Added sparsity support in Tensor Cores (2:4 structured sparsity for 2× throughput) and TF32 precision, a 19-bit format that keeps FP32 range at reduced precision.
- **Compute Capability 9.0 (Hopper, 2022):** Added the Transformer Engine (automatic FP8/FP16 precision management) and Thread Block Clusters (hardware-supported inter-SM communication).

Each generation adds capabilities that require new CUDA toolkit versions to access. This creates a version dependency chain: a specific GPU requires a minimum CUDA version, which requires a minimum driver version, which may require a specific OS kernel version. Managing these dependencies is one of the most common sources of setup friction in AI development environments.

We maintain tested dependency matrices for each hardware configuration we deploy. The time invested in validating these configurations prevents the hours-long debugging sessions that result from version mismatches — a common frustration for teams new to GPU computing.
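When debugging such mismatches, it helps to query what a machine actually reports rather than trusting the install notes. A minimal sketch using standard CUDA runtime calls (the output wording is our own):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // The version the driver supports vs. the runtime the program was
    // built against; both are encoded as 1000*major + 10*minor.
    int driver_version = 0, runtime_version = 0;
    cudaDriverGetVersion(&driver_version);
    cudaRuntimeGetVersion(&runtime_version);
    printf("Driver supports CUDA %d.%d, runtime is CUDA %d.%d\n",
           driver_version / 1000, (driver_version % 100) / 10,
           runtime_version / 1000, (runtime_version % 100) / 10);

    // Compute capability of each visible GPU, e.g. 9.0 for Hopper.
    int device_count = 0;
    cudaGetDeviceCount(&device_count);
    for (int d = 0; d < device_count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("GPU %d: %s, compute capability %d.%d\n",
               d, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```

If the runtime version is newer than what the driver supports, kernel launches fail before any work runs — the dependency chain described above surfacing at runtime rather than at install time.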