CUDA driver and toolkit are two benchmark environment variables, not one

A reproducible AI benchmark discloses both the CUDA driver version and the CUDA toolkit version, because they are two separate components with different installation sources, update cycles, and compatibility rules. Treating them as interchangeable is the most common reason a published benchmark number cannot be reproduced on a “similar” stack, and the root cause of most “my GPU code won’t run” tickets we see. The instruction “install CUDA” papers over the split and produces a steady stream of installation failures, version-mismatch errors, and silent performance regressions.

The two halves:

- CUDA Driver: installed with the NVIDIA GPU driver package. Provides the kernel module and the user-space driver library that every CUDA program depends on. This is the layer that talks to the silicon.
- CUDA Toolkit: the developer tools, namely the nvcc compiler, libraries such as cuBLAS, headers, and profiling tools. cuDNN and NCCL are distributed by NVIDIA as separate packages but are usually installed alongside the toolkit. This is the layer that lets you build CUDA code.

For AI frameworks like PyTorch and TensorFlow, the Toolkit does not need to be installed explicitly when you use pre-built binaries; only the driver does. The frameworks bundle their own CUDA runtime, their own cuBLAS, and their own cuDNN, pinned to specific versions at build time. Understanding the separation is still essential, because the moment something fails, you need to know which layer broke.
Component breakdown

| Component | What it provides | Installed by |
|---|---|---|
| NVIDIA GPU Driver | Hardware interface, kernel module | NVIDIA driver package |
| CUDA Driver API | Low-level GPU control interface | Included with GPU driver |
| CUDA Runtime (libcudart) | Higher-level CUDA API | CUDA Toolkit or framework bundle |
| cuBLAS | Optimised linear-algebra primitives | CUDA Toolkit or framework bundle |
| cuDNN | Optimised deep-learning primitives | Separate package or framework bundle |
| nvcc | CUDA C++ compiler | CUDA Toolkit |
| NCCL | Multi-GPU communication | Separate package or framework bundle |

The asymmetry matters: the driver is host-level state shared across everything on the machine, while the runtime, cuBLAS, and cuDNN can be bundled per-process inside a Python wheel or a container image. This is why two PyTorch installs on the same box can ship different cuDNN versions without conflict, but only one CUDA driver can be loaded into the kernel at a time.

Compatibility rules

The CUDA driver version bundled with the GPU driver sets the ceiling on the CUDA versions you can use. Roughly:

- Driver 525.x → supports up to CUDA 12.0
- Driver 535.x → supports up to CUDA 12.2
- Driver 545.x → supports up to CUDA 12.3
- Driver 550.x → supports up to CUDA 12.4

As a rule, the CUDA Toolkit version you install should be at or below the driver’s maximum. Backward compatibility runs in the expected direction: you can run CUDA 11.x code on a driver that supports CUDA 12.x, but the reverse is not true. A toolkit newer than the driver generally fails at initialisation with a driver-version error, even though the GPU hardware itself is fine. NVIDIA’s minor-version compatibility relaxes this within a major release for some applications; treat that as an exception to verify, not a default. NVIDIA’s published driver-to-CUDA support matrix is the authoritative source for these numbers; the list above is a planning heuristic, not a benchmark, and the specifics drift each release.

For AI frameworks

PyTorch and TensorFlow pre-built binaries bundle the CUDA runtime and cuDNN. You do not need to install the CUDA Toolkit separately for these frameworks. You need exactly two things:

- An NVIDIA GPU driver with a version sufficient for the CUDA version the framework binary was compiled against.
- The framework binary itself, which carries libcudart, cuBLAS, and cuDNN inside the wheel.

The CUDA Toolkit is only required when you are compiling custom CUDA extensions, building a framework from source, or writing your own kernels. For pure PyTorch / TensorFlow inference and training workloads, which cover the vast majority of what we see in production, the Toolkit is unnecessary and its absence eliminates an entire class of version-drift bugs.

How do you verify your setup?

A minimal verification sequence:

    # Check driver version and maximum supported CUDA
    nvidia-smi

    # Check CUDA Toolkit version (if installed)
    nvcc --version

    # Check what CUDA version PyTorch was compiled against
    python -c "import torch; print(torch.version.cuda)"

    # Verify GPU is accessible from PyTorch (the device name is only queried when CUDA is available)
    python -c "import torch; ok = torch.cuda.is_available(); print(ok, torch.cuda.get_device_name(0) if ok else None)"

These four commands separate the four layers (kernel driver, host toolkit, framework-bundled runtime, and runtime device handshake), and that separation is exactly what makes diagnosis tractable. If nvidia-smi succeeds and torch.cuda.is_available() returns False, you know the driver is loaded and the problem is in the framework layer: for example a CPU-only build, or a bundled runtime newer than the driver supports. If nvcc --version reports a different number than torch.version.cuda, you know the host toolkit and the bundled runtime have diverged, which is usually harmless but worth flagging if you also build custom extensions.

The deeper point: correct CUDA stack configuration is not just a functional concern. As “The software stack is a first-class performance component” lays out, driver and runtime versions move measured throughput on the same silicon. Treating CUDA setup as “either it works or it doesn’t” misses the middle case where it works but slowly.

How do you diagnose CUDA version mismatches?

CUDA version mismatches produce three distinct error categories, each pointing at a different layer.

Driver–toolkit incompatibility typically surfaces as CUDA driver version is insufficient for CUDA runtime version (or, where NVIDIA’s forward-compatibility package is in use on unsupported hardware, as forward compatibility was attempted on non supported HW).
Diagnosis: nvidia-smi shows the driver version and the maximum CUDA version it supports; nvcc --version shows the installed toolkit. If the toolkit exceeds the driver’s maximum, programs built with it cannot initialise the GPU. The fix is either upgrading the driver or downgrading the toolkit, and in practice the driver upgrade is almost always the right move, because a newer driver still runs everything built against older CUDA versions.

Toolkit–framework incompatibility surfaces as RuntimeError: CUDA error: no kernel image is available for execution on the device. PyTorch binaries are compiled against a specific CUDA version, and python -c "import torch; print(torch.version.cuda)" reveals which one. This matters mainly when you compile custom extensions against the host toolkit: if the two versions differ by more than a minor version, kernel execution can fail for operations that changed between CUDA versions. The fix is usually to reinstall PyTorch from the wheel that matches your toolkit, not the other way around. The same message also appears when a binary was not built for the GPU’s compute capability, so rule that out before blaming the toolkit.

Runtime / compile-time mismatches are the subtle category, and the one that produces the most confusing tickets. cuDNN and cuBLAS must match the CUDA toolkit version they were built against. Mixing cuDNN from CUDA 11.8 with a CUDA 12.1 toolkit produces undefined-symbol errors that appear only when specific operations are called, not at import time. This is the pattern that costs the most debugging hours, because the program runs fine until the first convolution. We diagnose it by checking library versions explicitly: python -c "import torch; print(torch.backends.cudnn.version())" against the expected version for the installed toolkit.

Our standard practice is to use PyTorch’s bundled CUDA runtime and cuDNN (installed via pip) and ensure the host driver meets the minimum version requirement. This eliminates toolkit management entirely. The only host-level CUDA component becomes the driver, which is the one layer you cannot bundle into a Python wheel anyway.
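The three-way comparison described above (driver maximum, host toolkit, framework build) can be sketched as a small classifier. This is a minimal illustration, not a complete diagnostic: the function names are hypothetical, and the regular expressions assume the usual `CUDA Version: X.Y` field in the nvidia-smi header and the `release X.Y` field in `nvcc --version` output. The functions take captured text rather than calling the tools, so they can be tried on a machine without a GPU.

```python
import re
from typing import List, Optional, Tuple

Version = Tuple[int, int]


def driver_max_cuda(nvidia_smi_output: str) -> Version:
    """Maximum CUDA version the driver supports, read from the nvidia-smi header.

    Assumes the header contains a field of the form 'CUDA Version: 12.2'.
    """
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", nvidia_smi_output)
    if match is None:
        raise ValueError("no 'CUDA Version' field found in nvidia-smi output")
    return int(match.group(1)), int(match.group(2))


def toolkit_cuda(nvcc_output: str) -> Version:
    """Host toolkit version, read from `nvcc --version` output.

    Assumes a line of the form '... release 12.4, V12.4.131'.
    """
    match = re.search(r"release\s+(\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("no 'release X.Y' field found in nvcc output")
    return int(match.group(1)), int(match.group(2))


def framework_cuda(version: str) -> Version:
    """Parse a string such as torch.version.cuda ('11.8') into (major, minor)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def find_mismatches(
    driver_max: Version,
    toolkit: Optional[Version] = None,
    framework: Optional[Version] = None,
) -> List[str]:
    """Return labels for the layers that look inconsistent.

    The thresholds are a heuristic chosen for this sketch: anything above the
    driver maximum is flagged, and toolkit vs framework is flagged only when
    the major versions differ.
    """
    findings = []
    if toolkit is not None and toolkit > driver_max:
        findings.append("driver-toolkit")
    if framework is not None and framework > driver_max:
        findings.append("driver-framework")
    if toolkit is not None and framework is not None and toolkit[0] != framework[0]:
        findings.append("toolkit-framework")
    return findings


# Example with hard-coded, illustrative tool output.
smi_text = "| NVIDIA-SMI 535.104.05   Driver Version: 535.104.05   CUDA Version: 12.2 |"
nvcc_text = "Cuda compilation tools, release 12.4, V12.4.131"
print(find_mismatches(driver_max_cuda(smi_text), toolkit_cuda(nvcc_text), framework_cuda("11.8")))
```

In real use the two text arguments would come from running `nvidia-smi` and `nvcc --version`, and the framework string from `torch.version.cuda`. A flagged pair is a prompt to check NVIDIA's compatibility matrix, not proof of a failure.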
Managing multiple CUDA toolkit versions

Real development environments rarely live with a single CUDA toolkit version for long. One project pins CUDA 11.8 for an older framework build, another requires CUDA 12.4 for the latest features, and a third needs to match whatever the customer’s production cluster runs.

The NVIDIA CUDA toolkit supports parallel installation in versioned directories (/usr/local/cuda-11.8, /usr/local/cuda-12.4) with a symlink at /usr/local/cuda pointing at the active version. Switching versions means updating the symlink and the PATH / LD_LIBRARY_PATH environment variables. Environment Modules (module load cuda/12.4) and conda environments both automate this, and they exist precisely to prevent the most common error: building code against one CUDA version while the runtime environment points at another. The symptom is silent: the build succeeds, the program runs, and only a specific kernel path fails much later.

For containerised workflows, this entire problem disappears. Each container image bundles its required CUDA toolkit version, and multiple containers with different CUDA versions run simultaneously on the same host without conflict, because the host only contributes the driver, and the driver is backward-compatible with older CUDA versions. The one constraint is that the host driver must be recent enough for the newest CUDA version among the containers. This is one reason we recommend container-based deployment for production AI. It removes CUDA version management from the operational surface and pushes it into image-build time, where it can be solved once and frozen.

For any CUDA-using AI benchmark in front of you, does the report pin the host driver and the in-container toolkit together as the runtime pair that determines kernel coverage on your production software stack, or does it name only the toolkit version and leave the driver branch, and the kernel paths it actually loads, implicit?
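For hosts that do keep several toolkits side by side, the versioned-directory layout described in this section can be inspected with a few lines of Python. This is a minimal sketch that assumes the conventional `cuda-<major>.<minor>` directory names and a `cuda` symlink under one root; the function names are hypothetical, and it does not check PATH or LD_LIBRARY_PATH.

```python
import re
from pathlib import Path
from typing import Dict, Optional

# Matches directory names such as 'cuda-11.8' or 'cuda-12.4'.
_VERSIONED_DIR = re.compile(r"cuda-(\d+\.\d+)")


def installed_toolkits(root: str = "/usr/local") -> Dict[str, Path]:
    """Map version strings ('12.4') to the cuda-<version> directories found under root."""
    found: Dict[str, Path] = {}
    root_path = Path(root)
    if not root_path.is_dir():
        return found
    for entry in root_path.iterdir():
        match = _VERSIONED_DIR.fullmatch(entry.name)
        if match and entry.is_dir():
            found[match.group(1)] = entry
    return found


def active_toolkit(root: str = "/usr/local") -> Optional[str]:
    """Version that the <root>/cuda symlink resolves to, or None if it cannot be determined."""
    link = Path(root) / "cuda"
    if not link.exists():
        return None
    # resolve() follows the whole symlink chain, so an indirection such as
    # an alternatives link still ends at the versioned directory.
    match = _VERSIONED_DIR.fullmatch(link.resolve().name)
    return match.group(1) if match else None


# On a host without any toolkit installed this prints an empty list and None.
print(sorted(installed_toolkits()))
print(active_toolkit())
```

Comparing the active version against the version a project was built with is one way to catch the build-versus-runtime divergence before it surfaces as a late kernel failure.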