NVIDIA Linux Driver Installation: Correct Steps for AI Workloads

The driver → CUDA → cuDNN → framework chain is the benchmark environment

A reproducible AI benchmark on Linux discloses the full driver → CUDA → cuDNN → framework chain because each link changes the measured number. Get any link wrong and the failures are confusing — mysterious CUDA errors, framework crashes that mention the wrong subsystem, or, worst of all, silent performance degradation where the workload runs but takes the slow kernel path. The hardware looks fine. nvidia-smi reports the GPU. Training still converges. The throughput is just quietly half of what it should be.

In our experience, the proportion of “GPU performance problems” that turn out to be misaligned software stacks is high enough that we now check the driver/toolkit/framework triple before we look at anything else. The hardware ceiling is rarely the binding constraint. The stack you installed last quarter usually is. The installation procedure below is the practical method for assembling that environment; the rest of this article treats each link as a variable to declare in any benchmark methodology that quotes a throughput number on the resulting machine.

The version compatibility chain

GPU hardware
    ↓ requires
NVIDIA driver (e.g., 550.x)
    ↓ determines maximum supported
CUDA version (e.g., CUDA 12.4)
    ↓ combined with
cuDNN version (e.g., 8.9 or 9.x)
    ↓ required by
Framework version (PyTorch 2.x, TensorFlow 2.x)

Each link is a hard constraint, not a suggestion. The driver sets a ceiling on the CUDA runtime version the host can support. Each cuDNN build targets a specific CUDA major version, so the CUDA runtime determines which cuDNN builds are compatible. The framework wheel (torch==2.4.0+cu124, for example) pins both. Installing the newest PyTorch against a year-old driver is one of the most common reasons torch.cuda.is_available() returns False on an otherwise healthy machine.
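As a rough illustration of the ceiling check, the sketch below parses the CUDA build out of a wheel's local version tag and compares it with the highest CUDA version the driver supports (the "CUDA Version" field in the nvidia-smi header). The function names and return labels are hypothetical, and the check is deliberately conservative: CUDA minor-version compatibility can let a newer 12.x runtime run on an older 12.x driver with some limitations, so that case is reported separately rather than as a failure.

```python
import re
from typing import Optional, Tuple


def wheel_cuda_version(wheel_version: str) -> Optional[Tuple[int, int]]:
    """Extract the CUDA build from a local version tag such as '2.4.0+cu124'."""
    match = re.search(r"\+cu(\d+)$", wheel_version)
    if match is None:
        return None  # CPU-only or non-CUDA build, e.g. '2.4.0+cpu'
    digits = match.group(1)
    # The last digit is the minor version: 'cu118' -> (11, 8), 'cu124' -> (12, 4).
    return int(digits[:-1]), int(digits[-1])


def check_against_driver(wheel_version: str, driver_cuda_ceiling: Tuple[int, int]) -> str:
    """Compare a wheel's CUDA build with the driver's maximum supported CUDA version."""
    wheel_cuda = wheel_cuda_version(wheel_version)
    if wheel_cuda is None:
        return "no-cuda-build"
    if wheel_cuda <= driver_cuda_ceiling:
        return "ok"  # drivers are backward compatible with older runtimes
    if wheel_cuda[0] == driver_cuda_ceiling[0]:
        return "minor-version-compat"  # same major, newer minor: may run with limits
    return "unsupported"  # wheel needs a newer CUDA major than the driver offers


if __name__ == "__main__":
    print(check_against_driver("2.4.0+cu124", (12, 4)))
    print(check_against_driver("2.4.0+cu124", (11, 8)))
```

A result other than "ok" is a prompt to consult the official compatibility documentation, not a verdict on its own.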

Recommended installation method for AI workloads

Use the official NVIDIA package repository or runfile, not the distribution’s NVIDIA packages.

Ubuntu’s nvidia-driver-xxx packages are typically one or two minor versions behind upstream and often ship without the components AI workloads actually need: the CUDA toolkit itself, NCCL for multi-GPU communication, and a matching libcudnn. The packages are fine for desktop graphics. They are not fine for a training node.

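# Remove existing packages (quote the globs so the shell passes them to apt unexpanded)
sudo apt purge 'nvidia-*' 'libnvidia-*'
sudo apt autoremove

# Install from NVIDIA's package repository
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install cuda-drivers-550   # Driver branch; the toolkit package does not pull in a driver
sudo apt install cuda-toolkit-12-4  # Match to your target CUDA version
sudo reboot                         # Load the newly installed kernel module

# Verify (after the reboot)
nvidia-smi
/usr/local/cuda-12.4/bin/nvcc --version  # nvcc is not on PATH by default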

A clean nvidia-smi is necessary but not sufficient. It tells you the kernel module loaded and the driver is talking to the GPU. It says nothing about whether the CUDA runtime that PyTorch loads at import time is one the driver supports. Pip wheels bundle their own CUDA runtime, so the toolkit you just installed and the runtime the framework actually uses can differ.
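One way to see both sides at once is to ask the driver and the framework separately and print the answers next to each other. The following is a minimal sketch, assuming nvidia-smi is on PATH and PyTorch may or may not be installed; the function names are illustrative, and either half degrades to None rather than failing when its component is missing.

```python
import shutil
import subprocess
from typing import Optional


def driver_version() -> Optional[str]:
    """Return the driver version reported by nvidia-smi, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True, text=True, check=False, timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        return None  # typically means the kernel module is not loaded
    lines = result.stdout.strip().splitlines()
    return lines[0].strip() if lines else None


def framework_view() -> Optional[dict]:
    """Return what PyTorch itself reports, or None if torch is not installed."""
    try:
        import torch
    except ImportError:
        return None
    return {
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,         # None on CPU-only wheels
        "cudnn": torch.backends.cudnn.version(),  # None if cuDNN is unavailable
        "cuda_available": torch.cuda.is_available(),
    }


def stack_report() -> dict:
    """Collect the driver-side and framework-side views in one place."""
    return {"driver": driver_version(), "framework": framework_view()}


if __name__ == "__main__":
    print(stack_report())
```

If the driver field is populated but cuda_available is False, the mismatch sits above the kernel module.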

Version compatibility table (as of mid-2026)

PyTorch version	Required CUDA	Minimum driver (Linux)
2.4.x	CUDA 12.1+	525.60.13
2.3.x	CUDA 12.1+	525.60.13
2.2.x	CUDA 11.8 or 12.1	450.80.02 (11.8 wheel) or 525.60.13 (12.1 wheel)
2.1.x	CUDA 11.8 or 12.1	450.80.02 (11.8 wheel) or 525.60.13 (12.1 wheel)

Always verify against the PyTorch installation matrix for the exact wheel you intend to install. These minimums move; treat the table as a starting point, not canon.
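A small helper can turn the driver floors from the table into a check that runs before a wheel is installed. This is a sketch only: the dictionary copies the first two components of the minimums listed above, keyed by CUDA major version, and the function name is hypothetical. Update the values from the official matrix rather than trusting this copy.

```python
from typing import Optional, Tuple

# Driver floors per CUDA major version, taken from the table above
# (first two version components only).
MIN_DRIVER = {11: "450.80", 12: "525.60"}


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '550.54.15' into (550, 54, 15) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def driver_meets_minimum(cuda_major: int, driver_version: str) -> Optional[bool]:
    """Return True/False for a known CUDA major version, None if it is not listed."""
    minimum = MIN_DRIVER.get(cuda_major)
    if minimum is None:
        return None
    return parse_version(driver_version) >= parse_version(minimum)


if __name__ == "__main__":
    print(driver_meets_minimum(12, "550.54.15"))
    print(driver_meets_minimum(12, "470.57.02"))
```

Comparing parsed tuples instead of raw strings avoids the trap where "470" sorts after "1000" lexically.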

What are the common failure modes?

Symptom	Likely cause	Fix
`CUDA error: no kernel image is available for execution on the device`	Compute capability mismatch between the wheel’s compiled kernels and the GPU	Install a wheel built for the correct SM, or recompile from source
`RuntimeError: CUDA not available`	Driver not loaded, or CUDA runtime missing	Reinstall driver, confirm `nvidia-smi`, check `LD_LIBRARY_PATH`
Slow training, no error	Suboptimal kernel selection — cuDNN heuristics, determinism flags, or fallback paths	Audit `CUBLAS_WORKSPACE_CONFIG`, `torch.backends.cudnn.benchmark`, and the framework’s deterministic settings
OOM on first run	Older driver capping addressable VRAM, or framework allocator fragmentation	Update driver; check allocator configuration

The silent slow-training row is the dangerous one. The other three fail loudly and force you to fix them. A workload that runs at 60% of expected throughput because cuDNN’s autotuner picked a conservative kernel on an old driver will quietly cost you weeks of training time before anyone notices. This is one of the structural reasons identical GPUs often perform differently across nominally similar nodes — the GPU model is the same, but the stack underneath isn’t.
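Because the slow path raises no error, it helps to dump the settings that steer kernel selection at the start of every run and keep them with the logs. The sketch below reads the environment variable and the PyTorch flags named in the table; the function name is an assumption, and the torch-specific entries are simply omitted when PyTorch is not installed.

```python
import os


def kernel_selection_settings() -> dict:
    """Snapshot the settings that influence which cuBLAS/cuDNN kernels get picked."""
    settings = {
        # Unset (None) unless deterministic cuBLAS behaviour was requested.
        "CUBLAS_WORKSPACE_CONFIG": os.environ.get("CUBLAS_WORKSPACE_CONFIG"),
    }
    try:
        import torch
    except ImportError:
        return settings  # environment-only view when torch is absent
    settings.update({
        # True lets cuDNN autotune per input shape; False uses its heuristics.
        "cudnn.benchmark": torch.backends.cudnn.benchmark,
        # True restricts cuDNN to deterministic algorithms, which can be slower.
        "cudnn.deterministic": torch.backends.cudnn.deterministic,
        "deterministic_algorithms": torch.are_deterministic_algorithms_enabled(),
    })
    return settings


if __name__ == "__main__":
    for name, value in kernel_selection_settings().items():
        print(f"{name} = {value}")
```

Diffing this output between a fast node and a slow node is often quicker than profiling either one.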

What goes wrong during NVIDIA driver installation on Linux?

The most frequent installation-time failures cluster into four categories. Conflicting kernel modules (nouveau still loaded alongside nvidia) produce a driver that “installed” but never bound to the device. Mismatched toolkit and driver versions install cleanly and then fail at runtime when the framework calls a CUDA entry point the driver doesn’t export. Incomplete DKMS compilation, usually because the kernel headers package wasn’t installed, leaves you with a driver that works until the next kernel update and then disappears. Secure Boot enforcement blocks unsigned kernel modules from loading at all, with no obvious indication that this is what happened.
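Two of these categories can be checked from files the kernel already exposes. The sketch below, with hypothetical function names, parses the text of /proc/modules to spot a loaded nouveau or a missing nvidia module, and looks for the build directory that DKMS needs for the running kernel. It does not cover Secure Boot or toolkit mismatches.

```python
import os
import platform
from typing import List, Optional, Set


def loaded_modules(proc_modules_text: str) -> Set[str]:
    """Return module names; the name is the first field of each /proc/modules line."""
    return {line.split()[0] for line in proc_modules_text.splitlines() if line.strip()}


def module_findings(proc_modules_text: str) -> List[str]:
    """List module-level problems that would stop the NVIDIA driver from binding."""
    modules = loaded_modules(proc_modules_text)
    findings = []
    if "nouveau" in modules:
        findings.append("nouveau is loaded and may be holding the device")
    if "nvidia" not in modules:
        findings.append("nvidia kernel module is not loaded")
    return findings


def kernel_headers_present(release: Optional[str] = None) -> bool:
    """Check for the build tree DKMS compiles against for the given kernel release."""
    release = release or platform.release()
    return os.path.isdir(f"/lib/modules/{release}/build")


if __name__ == "__main__":
    if os.path.exists("/proc/modules"):
        with open("/proc/modules") as handle:
            for finding in module_findings(handle.read()):
                print(finding)
    print("kernel headers present:", kernel_headers_present())
```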

The diagnostic sequence we use is mechanical. Start with nvidia-smi. If it reports “NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver”, the kernel module is not loaded. sudo dmesg | grep -i nvidia will usually show why: a signature failure, a module dependency, or a conflicting nouveau load. If nvidia-smi succeeds but torch.cuda.is_available() returns False, the runtime is the issue, not the driver: typically a CPU-only wheel, a missing libcudart.so on the loader path, or a wheel built against a CUDA version the driver does not support.
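The two-step sequence reduces to a small decision function, sketched here with a hypothetical name and illustrative message strings. It takes the two observations as booleans so it can be driven by whatever probing code a site already has.

```python
def diagnose(nvidia_smi_ok: bool, torch_cuda_available: bool) -> str:
    """Map the two observations from the diagnostic sequence to a next step."""
    if not nvidia_smi_ok:
        # Driver layer: nothing above it can work until the module loads.
        return "driver: kernel module not loaded; inspect dmesg for nvidia messages"
    if not torch_cuda_available:
        # Driver is fine, so the problem sits in the runtime or the wheel.
        return "runtime: check the wheel's CUDA build and the loader path"
    return "ok: driver and runtime both respond"


if __name__ == "__main__":
    print(diagnose(nvidia_smi_ok=True, torch_cuda_available=False))
```

Checking the driver first matters because a failed nvidia-smi makes the framework-level result uninformative.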

For production GPU nodes we maintain a pinned, validated configuration: Ubuntu 22.04 LTS on the GA kernel (we avoid the HWE kernel unless specific hardware demands it), NVIDIA driver 550.x from the CUDA repository rather than the Ubuntu archive, CUDA toolkit 12.4 installed from the NVIDIA runfile, and PyTorch installed via pip against the CUDA 12.4 binaries. This is an observed-pattern combination — stable across the production nodes we operate, not a benchmarked claim that any other site will reproduce identically.

Containerised deployments and driver management

For teams running AI workloads under Docker or Kubernetes, the NVIDIA Container Toolkit changes the driver-management model in a useful way. The host needs only the NVIDIA kernel driver. No CUDA toolkit on the host, no cuDNN on the host, no framework installed on the host. Everything above the kernel module lives inside the container image.

The compatibility surface collapses to a single host-side variable: the driver version. Images built on NVIDIA’s CUDA base images declare their requirements through the NVIDIA_REQUIRE_CUDA environment variable, and the container toolkit validates them at launch. If the host driver is too old, the container refuses to start with an explicit message, which is much better than the silent miscompute or wrong-kernel-path failures you get with a host-installed mismatch.

We manage host drivers using a pinned package version in Ansible. Driver updates roll one node at a time, validated by a smoke test — launch a known PyTorch container, run a sixty-second inference job, confirm throughput is within an expected band — before advancing to the next node. The rolling strategy keeps at least three-quarters of GPU capacity available during maintenance windows. The point isn’t the specific automation; it’s that the host-driver version is the only thing that needs that discipline.
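The rollout logic itself is small. The sketch below is not the Ansible automation described above; it is a plain-Python illustration in which update_driver and smoke_test_throughput are hypothetical callables supplied by the caller, and the 5% tolerance is an arbitrary placeholder for the expected band.

```python
from typing import Callable, Iterable, List


def within_band(measured: float, baseline: float, tolerance: float = 0.05) -> bool:
    """True if measured throughput is within +/- tolerance of the baseline."""
    return abs(measured - baseline) <= tolerance * baseline


def rolling_update(
    nodes: Iterable[str],
    update_driver: Callable[[str], None],
    smoke_test_throughput: Callable[[str], float],
    baseline: float,
    tolerance: float = 0.05,
) -> List[str]:
    """Update one node at a time and halt on the first failed smoke test."""
    updated = []
    for node in nodes:
        update_driver(node)
        measured = smoke_test_throughput(node)
        if not within_band(measured, baseline, tolerance):
            # Stop here so the regression never reaches the remaining nodes.
            raise RuntimeError(
                f"{node}: throughput {measured:.1f} outside expected band; halting rollout"
            )
        updated.append(node)
    return updated


if __name__ == "__main__":
    done = rolling_update(["node-a", "node-b"], lambda n: None, lambda n: 100.0, baseline=100.0)
    print(done)
```

Raising on the first out-of-band result leaves every later node on the previous driver, which is the property the one-node-at-a-time strategy exists to provide.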

Why software ceilings often bind before hardware ceilings

The headline number on a GPU spec sheet — peak FP16 throughput, HBM bandwidth, NVLink rate — is a hardware ceiling. The number you actually measure is almost always limited by something above it. A cuDNN heuristic that picks a non-tensor-core kernel because the input shape isn’t recognised. A framework allocator that fragments VRAM and forces a smaller batch size. A driver version old enough that the runtime falls back to a generic implementation path instead of the architecture-specific one.

This is why treating the software stack as a first-class performance component matters when reading any GPU benchmark. The same A100 or H100, on the same workload, will produce materially different numbers across driver/CUDA/framework triples. The hardware didn’t change. The kernel selection did.

LynxBench AI records the host driver version, the container’s CUDA build, and the framework wheel as a single executor-version triple alongside every benchmark run, because reproducible AI performance measurement on Linux depends on those three lines being pinned and inspectable. The question to put to any Linux GPU performance result is whether it travels with that triple, the production software stack that decides which kernels actually load, or only with the GPU model, leaving the kernel coverage your deployment will see unaudited.
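To make the idea concrete, here is one possible shape for a result that carries its triple with it. This is a hypothetical sketch, not the LynxBench AI schema: the class names, field names, and every value in the example are placeholders chosen for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ExecutorTriple:
    host_driver: str      # e.g. the driver version string from nvidia-smi
    container_cuda: str   # CUDA build inside the container image
    framework_wheel: str  # exact wheel specifier, including the +cuXYZ tag


@dataclass(frozen=True)
class BenchmarkRecord:
    gpu_model: str
    workload: str
    throughput: float
    unit: str
    executor: ExecutorTriple


def to_json(record: BenchmarkRecord) -> str:
    """Serialise the record so the triple is stored next to the number."""
    return json.dumps(asdict(record), sort_keys=True)


if __name__ == "__main__":
    # Placeholder values only; not a measured result.
    example = BenchmarkRecord(
        gpu_model="H100",
        workload="example-inference",
        throughput=1.0,
        unit="samples/s",
        executor=ExecutorTriple("550.54.15", "12.4", "torch==2.4.0+cu124"),
    )
    print(to_json(example))
```

Making the triple a required field means a result without it cannot be constructed, which is a stronger guarantee than a naming convention.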

Frequently Asked Questions

Should I use the distribution’s NVIDIA packages or the official NVIDIA repository for an AI training node?

Use the official NVIDIA package repository or runfile. Ubuntu’s nvidia-driver-xxx packages typically lag upstream by one or two minor versions and often omit the CUDA toolkit, NCCL, and a matching libcudnn that AI workloads need. The distribution packages are fine for desktop graphics, but not for a training node where the full toolkit chain must be present and version-aligned.

Why does `torch.cuda.is_available()` return False even though `nvidia-smi` works?

A clean nvidia-smi only confirms the kernel module loaded and the driver is talking to the GPU; it says nothing about the CUDA runtime PyTorch loads at import time. When nvidia-smi succeeds but torch.cuda.is_available() returns False, the runtime is the issue rather than the driver: usually a CPU-only wheel, a missing libcudart.so on the loader path, or a wheel built against a CUDA version the driver does not support.

How does the NVIDIA Container Toolkit simplify driver management for Docker and Kubernetes nodes?

It collapses the compatibility surface to a single host-side variable: the driver version. The host needs only the NVIDIA kernel driver, while the CUDA toolkit, cuDNN, and framework all live inside the container image. Each image declares its minimum required driver, and the toolkit validates at launch — refusing to start with an explicit message rather than silently mis-computing on a host-installed mismatch.

What is the safest way to roll out GPU driver updates across a production cluster?

Pin the driver to a known package version and roll updates one node at a time, gated by a smoke test. We launch a known PyTorch container, run a short inference job, and confirm throughput is within an expected band before advancing to the next node. This rolling strategy keeps most GPU capacity available during maintenance windows and catches a regression on one node before it reaches the rest.