Planning GPU Memory for Deep Learning Training

Plan GPU memory before a training run: estimate weights, activations, optimiser state, and workspace so jobs do not crash on OOM.

Written by TechnoLynx Published on 16 Feb 2026

Introduction

Training a deep neural network fails for one plain reason more often than any other: it runs out of GPU memory. The job stops, you lose wall-clock time, and you waste paid compute. Memory planning is the cheapest piece of engineering discipline that survives contact with real workloads. It lets you pick hardware, set a safe batch size, and choose settings that fit the constraints before the run starts — instead of after the first OOM crash. This article is the methodology piece for the broader GPU engineering practice: the steps, the rationale, and the boundary conditions where the estimates stop working.

The naive assumption is that only large models hit memory ceilings. In practice, smaller models fail just as often when the input data is large, the batch is large, or the network keeps many intermediate tensors for the backward pass. Memory estimation work from the last few years shows that developers consistently mis-predict usage before a run, and the mismatch causes a large share of job failures.

What this means in practice

  • Estimate five memory components separately: parameters, gradients, optimiser state, activations, and workspace.
  • Treat the optimiser state as the silent multiplier: Adam adds two extra tensors per weight, so its state alone is twice the size of the weights.
  • Use gradient checkpointing and mixed precision as the first two levers before reaching for larger GPUs.
  • Reserve 10–15% headroom for workspace and fragmentation; “fits in theory” is not the same as “fits in practice.”

How do I estimate GPU memory usage before launching a training job?

Decompose the model’s memory footprint into the five components and sum them. Parameters give you the baseline: count the weights, multiply by the dtype size (4 bytes for FP32, 2 for FP16/BF16). Gradients match the parameter shape, so add the same amount again. Optimiser state depends on the optimiser: SGD adds nothing extra, momentum adds 1× the parameter size, Adam-class methods add 2× (first and second moments).
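The arithmetic for the three model-state components can be sketched in a few lines of Python. This is a minimal sketch, not a definitive estimator: the function name and the lookup tables are illustrative, and it assumes gradients and optimiser state are stored in the same dtype as the weights, which mixed-precision setups do not do.

```python
# Bytes per element for common training dtypes.
DTYPE_BYTES = {"fp32": 4, "fp16": 2, "bf16": 2}

# Extra state tensors kept per parameter, following the multipliers in the text.
OPTIMIZER_STATE_TENSORS = {"sgd": 0, "momentum": 1, "adam": 2}


def model_state_bytes(num_params, dtype="fp32", optimizer="adam"):
    """Return (parameter, gradient, optimiser-state) sizes in bytes.

    Assumption: gradients and optimiser state share the weight dtype.
    """
    param_bytes = num_params * DTYPE_BYTES[dtype]
    grad_bytes = param_bytes  # gradients match the parameter shape
    state_bytes = OPTIMIZER_STATE_TENSORS[optimizer] * param_bytes
    return param_bytes, grad_bytes, state_bytes


if __name__ == "__main__":
    # Hypothetical 125M-parameter model trained in FP32 with Adam.
    params, grads, state = model_state_bytes(125_000_000, "fp32", "adam")
    total_gib = (params + grads + state) / 2**30
    print(f"{total_gib:.2f} GiB of model state")  # prints "1.86 GiB of model state"
```

Activations and workspace are deliberately left out here; they depend on the batch and the architecture rather than on the parameter count.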

Activations are the hardest component to predict from a spec sheet because they depend on batch size, sequence length, and architecture. The most reliable estimate is a measurement: run one forward and backward pass with a representative batch and read the peak allocated memory. A forward pass alone gives only a lower bound, because the backward pass allocates gradient buffers on top of the stored activations. Workspace covers cuDNN scratch buffers, allocator fragmentation, and intermediate tensors that the framework keeps for performance. Add 10–15% to the total to cover it.
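Once the activation figure is known, the full plan is a sum plus a margin. The sketch below is one way to keep the components separate so each can be revised independently; the `MemoryPlan` class and its field names are hypothetical, and workspace is modelled as a flat percentage on top of the other four components, which is a simplification.

```python
from dataclasses import dataclass


@dataclass
class MemoryPlan:
    """Planned memory in bytes for the four statically estimated components."""
    parameters: int
    gradients: int
    optimizer_state: int
    activations: int  # ideally a measured value, not a guess

    def subtotal(self):
        return (self.parameters + self.gradients
                + self.optimizer_state + self.activations)

    def total(self, workspace_percent=15):
        # Integer arithmetic keeps the result exact for byte counts.
        return self.subtotal() * (100 + workspace_percent) // 100


if __name__ == "__main__":
    gib = 2**30
    plan = MemoryPlan(parameters=4 * gib, gradients=4 * gib,
                      optimizer_state=8 * gib, activations=20 * gib)
    print(f"planned total: {plan.total() / gib:.1f} GiB")
```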

What do model size, batch size, and sequence length contribute to GPU memory?

The model contributes a fixed cost: parameters plus gradients plus optimiser state, all scaled by dtype. This cost dominates for large language models and is small for older convnets. Batch size scales activation memory linearly: doubling the batch doubles the activations. Sequence length scales activations linearly for most architectures and quadratically for attention without efficient kernels (FlashAttention-class implementations bring this back to linear).
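The difference between the linear and quadratic terms is easy to see with shapes alone. The sketch below assumes a standard attention layer that materialises a score matrix of shape (batch, heads, seq_len, seq_len) and a token activation of shape (batch, seq_len, hidden); the function names and the example sizes are illustrative, and real layers store several such tensors.

```python
def attention_scores_bytes(batch, heads, seq_len, bytes_per_elem=2):
    """Bytes for one layer's materialised attention score matrix.

    Shape (batch, heads, seq_len, seq_len): quadratic in seq_len.
    """
    return batch * heads * seq_len * seq_len * bytes_per_elem


def token_activation_bytes(batch, seq_len, hidden, bytes_per_elem=2):
    """Bytes for one (batch, seq_len, hidden) tensor: linear in seq_len."""
    return batch * seq_len * hidden * bytes_per_elem


if __name__ == "__main__":
    gib = 2**30
    for seq_len in (2048, 4096):
        scores = attention_scores_bytes(8, 32, seq_len) / gib
        tokens = token_activation_bytes(8, seq_len, 4096) / gib
        print(f"seq={seq_len}: scores {scores:.1f} GiB, tokens {tokens:.3f} GiB")
```

Doubling the sequence length doubles the token tensor but quadruples the score matrix, which is why long-sequence runs hit the ceiling well before the parameter count suggests they should.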

The interaction matters: a small model with a long sequence and a large batch can use more memory than a large model with a short sequence and a small batch. Plan against the dominant term for your workload class, not against the parameter count alone.

When should I use gradient checkpointing, mixed precision, or model sharding to fit memory?

These three techniques cover the most common scaling paths. Mixed precision (FP16 or BF16 for compute and activations, FP32 for the master weights) roughly halves activation memory with minimal accuracy impact for most architectures. It does not halve parameter memory, because the FP32 master copy is kept alongside the low-precision weights. It is still the first lever and the cheapest. Gradient checkpointing trades compute for memory by recomputing activations during the backward pass instead of storing them. It is most useful when activations dominate, as they do for deep networks with long sequences.
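The memory side of the checkpointing trade can be approximated with a simple counting model. The sketch below assumes a chain of identical layers split into equal segments: one checkpoint is kept per segment, plus the activations of the single segment being recomputed. Under that assumption the peak is lowest when the segment count is near the square root of the depth. The function names are illustrative and real networks are rarely this uniform.

```python
import math


def checkpointed_activation_units(num_layers, num_segments):
    """Peak stored activations, in units of one layer's activation size.

    Simplified model: one checkpoint per segment, plus the activations of
    the one segment currently being recomputed in the backward pass.
    """
    segment_len = math.ceil(num_layers / num_segments)
    return num_segments + segment_len


def best_segment_count(num_layers):
    """Segment count that minimises the peak under the model above."""
    return min(range(1, num_layers + 1),
               key=lambda k: checkpointed_activation_units(num_layers, k))


if __name__ == "__main__":
    layers = 64
    k = best_segment_count(layers)
    peak = checkpointed_activation_units(layers, k)
    print(f"{layers} layers: {k} segments, peak {peak} units instead of {layers}")
```

The price is roughly one extra forward pass per training step, which is the compute half of the trade.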

Model sharding (ZeRO-style or tensor parallelism) splits the parameters, gradients, and optimiser state across GPUs. Reach for it when the model itself does not fit on a single device, even with the other two techniques applied. Sharding adds communication overhead, so the practical order is: mixed precision first, gradient checkpointing second, sharding when the model state exceeds device capacity.
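To judge whether sharding is needed, it helps to estimate the model state each GPU would hold. The sketch below follows the per-parameter accounting commonly used for ZeRO with mixed-precision Adam: 2 bytes of FP16 weights, 2 bytes of FP16 gradients, and 12 bytes of FP32 optimiser state (master weights plus two moments). That accounting is an assumption brought in for illustration, the function name is hypothetical, and activations and communication buffers are ignored.

```python
def zero_state_bytes_per_gpu(num_params, num_gpus, stage):
    """Approximate model-state bytes per GPU for ZeRO stages 0 to 3.

    Stage 1 partitions optimiser state, stage 2 also partitions gradients,
    stage 3 also partitions the parameters themselves.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2 or 3")
    params = 2.0 * num_params   # FP16 weights
    grads = 2.0 * num_params    # FP16 gradients
    optim = 12.0 * num_params   # FP32 master weights + two Adam moments
    if stage >= 1:
        optim /= num_gpus
    if stage >= 2:
        grads /= num_gpus
    if stage >= 3:
        params /= num_gpus
    return params + grads + optim


if __name__ == "__main__":
    # Hypothetical 7.5B-parameter model on 64 GPUs.
    for stage in range(4):
        gb = zero_state_bytes_per_gpu(7.5e9, 64, stage) / 1e9
        print(f"stage {stage}: {gb:.1f} GB per GPU")
```

If the stage 0 figure already fits with room for activations, sharding adds communication cost without buying anything.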

How does the optimiser choice affect GPU memory footprint?

The optimiser is the silent memory multiplier. SGD without momentum adds no per-parameter state. SGD with momentum adds one tensor per parameter, doubling parameter-class memory. Adam and AdamW add two tensors per parameter, tripling it. Lion keeps a single momentum tensor and Adafactor factors the second moment, so both need less. For a model with 7B parameters in FP32, Adam’s two moment tensors come to 56 GB (2 × 7B × 4 bytes) of optimiser state alone, before any activations are allocated.

Newer optimisers save memory at the cost of slightly worse convergence in specific regimes: Adafactor uses factored second moments to cut state per parameter, and 8-bit optimisers quantise the state, shrinking it by about 4× relative to FP32. For training runs where memory is the binding constraint, switching the optimiser is often a faster win than reducing the batch size.
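A quick side-by-side makes the size of the optimiser lever concrete. The sketch below uses per-parameter byte counts derived from the tensor counts above, with FP32 state at 4 bytes per element and 8-bit state at 1 byte. The table keys are illustrative labels, and the 8-bit figure ignores the small per-block quantisation constants such optimisers also store.

```python
# Optimiser state in bytes per parameter (illustrative labels).
STATE_BYTES_PER_PARAM = {
    "sgd": 0,            # no per-parameter state
    "sgd_momentum": 4,   # one FP32 tensor
    "adam_fp32": 8,      # two FP32 tensors
    "adam_8bit": 2,      # two 8-bit tensors, quantisation constants ignored
}


def optimizer_state_gb(num_params, optimizer):
    """Optimiser state in decimal gigabytes for the given optimiser label."""
    return num_params * STATE_BYTES_PER_PARAM[optimizer] / 1e9


if __name__ == "__main__":
    for name in STATE_BYTES_PER_PARAM:
        print(f"{name:>13}: {optimizer_state_gb(7_000_000_000, name):.0f} GB")
```

For a 7B-parameter model this gives 0, 28, 56, and 14 GB respectively, so moving from FP32 Adam to an 8-bit variant frees about 42 GB before the batch size is touched.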

What is a safe headroom margin to avoid out-of-memory crashes mid-run?

Allocator fragmentation, cuDNN workspace selection, and dynamic activation shapes all consume memory that is hard to predict statically. A safe headroom margin is 10–15% of total device memory left unallocated by the planned components. Production training pipelines often add another 5% for safety against memory spikes during gradient accumulation or evaluation steps interleaved with training.

The signal that headroom is too tight: occasional OOM crashes that disappear when you reduce batch size by 1, or when allocator settings are changed. The signal that headroom is too loose: idle device memory that could host a larger batch and improve throughput. Both are tunable once the run is stable.
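The headroom rule can be turned into a starting batch size. The sketch below assumes activation memory grows linearly with batch size and that the per-sample cost has been measured; the function and parameter names are hypothetical. It reserves a percentage of device memory and fits as many samples as the remainder allows.

```python
def max_batch_size(device_bytes, fixed_bytes, bytes_per_sample,
                   headroom_percent=15):
    """Largest batch that leaves headroom_percent of the device unallocated.

    fixed_bytes is parameters + gradients + optimiser state;
    bytes_per_sample is the measured activation cost of one sample.
    """
    budget = device_bytes * (100 - headroom_percent) // 100
    usable = budget - fixed_bytes
    if usable <= 0:
        return 0  # the model state alone does not fit within the budget
    return usable // bytes_per_sample


if __name__ == "__main__":
    gib = 2**30
    # Hypothetical 80 GiB device, 24 GiB of model state, 1.5 GiB per sample.
    print(max_batch_size(80 * gib, 24 * gib, 3 * gib // 2))
```

Treat the result as a first guess to validate with a short run, since fragmentation and dynamic shapes are exactly what the margin is there to absorb.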

How do framework-level settings (PyTorch, TensorFlow) change effective memory usage?

Allocator behaviour matters more than most developers realise. PyTorch’s caching allocator holds freed blocks for reuse, which improves performance but means tools such as nvidia-smi report more memory in use than the live tensors occupy. The PYTORCH_CUDA_ALLOC_CONF environment variable exposes knobs (for example max_split_size_mb) to limit block splitting and reduce fragmentation. TensorFlow’s memory growth setting (tf.config.experimental.set_memory_growth) prevents preallocation of the entire device, which is essential when sharing a GPU across jobs.
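Allocator options are passed to PyTorch as a comma-separated list of key:value pairs in PYTORCH_CUDA_ALLOC_CONF. The sketch below builds that string from keyword arguments using only the standard library. The helper name is hypothetical, and the values shown are illustrative rather than recommendations: suitable settings depend on the workload and the PyTorch version.

```python
import os


def set_allocator_conf(**options):
    """Export PYTORCH_CUDA_ALLOC_CONF as 'key:value,key:value'."""
    value = ",".join(f"{key}:{val}" for key, val in options.items())
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = value
    return value


# The variable is read when the CUDA allocator initialises, so set it before
# the first CUDA allocation; the safest place is before importing torch.
set_allocator_conf(max_split_size_mb=128, expandable_segments=True)
```

Setting the variable in the job launcher instead of in code has the same effect and keeps the training script unchanged.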

cuDNN’s benchmark flag selects faster algorithms at the cost of extra workspace; turn it off when memory is tighter than latency. Mixed-precision settings (torch.autocast, tf.keras.mixed_precision) need their loss scaler tuned for FP16 to converge — BF16 sidesteps this on newer hardware. Each of these settings shifts the effective memory ceiling by 5–20% without changing the model.

How TechnoLynx Can Help

TechnoLynx delivers GPU Performance Audits that profile your training workload, map the five memory components against the actual device, and produce a ranked optimisation roadmap — mixed precision, checkpointing, optimiser swaps, sharding — sized to the constraint that matters. If your training jobs are crashing on OOM or your batch sizes feel arbitrary, contact us to get a memory plan grounded in measurement rather than guesswork.

Image credits: Freepik
