MFU is a useful efficiency metric, not a target to maximise

Model FLOPS Utilization (MFU) measures the fraction of a GPU’s theoretical peak FLOPS that a training run actually achieves. An MFU of 0.5 means the training is using 50% of the GPU’s theoretical compute capacity. The metric was popularised by the PaLM paper and has since become the default way teams compare training efficiency across model sizes, parallelism strategies, and hardware generations.

MFU is genuinely useful as a diagnostic. It tells you, in one number, how close your training code is getting to the silicon’s published ceiling. It is not a target to maximise in isolation, and treating it as one leads to bad engineering decisions. Several operations that are correct and necessary (data loading, gradient communication, optimizer steps, activation checkpointing) reduce MFU without representing wasted compute. A run with 70% MFU and broken convergence is not preferable to a run with 35% MFU and stable training.

The most common misuse we see is treating MFU as a property of the GPU rather than a property of the joint configuration. The same H100 will show 60% MFU on a well-tuned dense transformer with FlashAttention and 15% MFU on the same model trained naively in eager PyTorch. The hardware did not change; the stack on top of it did.

How MFU is calculated

The arithmetic is straightforward. You divide the FLOPs your training step actually performs by what the GPU could theoretically perform in the same wall-clock time at the relevant precision.

MFU = achieved_FLOPS / theoretical_peak_FLOPS

For transformer training, the standard approximation for FLOPs per training step is 6 × P × S × B, where P is the parameter count, S is the sequence length, and B is the batch size per GPU. The 6 comes from 2 FLOPs per parameter per token for the forward pass (one multiply, one add) plus roughly 4 for the backward pass (gradient of activations and gradient of weights).
This approximation ignores attention FLOPs and embedding FLOPs, which matter at long sequence lengths.

A worked example, on the order of what we see in practice. A 7B parameter model on a single H100 SXM (989 TFLOPS theoretical BF16) trains at sequence length 2048 with batch size 4 per GPU. If a training step takes 1.2 seconds:

step_FLOPs = 6 × 7e9 × 2048 × 4 ≈ 3.4e14
achieved_FLOPS = 3.4e14 / 1.2 ≈ 2.85e14 = 285 TFLOPS
MFU = 285 / 989 ≈ 0.29

Twenty-nine percent looks low, and a naive reading would call this a problem. It is not. Single-GPU training of a 7B model at this batch size is partially memory-bandwidth-bound in the attention kernels, and the 6 × P approximation slightly undercounts the true FLOP count because it leaves out attention. The number is informative once you know what it is measuring.

Typical MFU values

The right benchmark for an MFU number is not “100%”; it is the typical range for the configuration. The table below is an observed pattern across training runs we have profiled; treat the ranges as planning heuristics, not guarantees.

| Configuration | Typical MFU | Dominant overhead |
| --- | --- | --- |
| Well-optimised dense transformer, single GPU | 0.50–0.70 | None (compute-bound) |
| Multi-GPU DDP, small model | 0.40–0.60 | Gradient all-reduce |
| Pipeline parallel | 0.30–0.50 | Pipeline bubble |
| Tensor parallel (Megatron-style) | 0.45–0.65 | Activation all-gather / reduce-scatter |
| MoE training | 0.20–0.40 | All-to-all routing, expert imbalance |

Values above 0.70 are excellent and usually require custom kernels, very large batch sizes, or FP8 on H100. Values below 0.20 on a dense transformer almost always indicate a fixable bottleneck: a stalled input pipeline, a wrong precision setting, or a kernel falling back to a slow path. This is an observed-pattern claim, not a published benchmark. The exact range depends on framework version, FlashAttention version, batch size, and the specific model architecture.

MFU bands as a diagnostic rubric

The headline-MFU table above describes typical achievable ranges for a configuration.
A second cut, useful when triaging an unknown training job, asks what an observed MFU implies about the run.

| MFU range | Interpretation | Most common cause |
| --- | --- | --- |
| > 60% | Excellent | Well-optimised large model, tuned kernels, NVLink, large batch |
| 45–60% | Good | Typical for well-configured single-node or NVLink training |
| 30–45% | Acceptable | Smaller models, off-the-shelf configuration, no fused kernels |
| 15–30% | Investigate | Data-loader stalls, suboptimal batch size, communication overhead |
| < 15% | Problem | Major bottleneck: tiny batch, wrong precision, broken pipeline |

The band that matters is not the absolute number but the gap between current MFU and the achievable MFU for the workload. A 35% MFU on a Mixture-of-Experts model with heavy expert-routing communication may already be near ceiling. A 35% MFU on a dense 13B model on an 8×H100 NVLink node almost certainly has 20+ points of headroom. Treat the rubric as a diagnostic prior, not a verdict.

A diagnose-intervene-remeasure sequence for improving MFU

The pattern that has worked for us is diagnose-intervene-remeasure, run three times. The non-obvious part is the order: fixing communication overhead before fixing a data stall is wasted work, because the data stall hides the comm overhead; the GPUs were waiting for input, not waiting on each other.

Baseline. Measure current MFU with the existing stack at steady state. A fresh process has cold caches, an un-warmed CUDA context, and untouched thermal headroom; see peak vs steady-state performance in AI for why peak-burst numbers mislead.

Identify the primary bottleneck. Profile with PyTorch Profiler, Nsight Systems, or dcgm-exporter + Grafana. Look for cudaStreamSynchronize waits (data stalls), large NCCL bars (comm overhead), low SM occupancy on matmul kernels (memory-bandwidth-bound), or a wall of tiny kernels (launch overhead).

Apply the lowest-cost intervention first. Data stall → more DataLoader workers, prefetch_factor, pinned memory, GPU-side augmentation (DALI).
Comm overhead → gradient accumulation, torch.compile with overlap, or moving from PCIe to NVLink topology when interconnect is the dominant cost. Memory-bandwidth-bound → larger batch, FlashAttention, fused kernels. Launch overhead → torch.compile(mode='max-autotune') or CUDA Graphs.

Remeasure, then repeat. Each round typically recovers 10–20 percentage points of MFU in our engagements (an observed pattern, not a benchmark). After three rounds, most training jobs land within 5–10 points of the achievable ceiling for that model on that hardware.

What does low MFU actually indicate?

Low MFU is a symptom, not a diagnosis. It tells you the GPU is not fully busy doing matrix multiplications, which is true a large fraction of the time for valid reasons. The useful question is what the GPU is doing instead.

- Data loading stall: The GPU is idle waiting for the CPU to produce the next batch. Visible in profiler traces as gaps between kernel launches. Often fixed by increasing dataloader workers, persistent workers, or moving augmentation to the GPU.
- Memory bandwidth bound: The operation is loading and storing more bytes than it does FLOPs. MFU will be low but the memory subsystem utilisation will be high. Attention at long sequence length, layer norms, and elementwise ops all behave this way.
- Communication overhead: In multi-GPU training, all-reduce or all-gather operations occupy NCCL while compute stalls. The fix is usually overlapping communication with compute, not pushing MFU higher.
- Small effective batch size: Too few elements per step to saturate GPU compute. Tensor cores need certain shape alignments to hit peak; small or oddly-shaped matrices fall to slower paths.
- Framework overhead: Python overhead, excessive kernel launches, eager-mode dispatch, or autograd graph construction. torch.compile and CUDA graphs address most of these.

Utilization bottlenecks and the illusion of idle GPUs covers the diagnostic process in more depth.
The short version: MFU tells you that compute is not maxed out; an Nsight Systems trace or a PyTorch profiler timeline tells you why.

When is pushing for higher MFU not the right goal?

There are several configurations where chasing MFU actively damages the run.

The first is convergence. Larger batch sizes raise MFU because tensor cores are happier with bigger matrices. They also change the optimisation dynamics. Past a critical batch size, you stop getting more learning per token; you just compute more efficiently per step. If your goal is the best model at a fixed token budget, the MFU-maximising batch size is often too large.

The second is memory. Activation checkpointing trades compute for memory by recomputing activations during the backward pass. It reduces MFU (more FLOPs spent on the same gradient) but allows larger models or longer sequences to fit. Removing checkpointing to raise MFU and then running out of memory is not progress.

The third is precision. FP8 on H100 can push MFU dramatically higher than BF16 because the theoretical peak is roughly 2× higher and the actual achieved FLOPS often more than doubles. But FP8 requires per-tensor scaling, careful loss-scale management, and occasional fallbacks to higher precision. A team that flips on FP8 to chase MFU without the surrounding infrastructure ends up with NaN losses.

The fourth is communication. On multi-node training, overlapping communication with compute can lower the efficiency you measure on individual compute kernels (they now share SMs and memory bandwidth with NCCL) while raising end-to-end throughput. A kernel-level reading of the metric and the goal point in opposite directions.

How do you calculate MFU for your own model?

Two numbers, then a ratio. The numerator is your model’s actual achieved FLOPS during training. The denominator is the GPU’s theoretical peak FLOPS at the training precision.
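
The ratio just described can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `flops_per_token` and `mfu` are hypothetical, the optional attention term follows PaLM-style accounting (12 × layers × hidden × sequence length per token), and the layer count and hidden size used below are assumed values for a generic 7B-class model that do not come from this article.

```python
def flops_per_token(n_params, n_layers=0, hidden=0, seq_len=0):
    """Training FLOPs per token: 6 * P, plus an optional attention term.

    The attention term 12 * layers * hidden * seq_len follows PaLM-style
    accounting; leaving the optional arguments at 0 gives the plain
    6 * P heuristic.
    """
    return 6 * n_params + 12 * n_layers * hidden * seq_len


def mfu(tokens_per_second, flops_per_tok, peak_flops):
    """Achieved FLOPS divided by the vendor peak at the training precision."""
    return tokens_per_second * flops_per_tok / peak_flops


# Inputs from the worked example: 7B parameters, sequence length 2048,
# batch size 4 per GPU, 1.2 s per step, H100 SXM BF16 peak of 989 TFLOPS.
tokens_per_second = 2048 * 4 / 1.2
H100_BF16_PEAK = 989e12

mfu_simple = mfu(tokens_per_second, flops_per_token(7e9), H100_BF16_PEAK)

# Assumed model shape (32 layers, hidden size 4096), for illustration only.
mfu_with_attention = mfu(
    tokens_per_second,
    flops_per_token(7e9, n_layers=32, hidden=4096, seq_len=2048),
    H100_BF16_PEAK,
)

print(f"6P only:        {mfu_simple:.2f}")
print(f"with attention: {mfu_with_attention:.2f}")
```

With the worked-example inputs the 6 × P form gives roughly 0.29, and counting attention under the assumed shape lifts the estimate by about two points. Whichever form you pick, quote it alongside the number so that two MFU figures are comparable.
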
For the numerator, instrument your training loop to measure tokens-per-second (or samples-per-second times sequence length), then multiply by the FLOPs-per-token estimate. The 6 × P heuristic is the standard approximation; PaLM-style papers use a more precise version that adds attention FLOPs (12 × layers × hidden × seq per token, which is 12 × layers × hidden × seq² per sequence, for the attention matmuls; this matters at long sequences). PyTorch’s torch.cuda.Event timers around the training step are accurate enough; the profiler is overkill for steady-state measurement.

For the denominator, consult the GPU vendor’s spec sheet for the precision you are actually using. An H100 SXM is rated at 989 TFLOPS for BF16 tensor cores, ~1979 TFLOPS for FP8, and only ~67 TFLOPS for FP32 non-tensor. Using the wrong denominator inflates or deflates MFU by 10× or more. This is per NVIDIA’s published specifications; the numbers shift slightly between SXM and PCIe variants and between marketing peak and sustained peak.

We track MFU continuously during training runs in our engagements. A sudden MFU drop of more than 5% from baseline is treated as a system event, for example thermal throttling on a node, a slow disk on the data path, or a stuck NCCL collective. MFU trending downward over hours often indicates memory fragmentation requiring a process restart, particularly with dynamic shapes.

MFU across hardware generations

One genuinely useful property of MFU is hardware-independence. An A100 hitting 45% MFU and an H100 hitting 30% MFU on the same workload are both informative, even though the absolute throughput on the H100 is far higher. Because the H100’s theoretical peak is larger, the same achieved FLOPS divides into a smaller fraction.

We have seen this pattern repeatedly when migrating workloads from A100 to H100: the code achieves 42% MFU on A100 and only 28% MFU on H100 out of the box.
The gap is almost always an architectural feature gap: code that does not use FP8, does not exploit TMA for asynchronous memory copies, or does not take advantage of the larger shared memory per SM. Investigating and resolving these gaps has recovered 30–50% additional throughput across the migrations we have done. This is an observed pattern across our engagements, not a benchmarked rate.

Historical MFU data also informs procurement. If workloads consistently achieve 35–40% MFU on NVIDIA hardware but only 20–25% on AMD hardware due to less mature kernels, the effective cost-per-useful-FLOP calculation shifts away from what raw spec sheets suggest. The vendor-published peak FLOPS is a necessary but not sufficient input for a real procurement decision.

LynxBench AI treats Model FLOPS Utilization as the relationship between achieved arithmetic and peak arithmetic on the AI Executor under the actual training run, not as a hardware property, because MFU is determined by the joint configuration of model, optimizer, parallelism, and software stack on top of the silicon. The methodology check on any quoted MFU number: are the model, parallelism strategy, optimizer, and software-stack version disclosed alongside the percentage as joint inputs to the measurement, or was the figure published as if it were a property of the GPU alone?
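
To make the hardware-generation and procurement arithmetic above concrete, here is a small Python sketch under stated assumptions. The class name `GpuRun` and its fields are hypothetical; the A100 BF16 peak of 312 TFLOPS is taken from NVIDIA's A100 spec sheet rather than from this article; and the hourly prices are placeholders chosen only so the code runs, not real quotes.

```python
from dataclasses import dataclass


@dataclass
class GpuRun:
    name: str
    peak_tflops: float   # vendor peak at the training precision
    mfu: float           # measured MFU as a fraction, e.g. 0.42
    hourly_price: float  # placeholder price per GPU-hour (hypothetical)

    @property
    def achieved_tflops(self) -> float:
        """Useful throughput: peak scaled by the measured MFU."""
        return self.peak_tflops * self.mfu

    @property
    def price_per_achieved_pflop_hour(self) -> float:
        """Price divided by achieved (not peak) PFLOPS."""
        return self.hourly_price / (self.achieved_tflops / 1000.0)


# MFU values are the out-of-the-box migration figures quoted above.
a100 = GpuRun("A100", peak_tflops=312.0, mfu=0.42, hourly_price=2.0)
h100 = GpuRun("H100 SXM", peak_tflops=989.0, mfu=0.28, hourly_price=4.0)

speedup = h100.achieved_tflops / a100.achieved_tflops

for run in (a100, h100):
    print(
        f"{run.name}: {run.achieved_tflops:.0f} achieved TFLOPS, "
        f"{run.price_per_achieved_pflop_hour:.2f} per achieved PFLOP-hour"
    )
print(f"H100 over A100 throughput: {speedup:.2f}x")
```

Even at the lower MFU, the H100 run delivers roughly 2.1× the achieved throughput of the A100 run, which is why the percentage and the absolute throughput need to be read together. Swapping in real prices and measured MFU values turns the same calculation into the cost-per-useful-FLOP comparison described above.
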