A team running molecular dynamics on a tuned CUDA kernel spends three weeks shaving 12% off kernel runtime, then discovers that swapping the neighbour-list data layout from array-of-structs to struct-of-arrays cuts total simulation time by another 40%. That second win was not waiting in the kernel; it was waiting in the algorithm. Drug discovery workloads are full of this asymmetry: the biggest GPU speedups come from algorithmic restructuring (data layout, batching strategy, compute decomposition), not from squeezing the last few percent out of a single kernel.

This spoke explores where that boundary sits across molecular dynamics, virtual screening, and deep-learning property prediction, three workloads that dominate pharmaceutical R&D compute budgets. The structural reasoning behind the choice is covered in the parent CCU, "when algorithmic restructuring gives bigger GPU speedups than kernel tuning".

Why algorithmic structure dominates kernel tuning in drug discovery

The architecture argument is well known: a modern NVIDIA H100 or A100 GPU exposes thousands of CUDA cores designed for SIMT execution, while a CPU socket exposes dozens of general-purpose cores. The practical consequence is less obvious. A kernel that achieves 60% of peak FLOPs but reads its inputs in the wrong data layout will lose more performance to memory-bandwidth stalls than any further register-allocation tuning can recover. We see this pattern regularly in drug-discovery code: the kernel is not the bottleneck; the surrounding data movement is.

Friedrichs et al. (2009) reported one of the early concrete cases: molecular dynamics on early CUDA hardware ran roughly 100× faster than equivalent CPU code, but most of that speedup came from restructuring the integration loop and neighbour-list representation, not from kernel-level tricks. Stone et al.
(2010) made the same observation about NAMD and VMD: porting an algorithm to GPU without reshaping its memory access patterns leaves most of the available speedup on the table.

When you have hit the kernel-tuning ceiling

A few signals usually appear together:

- Nsight Compute (or its ncu command-line interface) reports memory throughput near peak HBM bandwidth, but compute utilisation stays modest.
- Profiler traces show long stalls on LDG / STG instructions or on __shfl_sync waits, not on arithmetic.
- Doubling the SM clock changes wall-clock time by less than 5%.
- The kernel already uses shared memory, coalesced loads, and reasonable occupancy.

When those four conditions hold simultaneously, further kernel tuning tends to yield diminishing returns. The next-largest speedup lives at the algorithm level.

Three algorithmic levers that move the needle

Data layout

The single highest-leverage change in molecular dynamics and docking codes is the representation of atomic coordinates and neighbour lists. Array-of-structs layouts ({x,y,z,charge,type} per atom) feel natural in C++ but defeat memory coalescing, because adjacent threads read non-adjacent bytes. Struct-of-arrays ({all x}, {all y}, {all z}) lets a warp of 32 threads issue one coalesced load. The kernel may look almost identical; the throughput is not.

For deep-learning compound property prediction, the equivalent lever is tensor layout (NHWC versus NCHW) or the choice of mixed-precision storage. Goh et al. (2017) noted that practitioners porting cheminformatics models to GPU often gained more from precision and layout decisions than from optimising the convolution kernels themselves.

Batching strategy

Virtual screening, which docks millions of compounds against a target protein, is embarrassingly parallel at the compound level but not at the kernel level. A naive implementation launches one kernel per compound and starves the GPU on launch overhead.
Batched docking, where hundreds of compound poses share a single kernel launch, often produces order-of-magnitude wall-clock improvements without touching the inner loop. Brown et al. (2020) describe this as a structural property of high-throughput screening pipelines: the right batch size for a given GPU dominates per-kernel optimisation.

For training deep-learning property models, batch size interacts with three things at once: SM occupancy (too small a batch leaves SMs idle), HBM bandwidth (too large a batch spills activation memory), and gradient noise (batch size affects optimiser dynamics). On an A100 80GB training a graph neural network for ADMET prediction, the right batch size is often the difference between a 12-hour and a 3-hour epoch.

Compute decomposition

Some calculations are structurally hostile to the GPU until decomposed differently. Long-range electrostatics in molecular dynamics is the classic case: the naive O(N²) all-pairs formulation maps onto the GPU but scales badly. Particle-Mesh Ewald (PME) or Fast Multipole (FMM) decompositions are algorithmically more complex but match the hardware's strengths. Starting from MD code that already uses PME and then tuning its kernels yields a different, and much larger, final throughput than tuning kernels on the O(N²) version.

Where this leaves drug discovery teams

| Workload | Kernel-tuning ceiling | Algorithmic lever with bigger payoff |
|---|---|---|
| Molecular dynamics | Coalesced loads, shared-memory tiling done | Neighbour-list layout, PME vs all-pairs |
| Virtual screening / docking | Per-pose kernel optimised | Batched launches across compounds and poses |
| DL property / ADMET models | Mixed-precision GEMM tuned | Batch size, tensor layout, graph compilation (TorchScript, ONNX, TensorRT) |
| Genome alignment | Banded alignment vectorised | Read batching, suffix-array vs hash-index choice |

The pattern across these rows is consistent: kernel tuning is real engineering work, but its return is bounded. Algorithmic restructuring is where the next 10× tends to live.
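As a rough illustration of the data-layout lever in the molecular dynamics case, the Python sketch below counts how many 32-byte global-memory sectors one warp touches when each of its 32 threads reads a single 4-byte x coordinate. The 20-byte record size (five 4-byte fields for {x,y,z,charge,type}, with no padding) and the helper name `sectors_touched` are assumptions made for illustration; real structs may be padded differently, and caching can soften the effect in practice.

```python
SECTOR_BYTES = 32  # global-memory sector size on recent NVIDIA GPUs
WARP_SIZE = 32     # threads per warp
FLOAT_BYTES = 4    # one single-precision coordinate


def sectors_touched(stride_bytes, elem_bytes=FLOAT_BYTES, threads=WARP_SIZE):
    """Count distinct sectors hit when thread i reads elem_bytes at i * stride_bytes."""
    sectors = set()
    for i in range(threads):
        first = i * stride_bytes        # first byte read by thread i
        last = first + elem_bytes - 1   # last byte read by thread i
        sectors.update(range(first // SECTOR_BYTES, last // SECTOR_BYTES + 1))
    return len(sectors)


# Array-of-structs: consecutive x values sit one whole record apart
# (assumed 5 fields x 4 bytes = 20 bytes, hypothetical unpadded layout).
aos = sectors_touched(stride_bytes=20)

# Struct-of-arrays: consecutive x values are adjacent in memory.
soa = sectors_touched(stride_bytes=FLOAT_BYTES)

print(aos, soa, aos / soa)  # prints: 20 4 5.0
```

Under these assumptions the array-of-structs read pulls in five times as many sectors for the same 128 bytes of useful data. That is extra memory traffic that no amount of register-level tuning inside the kernel removes, which is why the layout change sits on the algorithmic side of the table.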
What a structured GPU performance analysis looks like

A useful audit in this domain is not a list of kernel timings. It is a classification: each intervention labelled as algorithmic or micro-level, with an estimated impact and an effort cost. The methodology side of this question is treated in "profiling GPU kernels to find the real bottleneck", and the cross-platform implications of these algorithmic choices are covered in "what cross-platform GPU performance portability actually requires".

Across our GPU engagements in life sciences and biotech, the audits that produce the largest follow-on speedups are the ones that explicitly separate “the kernel is slow” from “the algorithm is wrong for this hardware”. The two findings have very different remediation costs and very different ceilings. Conflating them is how teams end up spending engineering quarters on the wrong layer.

FAQ

When does algorithmic restructuring give a bigger GPU speedup than kernel-level tuning?
When the kernel already shows good occupancy and coalesced memory access but the workload is bottlenecked on memory bandwidth, launch overhead, or unfavourable data layout. In drug discovery, this is the common case: most large wins come from changing neighbour-list representation, batching strategy, or compute decomposition (PME, batched docking), not from further kernel optimisation.

How do I tell that my GPU code has hit its kernel-tuning ceiling?
Four signals together: the profiler reports memory throughput near peak HBM bandwidth but moderate compute utilisation; there are long stalls on global memory load/store instructions; raising the SM clock changes wall-clock time by less than 5%; and the kernel already uses shared memory, coalesced loads, and reasonable occupancy. When those co-occur, the next gain lives at the algorithm level.

Which algorithmic changes typically unlock the biggest speedups?
In molecular dynamics: data layout (struct-of-arrays vs array-of-structs) and switching long-range electrostatics from O(N²) to PME or FMM. In virtual screening: batched docking across compounds and poses to amortise kernel-launch overhead. In deep-learning property prediction: batch size tuning, tensor layout (NHWC vs NCHW), and mixed-precision storage decisions.

How does batch size interact with GPU occupancy and memory bandwidth in deep-learning workloads?
Batch size sets three things at once. Too small and SMs sit idle because there is not enough work to fill warps, so occupancy collapses. Too large and activation memory spills, forcing HBM traffic that the compute units cannot hide. Batch size also changes gradient noise, and with it optimiser dynamics. The right batch size is the one that fills SMs while keeping activations in the cache hierarchy as long as possible; it is typically found by sweeping, not by formula.

What does a structured GPU performance analysis look like beyond “make the kernel faster”?
It classifies each candidate intervention as algorithmic (data layout, batching, decomposition) or micro-level (occupancy, register pressure, instruction selection), estimates the impact and effort of each, and sequences them so that algorithmic changes happen before micro-level tuning. The micro-level work is then done on the right algorithm, not the wrong one.

References

Brown, N., Ertl, P. and Lewis, R. (2020) Artificial Intelligence in Drug Discovery. Journal of Medicinal Chemistry, 63(16), pp. 8657–8666.
Friedrichs, M.S., Eastman, P. and Vaidyanathan, V. (2009) Accelerating Molecular Dynamics Simulations on GPUs. Journal of Computational Chemistry, 30(6), pp. 864–872.
Goh, G.B., Hodas, N.O. and Vishnu, A. (2017) Deep Learning for Computational Chemistry. Journal of Chemical Information and Modeling, 57(8), pp. 1757–1772.
Stone, J.E., Hardy, D.J. and Phillips, J.C. (2010) GPU Computing in Molecular Modelling. Journal of Molecular Graphics and Modelling, 29(2), pp. 116–125.
Zou, J., Huss, M. and Abid, A.
(2019) A Primer on Deep Learning in Genomics. Nature Genetics, 51(1), pp. 12–18.

Image credits: Freepik.