Tensor Parallelism vs Pipeline Parallelism: Choosing the Right Strategy for Your GPU Cluster

Two ways to distribute a model across GPUs

When a model is too large to fit on a single GPU — or when you need more throughput than one GPU provides — you must distribute computation across multiple GPUs. The two primary strategies split the model differently, and choosing between them is rarely about preference. It is about which interconnect tier you actually have, and whether your workload is latency-sensitive or throughput-sensitive.

Tensor parallelism (TP) splits individual operations across GPUs. A single matrix multiplication is divided so each GPU computes a portion, then results are combined via all-reduce communication. Every GPU participates in every layer’s computation.
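To make the split concrete, here is a minimal sketch in plain Python (no GPU libraries) that simulates one sharded matrix multiplication. It assumes a row-parallel split of the weight matrix, which is only one common way to shard a linear layer. The helper names `matmul` and `row_parallel_matmul` are illustrative, and the final element-wise sum stands in for the all-reduce that a real collective library would perform.

```python
def matmul(a, b):
    """Plain matrix product of two row-major nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def row_parallel_matmul(x, w, num_gpus):
    """Simulate a row-parallel linear layer: shard the inner dimension, then sum."""
    inner = len(w)
    if inner % num_gpus != 0:
        raise ValueError("inner dimension must divide evenly across GPUs")
    shard = inner // num_gpus
    partials = []
    for rank in range(num_gpus):
        lo, hi = rank * shard, (rank + 1) * shard
        x_shard = [row[lo:hi] for row in x]  # this rank's columns of the activation
        w_shard = w[lo:hi]                   # the matching rows of the weight
        partials.append(matmul(x_shard, w_shard))
    # Stand-in for all-reduce: element-wise sum of every rank's partial result.
    rows, cols = len(x), len(w[0])
    return [[sum(p[i][j] for p in partials) for j in range(cols)] for i in range(rows)]


x = [[1, 2, 3, 4], [5, 6, 7, 8]]
w = [[1, 0, 2], [0, 1, 3], [4, 5, 6], [7, 8, 9]]
assert row_parallel_matmul(x, w, num_gpus=2) == matmul(x, w)
```

Each simulated GPU produces a full-sized but partial output, which is why the combining step has to touch every element and why its cost sits on the critical path.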

Pipeline parallelism (PP) splits model layers across GPUs. GPU 0 runs layers 1–10, GPU 1 runs layers 11–20, and so on. Each GPU runs complete operations but only for its assigned layers. Data flows through the pipeline sequentially.
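The layer assignment described above can be sketched as a small helper. This is an illustration only: `assign_layers` is a hypothetical name, and it balances stages by layer count, whereas real schedulers usually balance by per-layer compute and memory cost.

```python
def assign_layers(num_layers, num_stages):
    """Return a (first_layer, last_layer) pair per stage, 1-indexed and inclusive."""
    if not 1 <= num_stages <= num_layers:
        raise ValueError("need at least one layer per stage")
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 1
    for stage in range(num_stages):
        size = base + (1 if stage < extra else 0)  # early stages absorb the remainder
        stages.append((start, start + size - 1))
        start += size
    return stages


for gpu, (first, last) in enumerate(assign_layers(40, 4)):
    print(f"GPU {gpu}: layers {first}-{last}")
```

With 40 layers and 4 stages this reproduces the split in the text: GPU 0 gets layers 1-10, GPU 1 gets layers 11-20, and so on.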

In short: tensor parallelism splits individual operations across GPUs (low latency, high bandwidth requirement); pipeline parallelism splits model layers across GPUs (tolerates lower bandwidth, adds pipeline bubble overhead). The two are not interchangeable knobs — they sit on different points of the bandwidth-versus-latency frontier, and the cluster’s interconnect topology decides which one is even feasible.

How tensor and pipeline parallelism differ in practice

Dimension	Tensor parallelism	Pipeline parallelism
Communication pattern	All-reduce after every operation	Point-to-point between adjacent stages
Bandwidth requirement	Very high (NVLink-class, on the order of 600+ GB/s)	Moderate (PCIe or InfiniBand sufficient)
Latency per token	Low (all GPUs compute simultaneously)	Higher (sequential stage execution)
GPU utilisation	High (all GPUs always active)	Reduced by pipeline bubble (idle time between micro-batches)
Scaling limit	Typically 4–8 GPUs per TP group; communication overhead grows beyond that	Limited by bubble fraction and memory per stage
Memory efficiency	Each GPU holds partitioned layer weights	Each GPU holds only its assigned layers’ full weights
Typical framework support	Megatron-LM, DeepSpeed, vLLM, TensorRT-LLM	DeepSpeed, PipeDream-style schedulers, Megatron-LM

The comparison describes the structural tradeoffs we see across deployments, not figures from a single benchmark run. The bandwidth numbers are framing references, not promises: the achievable communication rate on any specific cluster depends on the NIC, switch fabric, and topology.

When does tensor parallelism win?

TP is optimal when:

GPUs are connected via high-bandwidth interconnect (NVLink within a node, typically around 900 GB/s on H100 per NVIDIA’s published specifications)
Latency matters more than throughput (real-time inference, interactive applications, chat-style serving)
The model fits across a small number of GPUs (2–8) with TP alone
Every GPU should contribute to every token’s computation

The constraint is structural. TP puts all-reduce communication on the critical path of every layer; Megatron-style sharding, for example, typically needs one all-reduce after the attention block and another after the MLP block in each forward pass. On NVLink, each of these adds microseconds. On PCIe (roughly 64 GB/s for Gen4 x16) or on cross-node InfiniBand (commonly 200–400 Gb/s per port, which is 25–50 GB/s), communication time dominates computation time, and TP collapses into something that runs but no longer accelerates. In our experience, teams who try to extend TP across node boundaries usually discover this the hard way, by watching utilisation drop without throughput rising.
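A rough way to see how the link tier changes the cost is a bandwidth-only estimate of a single all-reduce. The sketch below assumes a ring all-reduce, in which each GPU sends about 2(n − 1)/n times the message size. The function name and the workload size are assumptions, and the headline link figures from the text are treated as usable bandwidth, which is optimistic. It ignores per-message latency and any overlap with computation.

```python
def ring_allreduce_seconds(message_bytes, num_gpus, bandwidth_bytes_per_s):
    """Bandwidth-only estimate for a ring all-reduce; ignores latency and overlap."""
    if num_gpus < 2:
        return 0.0
    traffic = 2 * (num_gpus - 1) / num_gpus * message_bytes  # bytes each GPU sends
    return traffic / bandwidth_bytes_per_s


# Assumed workload: 4096 tokens x hidden size 8192 in 16-bit precision.
message_bytes = 4096 * 8192 * 2

# Headline figures quoted in the text, converted to bytes per second.
links = {
    "NVLink (H100)": 900e9,
    "PCIe Gen4 x16": 64e9,
    "InfiniBand 400 Gb/s": 400e9 / 8,
}
for name, bandwidth in links.items():
    micros = ring_allreduce_seconds(message_bytes, 8, bandwidth) * 1e6
    print(f"{name}: {micros:.0f} microseconds per all-reduce")
```

Because the estimate is linear in bandwidth, the ratio between tiers is simply the ratio of their link speeds; multiply by the number of all-reduces per forward pass to see how quickly the slower tiers add up.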

When does pipeline parallelism win?

PP is optimal when:

GPUs are connected via lower-bandwidth links (cross-node InfiniBand, or PCIe within a node without NVSwitch)
The model is large enough to require many GPUs (16+)
Throughput matters more than per-request latency
The pipeline bubble can be amortised by filling it with micro-batches

The pipeline bubble is the unavoidable cost. When a pipeline starts, only GPU 0 is active; the others wait their turn. At the end of a batch, GPUs drain sequentially. The idle fraction is approximately (p − 1) / (p − 1 + m), where p is the number of pipeline stages and m is the number of micro-batches. With 8 stages and 32 micro-batches, that is 7 / 39, a bubble overhead of about 18%. With fewer micro-batches, which is exactly what inference traffic often produces, the bubble grows quickly: the same 8 stages with only 4 micro-batches sit idle about 64% of the time. This is one of several reasons inference deployments cannot simply inherit a training cluster’s parallelism plan; the broader pattern is covered in our analysis of why training and inference are fundamentally different workloads.
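The bubble estimate is easy to tabulate. The following sketch implements the approximation quoted above; `bubble_fraction` is an illustrative name, and the formula is the simple fill-and-drain estimate, not a model of any particular scheduler.

```python
def bubble_fraction(stages, micro_batches):
    """Approximate idle fraction of a pipeline: (p - 1) / (p - 1 + m)."""
    if stages < 1 or micro_batches < 1:
        raise ValueError("stages and micro_batches must be positive")
    return (stages - 1) / (stages - 1 + micro_batches)


for m in (1, 4, 8, 32, 128):
    print(f"p=8, m={m:>3}: bubble = {bubble_fraction(8, m):.1%}")
```

For 8 stages the estimate falls from 87.5% with a single micro-batch to roughly 18% at 32 micro-batches, which is why the number of micro-batches a workload can supply matters as much as the stage count.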

Is pipeline parallelism worth the bubble cost?

That depends on whether the alternative is feasible at all. If the model does not fit in a single TP group’s memory budget, pipeline parallelism is not a choice — it is the only path to running the model. The right question is not “TP or PP” but “what is the minimum bubble fraction I can tolerate given my latency target and my interconnect tier?”

The hybrid reality: TP within nodes, PP across nodes

Production deployments of large models almost always use both strategies simultaneously:

TP within a node, leveraging NVLink’s high bandwidth for low-latency intra-operation communication
PP across nodes, tolerating InfiniBand’s lower bandwidth for inter-stage communication

A 32-GPU deployment across 4 nodes commonly runs TP=8 (within each 8-GPU node) and PP=4 (across the 4 nodes). This maps the parallelism dimensions onto the interconnect tiers that can actually sustain them. Megatron-LM and DeepSpeed both support this kind of 2D/3D parallelism for very large models, and frameworks like vLLM and TensorRT-LLM expose TP degree as a first-class deployment parameter for inference.
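The TP=8, PP=4 layout can be sketched as a mapping from global GPU rank to grid coordinates. This is one possible ordering, chosen so that TP ranks are contiguous; actual frameworks define their own rank orderings, and the function names here are hypothetical. It also assumes ranks are numbered node by node (ranks 0-7 on node 0, ranks 8-15 on node 1, and so on).

```python
def rank_to_coords(rank, tp_degree, pp_degree):
    """Map a global GPU rank to (pp_stage, tp_rank), with TP ranks contiguous."""
    if not 0 <= rank < tp_degree * pp_degree:
        raise ValueError("rank outside the TP x PP grid")
    return divmod(rank, tp_degree)


def tp_groups(tp_degree, pp_degree):
    """List the global ranks in each TP group, one group per pipeline stage."""
    return [list(range(s * tp_degree, (s + 1) * tp_degree)) for s in range(pp_degree)]


GPUS_PER_NODE = 8  # assumption: ranks are numbered node by node
for group in tp_groups(tp_degree=8, pp_degree=4):
    nodes = {rank // GPUS_PER_NODE for rank in group}
    assert len(nodes) == 1  # every TP all-reduce stays inside one node
```

The assertion is the point of the layout: no TP group straddles a node boundary, so only the point-to-point pipeline traffic crosses the slower inter-node links.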

The optimal strategy depends on interconnect bandwidth and model architecture, not just GPU count. A model that achieves X tokens per second with TP=4 on NVLink-connected A100s will reach a very different number with TP=4 on PCIe-connected A100s, because the all-reduce that was nearly free on NVLink now dominates the critical path. Performance numbers do not transfer across topologies. This is one of the reasons benchmark consumers should treat “tokens/sec on N GPUs” as incomplete information until the parallelism plan and interconnect tier are named.
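One way to enforce that discipline is to record the parallelism plan and interconnect alongside every throughput figure. The sketch below is a hypothetical record type, not a schema from any benchmark suite; the field names are assumptions, and the throughput value in the usage lines is a placeholder rather than a measurement.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ThroughputClaim:
    """A reported tokens/sec figure plus the context needed to interpret it."""
    tokens_per_second: float
    num_gpus: int
    tp_degree: Optional[int] = None
    pp_degree: Optional[int] = None
    interconnect: Optional[str] = None  # e.g. "NVLink", "PCIe Gen4"

    def missing_context(self):
        """Return the names of the fields a reader still needs."""
        needed = ("tp_degree", "pp_degree", "interconnect")
        return [name for name in needed if getattr(self, name) is None]


def comparable(a, b):
    """Two claims compare only if both are complete and share plan and link tier."""
    if a.missing_context() or b.missing_context():
        return False
    return (a.tp_degree, a.pp_degree, a.interconnect) == (
        b.tp_degree, b.pp_degree, b.interconnect)


bare = ThroughputClaim(tokens_per_second=1000.0, num_gpus=4)  # placeholder value
print(bare.missing_context())  # ['tp_degree', 'pp_degree', 'interconnect']
```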

Data parallelism: the third dimension

Data parallelism (DP) runs identical model replicas on different GPUs, each processing different input batches. It is the simplest of the three: each GPU holds a complete model copy and, in training, synchronises gradients with an all-reduce (typically via NCCL) at the end of each step; inference replicas need no synchronisation at all. DP scales throughput close to linearly with GPU count when communication is well overlapped with computation, but it requires each GPU to hold the full model in memory, which is exactly the constraint TP and PP exist to break.
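The gradient synchronisation step can be illustrated with a toy example in plain Python. It assumes a one-parameter model y = w * x with a mean-squared-error loss and two replicas that see equal-sized halves of the batch; `all_reduce_mean` is a stand-in for the averaging all-reduce a collective library would perform, not a real API.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean squared error for the one-parameter model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def all_reduce_mean(values):
    """Stand-in for a gradient all-reduce: every replica ends up with the mean."""
    return sum(values) / len(values)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

# Two replicas hold the same weight w but see different halves of the batch.
local_grads = [grad_mse(w, xs[:2], ys[:2]), grad_mse(w, xs[2:], ys[2:])]
synced = all_reduce_mean(local_grads)

# With equal-sized shards, the synced gradient matches the full-batch gradient.
assert abs(synced - grad_mse(w, xs, ys)) < 1e-12
```

The replicas never exchange activations or weights during the step, only gradients at the end, which is why DP tolerates slower links better than TP does.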

The full 3D combination is:

TP within nodes for latency-bound intra-operation work
PP across node groups for fitting the model
DP across replica groups for throughput

The specific combination for your deployment depends on model size, available GPUs, interconnect topology, and whether you optimise for latency or throughput. Inference-heavy deployments usually push TP and DP and minimise PP (because bubbles hurt tail latency); training-heavy deployments often accept a larger PP degree because they can amortise the bubble with hundreds of micro-batches and they care about steady-state throughput.

How do you actually pick a strategy?

A practical decision rubric, in order:

1. Does the model fit on one GPU? If yes, start with DP and stop. Parallelism within the model is overhead you do not need.
2. Does it fit in a single node’s aggregate memory? If yes, TP within the node is almost always the right starting point.
3. Does it require multiple nodes? Then TP within each node, PP across nodes. Choose the TP degree to match the NVLink/NVSwitch domain.
4. Is throughput the goal? Add DP across replica groups once a single TP×PP group is saturated.
5. Is latency the goal? Maximise TP degree within the bandwidth domain, minimise PP stages, and accept a smaller deployment footprint per replica.
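The memory-fit part of the rubric can be sketched as a function. This is a deliberately crude illustration: `pick_plan` and its parameters are hypothetical, it compares weight size against raw GPU memory with no headroom for activations, KV cache, or optimizer state, and it always sets the TP degree to the full node size. The latency-versus-throughput tuning in the last two items is left to the operator.

```python
import math


def pick_plan(model_gb, gpu_mem_gb, gpus_per_node, num_nodes):
    """Apply the memory-fit steps of the rubric; returns TP, PP, and DP degrees."""
    node_mem_gb = gpu_mem_gb * gpus_per_node
    if model_gb <= gpu_mem_gb:
        # Fits on one GPU: replicate it everywhere with DP.
        return {"tp": 1, "pp": 1, "dp": gpus_per_node * num_nodes}
    if model_gb <= node_mem_gb:
        # Fits in one node: shard with TP and replicate once per node.
        return {"tp": gpus_per_node, "pp": 1, "dp": num_nodes}
    # Spans nodes: TP inside each node, PP across nodes.
    pp = math.ceil(model_gb / node_mem_gb)
    if pp > num_nodes:
        raise ValueError("model does not fit in the cluster's aggregate memory")
    # Any nodes left over after one TP x PP group become extra replicas.
    return {"tp": gpus_per_node, "pp": pp, "dp": num_nodes // pp}


print(pick_plan(model_gb=2000, gpu_mem_gb=80, gpus_per_node=8, num_nodes=4))
# {'tp': 8, 'pp': 4, 'dp': 1}
```

A production planner would replace the raw memory comparison with a per-GPU budget that includes runtime state, and would consider TP degrees smaller than the node size when the model allows it.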

Step 3 is where most procurement-driven mistakes happen. Teams buy a cluster optimised for one parallelism plan and discover their workload needs another — most often, a training cluster (PP-heavy, throughput-tuned) being asked to serve interactive inference (TP-heavy, latency-tuned). LynxBench AI treats the parallelism strategy — tensor, pipeline, hybrid — together with the interconnect topology as part of the AI Executor specification, because the same model on the same GPUs reaches different throughput under different partitionings. Before accepting any large-model performance claim as evidence, ask: do the parallelism strategy, interconnect bandwidth, and runtime scheduler match the deployment’s hardware-software stack, or was the published number achieved on a topology the procurement cannot reproduce?

Frequently Asked Questions

Why is tensor parallelism so sensitive to interconnect bandwidth?

Tensor parallelism requires an all-reduce communication after every tensor operation, so the GPUs are constantly exchanging partial results on the critical path. On NVLink-class links (around 900 GB/s on H100 per NVIDIA’s specs) that exchange adds microseconds; on PCIe Gen4 (~64 GB/s) or cross-node InfiniBand (200–400 Gb/s per port) it dominates the computation it was meant to accelerate. That is why TP usually stays inside a 4–8 GPU NVLink/NVSwitch domain.

What is the pipeline bubble and how do I estimate it?

The pipeline bubble is the idle time when a pipeline fills up and drains — at the start only GPU 0 is active, at the end GPUs finish sequentially. The idle fraction is approximately (p − 1) / (p − 1 + m), where p is pipeline stages and m is micro-batches; 8 stages with 32 micro-batches lands around 18%. Because inference traffic produces few micro-batches, the bubble grows quickly, which is why latency-bound deployments minimise pipeline depth.

How do you choose the TP degree for a multi-node deployment?

Match the TP degree to the NVLink or NVSwitch domain: keep all-reduce traffic inside the high-bandwidth tier and let pipeline parallelism span node boundaries. A 32-GPU deployment over 4 nodes commonly runs TP=8 within each node and PP=4 across them. Pushing TP past the NVLink domain onto PCIe or InfiniBand drops utilisation without raising throughput.

How do hybrid TP, PP, and DP fit together in production?

Production deployments of large models combine all three: TP within a node to exploit NVLink for low-latency intra-operation work, PP across nodes to fit the model under InfiniBand’s lower bandwidth, and DP across replica groups for throughput. The exact mix depends on model size, GPU count, topology, and whether you optimise for latency or throughput.