November 19, 2025

Behind the Stack Ep. 12 - Understanding Model Parallelism

Jamie Dborin

When you deploy large language models, one of the first constraints you’ll hit isn’t algorithmic - it’s hardware. Models today can easily exceed the memory of a single GPU.

That’s where model parallelism comes in.

In this episode of Behind the Stack, we’ll unpack what model parallelism means, the two main types used in inference (tensor and pipeline parallelism), and when you should use each.

What Is Model Parallelism?

Model parallelism is the approach you reach for when a model is too large to fit on a single GPU - or when you want to make better use of multiple GPUs without duplicating all the weights.

Instead of replicating the same model on every GPU (as in data parallelism), model parallelism splits the model itself across GPUs. Each GPU holds part of the model and runs part of the forward pass.

You’ll encounter it in two main situations:

  • When models are too large to fit in one GPU’s VRAM (for example, mixture-of-experts or 200B+ parameter models).
  • When you have several smaller GPUs (like A10s or L4s) and want to maximize utilization by avoiding weight duplication.

In both cases, model parallelism lets you treat multiple GPUs as a single, larger compute unit.
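
To make “too large to fit” concrete, here’s a rough back-of-envelope check. It’s a sketch only: it assumes FP16/BF16 weights at 2 bytes per parameter and ignores the KV cache, activations, and framework overhead, which all add more on top.

```python
# Back-of-envelope memory check: can the weights alone fit on one GPU?
# Assumes FP16/BF16 weights (2 bytes per parameter) and ignores the KV
# cache, activations, and framework overhead, which all add more.

def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory needed just to hold the model weights."""
    return num_params * bytes_per_param / 1e9

gpu_vram_gb = 80  # e.g. a single 80 GB accelerator

for params in (8e9, 70e9, 200e9):
    needed = weight_memory_gb(params)
    min_gpus = -(-needed // gpu_vram_gb)  # ceiling division
    print(f"{params / 1e9:>5.0f}B params -> ~{needed:>6.0f} GB of weights, "
          f"needs at least {int(min_gpus)} x {gpu_vram_gb} GB GPUs")
```

Even before the KV cache, a 200B-parameter model needs several 80 GB GPUs just to hold its weights - which is exactly where model parallelism comes in.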

The Two Main Types: Tensor and Pipeline Parallelism

There are many ways to split a model across devices, but in modern inference engines, two dominate: tensor parallelism and pipeline parallelism.

They solve different problems - one focuses on performance within a single model invocation, the other on overall system throughput.

Tensor Parallelism

At its core, tensor parallelism divides the tensors inside each layer (typically weight matrices) across GPUs.

Language models spend most of their time performing massive matrix multiplications - multiplying input activations by billions of stored weights. When those weights are too large for one GPU, we can split them across two or more devices.

Each GPU computes a partial result on its slice of the matrix, and the results are summed to form the final output. Libraries like NVIDIA’s NCCL handle this cross-GPU communication efficiently.
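
Here’s a minimal sketch of that idea, simulated on CPU with PyTorch: the weight matrix is split row-wise across two hypothetical GPUs, each computes a partial matmul on its shard, and summing the partials (the job an NCCL all-reduce does in a real deployment) reproduces the full result.

```python
# Minimal sketch of tensor parallelism for one linear layer, simulated on
# CPU with PyTorch. Assumes a row-wise split of the weight matrix across
# two "GPUs"; in a real engine each shard lives on its own device and the
# final sum is an NCCL all-reduce rather than a local add.
import torch

torch.manual_seed(0)
d_in, d_out, batch = 8, 4, 2

x = torch.randn(batch, d_in)   # input activations
w = torch.randn(d_in, d_out)   # full weight matrix (never held on one device
                               # in a real tensor-parallel setup)

# Shard the weight along the input dimension: each "GPU" holds half the rows.
w0, w1 = w[: d_in // 2], w[d_in // 2 :]
x0, x1 = x[:, : d_in // 2], x[:, d_in // 2 :]

# Each device computes a partial matmul on its shard...
partial0 = x0 @ w0   # would run on GPU 0
partial1 = x1 @ w1   # would run on GPU 1

# ...and an all-reduce sums the partial results into the final output.
y_parallel = partial0 + partial1
y_reference = x @ w

print(torch.allclose(y_parallel, y_reference, atol=1e-6))  # True
```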

This approach has two big benefits:

  1. Larger model capacity - you can store and run models that wouldn’t otherwise fit on a single GPU.
  2. Higher throughput - because each GPU loads only a portion of the weights, you can often double effective memory bandwidth and compute utilization.

In theory, if you’re strongly bandwidth-bound (as in decoding), two GPUs can achieve up to 2× faster inference. Likewise, during compute-bound phases (like prefill), more GPUs mean more floating-point operations available in parallel.

In practice, you won’t see perfect linear scaling - communication overhead adds latency. But tensor parallelism can provide substantial gains, especially within a single tightly connected node.
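
A quick illustrative calculation shows where the “up to 2×” figure comes from. Treating decoding as purely bandwidth-bound - every generated token requires streaming all the weights from GPU memory once - tokens per second is roughly memory bandwidth divided by weight bytes. The numbers below (a ~13B model in FP16, ~2 TB/s per GPU) are assumptions for illustration only.

```python
# Rough model of bandwidth-bound decoding: every generated token requires
# streaming all weights from GPU memory once, so tokens/sec is roughly
# memory bandwidth / weight bytes. Numbers are illustrative.

weights_gb = 26          # e.g. a ~13B-parameter model in FP16
bandwidth_gb_s = 2000    # assumed per-GPU memory bandwidth (~2 TB/s)

def decode_tokens_per_sec(num_gpus: int) -> float:
    # With tensor parallelism each GPU streams only its shard of the
    # weights, so the per-token read time shrinks with the GPU count.
    per_gpu_gb = weights_gb / num_gpus
    return bandwidth_gb_s / per_gpu_gb

for n in (1, 2, 4):
    print(f"TP={n}: ~{decode_tokens_per_sec(n):.0f} tokens/s "
          f"(ideal, ignoring communication overhead)")
```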

Communication Matters

Each step of a tensor-parallel layer involves cross-GPU communication to aggregate results. That means interconnect speed is critical.

Within a single node (GPUs connected via NVLink or PCIe), this is manageable. Across nodes, where communication is slower, performance drops sharply.

That’s why most setups use tensor parallelism within a node, and other forms of parallelism (like pipeline) between nodes.
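
To get a feel for the numbers, here’s a rough estimate of the per-token communication cost, assuming two FP16 all-reduces of hidden-size activations per transformer layer. The model shape and interconnect bandwidths are illustrative assumptions, and the calculation ignores per-message latency, all-reduce algorithm details, and overlap with compute.

```python
# Rough per-token communication cost of tensor parallelism, assuming two
# all-reduces of hidden-size activations per transformer layer (attention
# output and MLP output) in FP16. Bandwidth figures are illustrative.

hidden_size = 8192
num_layers = 80
bytes_per_value = 2          # FP16 activations
allreduces_per_layer = 2

bytes_per_token = (hidden_size * bytes_per_value
                   * allreduces_per_layer * num_layers)

interconnects_gb_s = {
    "NVLink (intra-node)": 400,
    "PCIe Gen4 (intra-node)": 32,
    "100 GbE (inter-node)": 12.5,
}

for name, bw in interconnects_gb_s.items():
    # Very rough: ignores latency per message and overlap with compute.
    us = bytes_per_token / (bw * 1e9) * 1e6
    print(f"{name:>24}: ~{us:,.0f} µs of communication per decoded token")
```

The absolute numbers matter less than the ratio: the same traffic that is cheap over NVLink becomes a significant per-token tax over slower links.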

Pipeline Parallelism

Pipeline parallelism takes a different approach: instead of splitting layers across GPUs, it splits stacks of layers.

Imagine a 60-layer transformer:

  • GPU 0 handles layers 0–19
  • GPU 1 handles layers 20–39
  • GPU 2 handles layers 40–59

A single request flows through each stage in sequence, from one GPU to the next. This lets you fit larger models, but used naively it’s inefficient - at any given moment, only one GPU is busy while the others sit idle.

To fix that, we pipeline requests. While GPU 1 is processing the first batch, GPU 0 can start the next batch. Over time, this fills the pipeline so all GPUs work concurrently.

The result:

  • No latency improvement for individual requests (each still passes through all stages).
  • Higher overall throughput, since GPUs stay busy serving multiple micro-batches in flight.

This “assembly line” approach is great for high-throughput serving, especially when you’re not bound by memory bandwidth but by sheer compute volume.
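
The schedule is easy to see in a toy simulation: three stages, six micro-batches, and each stage working on whichever micro-batch has reached it. The pipeline takes a couple of steps to fill, after which all GPUs are busy - while each individual micro-batch still takes the full three steps end to end.

```python
# Toy simulation of pipeline parallelism: 3 stages (groups of layers on
# different GPUs) processing 6 micro-batches. Once the pipeline is full,
# all stages work concurrently; each micro-batch still takes 3 steps
# end-to-end, so per-request latency is unchanged.

num_stages = 3
num_microbatches = 6

for step in range(num_stages + num_microbatches - 1):
    busy = []
    for stage in range(num_stages):
        mb = step - stage          # micro-batch this stage works on now
        if 0 <= mb < num_microbatches:
            busy.append(f"GPU{stage}:mb{mb}")
        else:
            busy.append(f"GPU{stage}:idle")
    print(f"step {step}: " + "  ".join(busy))
```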

Tensor vs. Pipeline Parallelism — When to Use Each

While both approaches let you split a model across multiple GPUs, they optimise for different goals. Here’s how to think about when each one makes sense:

Tensor Parallelism

Use tensor parallelism when:

  • You care about per-request latency or performance.
  • Your workload is bandwidth-bound, such as long decoding sequences.
  • You’re running within a single node where GPUs have fast interconnects (NVLink, high-bandwidth PCIe).
  • You want to maximize the effective memory bandwidth and FLOPs for a single forward pass.

Characteristics:

  • Splits tensors within layers (e.g., weight matrices).
  • Requires frequent cross-GPU communication to combine partial results.
  • Can deliver significantly faster token generation for a single request in bandwidth-bound scenarios.
  • Communication cost increases with the number of GPUs, so it works best inside one node.

Pipeline Parallelism

Use pipeline parallelism when:

  • You want to increase overall system throughput rather than single-request latency.
  • You’re running across multiple nodes with slower interconnects.
  • Your workload has many concurrent requests that can “fill the pipeline.”
  • The model is too large to fit into a single node even with tensor parallelism.

Characteristics:

  • Splits the model by layers, assigning consecutive blocks of layers to different GPUs.
  • Individual requests still incur the same latency as running on a single GPU — pipeline parallelism does not speed up a single forward pass.
  • Great at keeping many GPUs busy simultaneously once the pipeline is full.
  • More robust to communication overhead, because cross-node transfers of activations are infrequent and easier to hide behind compute.

How large-scale deployments combine both

For extremely large models:

  • Use tensor parallelism within each node (fast intra-node links).
  • Use pipeline parallelism between nodes (slower inter-node links).

This hybrid approach reduces communication overhead while still enabling models far larger than a single node’s memory capacity.
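
In practice this usually comes down to two configuration knobs in your inference engine. As a sketch, in vLLM’s Python API the parameters are tensor_parallel_size and pipeline_parallel_size (the model name below is illustrative, and multi-node launches typically also require a distributed runtime such as Ray - check your engine’s documentation for the exact setup).

```python
# Sketch of a hybrid layout in an inference engine such as vLLM: tensor
# parallelism across the GPUs inside each node, pipeline parallelism across
# nodes. Parameter names follow vLLM's Python API; exact flags and
# multi-node launcher requirements depend on your engine and version.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-405B-Instruct",  # illustrative model choice
    tensor_parallel_size=8,    # 8 GPUs per node share each layer's tensors
    pipeline_parallel_size=2,  # 2 nodes each hold a contiguous block of layers
)
```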

Wrapping Up

Model parallelism is what allows today’s frontier models to exist at all.

Tensor parallelism helps you fit and accelerate models within a node.
Pipeline parallelism helps you scale and utilize hardware across nodes.

Together, they’re the foundation of distributed inference - the invisible architecture that keeps large models running smoothly.

