Inference
Pipeline Parallelism
Quick Answer
Assigning different layers of the model to different GPUs so that multiple microbatches can be processed simultaneously at different stages.
Pipeline parallelism splits the model by depth: each GPU holds a contiguous group of layers and acts as one stage of the pipeline. By feeding the batch through as a sequence of microbatches, the stages work simultaneously: early GPUs start on the next microbatch while later GPUs are still finishing the previous one. This overlap keeps the hardware busy, but the pipeline cannot be full while it ramps up and drains, so some GPUs sit idle. This idle period, the pipeline "bubble," can be significant when the number of microbatches is small relative to the number of stages. As a result, pipeline parallelism improves throughput more than per-request latency. It is complementary to tensor parallelism, which splits individual layers across GPUs, and the optimal configuration depends on the hardware (interconnect bandwidth, GPU memory, and device count).
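The bubble cost can be estimated with a simple timing model. The sketch below is illustrative, not a real scheduler: it assumes a synchronous (GPipe-style) pipeline in which every stage takes one time unit per microbatch, and the function name `pipeline_stats` is our own.

```python
# Sketch: timing model for a synchronous pipeline, assuming each
# stage takes exactly one time unit per microbatch (a simplification).

def pipeline_stats(num_stages: int, num_microbatches: int) -> dict:
    """Total time, ideal time, and bubble fraction for a pipeline
    with `num_stages` GPUs and `num_microbatches` per batch."""
    # The pipeline needs (num_stages - 1) slots to fill, then one
    # slot per microbatch in steady state.
    total = num_stages + num_microbatches - 1
    # With no pipeline overhead the batch would take one slot per
    # microbatch.
    ideal = num_microbatches
    # Fraction of the schedule spent filling and draining the pipe.
    bubble_fraction = (total - ideal) / total
    return {"total": total, "ideal": ideal, "bubble": bubble_fraction}

# More microbatches shrink the bubble: with 4 stages,
# 4 microbatches waste 3/7 of the schedule, 16 waste only 3/19.
print(pipeline_stats(4, 4))
print(pipeline_stats(4, 16))
```

This is why inference servers split batches into many microbatches before feeding the pipeline: the bubble fraction is (stages - 1) / (stages + microbatches - 1), which shrinks as microbatches grow.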
Last verified: 2026-04-08