Inference
Tensor Parallelism
Quick Answer
Splitting individual weight tensors across multiple GPUs so that each device computes a slice of every layer.
Tensor parallelism splits computation across GPUs by sharding the weight tensors themselves: each GPU holds a slice of each layer's weights and computes a corresponding slice of that layer's output. Different layers are split along different dimensions (for example, one linear layer column-wise and the next row-wise, so that partial results can be combined with a single all-reduce). Because GPUs must exchange activations at these synchronization points, fast interconnects matter. Tensor parallelism enables serving models that do not fit in a single GPU's memory, making it practical for 70B-parameter and larger models. Communication overhead limits how far it scales, so it is usually confined to GPUs within one node, with 16 GPUs a typical upper bound. Tensor parallelism is complementary to pipeline parallelism, which instead places whole groups of layers on different devices.
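The column-then-row split can be sketched on CPU with plain matrix slicing. This is a toy simulation, not a real multi-GPU implementation: the "GPUs" are just list entries, and the final `sum` stands in for the all-reduce that real frameworks perform over the interconnect.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))    # batch of input activations
W1 = rng.standard_normal((8, 16))  # first linear layer
W2 = rng.standard_normal((16, 8))  # second linear layer
n_gpus = 2

# Column-parallel: each "GPU" holds a column slice of W1 and computes
# a slice of the hidden activations; no communication is needed yet.
W1_shards = np.split(W1, n_gpus, axis=1)
hidden_shards = [x @ w for w in W1_shards]

# Row-parallel: each "GPU" holds the matching row slice of W2 and
# produces a partial output; summing the partials stands in for the
# all-reduce across GPUs.
W2_shards = np.split(W2, n_gpus, axis=0)
partials = [h @ w for h, w in zip(hidden_shards, W2_shards)]
y_parallel = sum(partials)

# The sharded computation matches the single-device result.
y_single = (x @ W1) @ W2
assert np.allclose(y_parallel, y_single)
```

The equivalence holds because a column split of one matmul followed by a row split of the next is just block matrix multiplication; only the final sum requires the devices to communicate.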
Last verified: 2026-04-08