Inference

Tensor Parallelism

Quick Answer

Distributing a model's computation across multiple GPUs by splitting individual weight tensors into shards.

Tensor parallelism splits computation across GPUs by sharding individual weight tensors rather than whole layers. Different layer types are split along different dimensions: attention is typically partitioned by head, while MLP weight matrices are split column-wise and then row-wise. Each GPU computes a partial result from its local shard, and collective communication (usually an all-reduce) combines the partials into the full output. Because the weights are spread across devices, tensor parallelism enables running models larger than a single GPU's memory, which makes it practical for 70B+ models. Communication overhead limits scaling, so deployments typically stay within 8–16 GPUs connected by a high-bandwidth interconnect. Tensor parallelism is complementary to pipeline parallelism, which splits the model by layer instead.
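To make the split concrete, here is a minimal sketch of the classic column-parallel/row-parallel MLP pairing, simulated on one machine with NumPy. The shard count, tensor shapes, and variable names are illustrative assumptions, and the final sum stands in for the all-reduce that a real multi-GPU implementation would perform over the interconnect.

```python
# A minimal sketch of column-parallel + row-parallel tensor parallelism,
# simulated on one device with NumPy. All shapes and names are illustrative.
import numpy as np

rng = np.random.default_rng(0)
num_shards = 2                      # pretend we have 2 GPUs

X = rng.standard_normal((4, 8))     # activations: (batch, hidden)
W1 = rng.standard_normal((8, 16))   # first MLP weight
W2 = rng.standard_normal((16, 8))   # second MLP weight

# Column parallelism: split W1 along its output dimension. Each "GPU"
# computes an independent slice of the intermediate activations, so the
# matmul itself needs no communication.
W1_shards = np.split(W1, num_shards, axis=1)
H_shards = [X @ w for w in W1_shards]          # each: (4, 8)

# Row parallelism: split W2 along its input dimension so each "GPU"
# consumes only its local slice of H. Each produces a partial output of
# the full shape; an all-reduce (here: a plain sum) combines them.
W2_shards = np.split(W2, num_shards, axis=0)
partial_outputs = [h @ w for h, w in zip(H_shards, W2_shards)]
Y = sum(partial_outputs)                       # stands in for all-reduce

# The sharded result matches the unsharded computation.
assert np.allclose(Y, (X @ W1) @ W2)
print("sharded output matches:", Y.shape)
```

Pairing the splits this way is the point of the design: the column-parallel output is already laid out as the input the row-parallel matmul expects, so the two layers together need only one all-reduce at the end.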

Last verified: 2026-04-08
