Inference
Quantization
Quick Answer
Representing model weights and activations in lower precision (e.g., int8 instead of float32) to cut memory use and computation.
Quantization stores model weights (and often activations) in a lower-precision format such as int8, int4, or fp8 instead of float32. Quantized models use less memory, run faster, and consume less energy, and the quality degradation is often minimal. There are two main approaches: post-training quantization (PTQ) converts an already-trained model without retraining, making it practical for deployment, while quantization-aware training (QAT) simulates low-precision arithmetic during training so the model learns weights that are robust to quantization. Quantization schemes also differ in how they map values: symmetric schemes use a single scale factor and map zero exactly to zero, while asymmetric schemes add a zero-point offset, which handles skewed value ranges better at a small cost in compute.
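A minimal NumPy sketch of per-tensor int8 quantization, illustrating the symmetric and asymmetric schemes described above; the function names are illustrative, not from any particular library:

```python
import numpy as np

def quantize_symmetric(x: np.ndarray, bits: int = 8):
    """Symmetric quantization: a single scale factor, zero maps to zero."""
    qmax = 2 ** (bits - 1) - 1                    # 127 for int8
    scale = np.abs(x).max() / qmax                # per-tensor scale
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize_symmetric(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

def quantize_asymmetric(x: np.ndarray, bits: int = 8):
    """Asymmetric quantization: scale plus a zero-point offset,
    so the full uint8 range covers a skewed [min, max] interval."""
    qmin, qmax = 0, 2 ** bits - 1                 # 0..255 for uint8
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize_asymmetric(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    return (q.astype(np.float32) - zero_point) * scale

# Round-trip a random weight tensor and inspect the reconstruction error.
weights = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_symmetric(weights)
print("symmetric max error:", np.abs(weights - dequantize_symmetric(q, s)).max())
```

The round-trip error printed at the end is the quantization noise; for well-behaved weight distributions it is small relative to the weight magnitudes, which is why quality loss is often minimal in practice.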
Last verified: 2026-04-08