Inference
Cold Start
Quick Answer
The latency incurred while a model is loaded into memory before it can serve requests.
Cold start is the time required to load a model from storage into inference infrastructure before it can process requests. For large models (70B+ parameters), a cold start can take minutes, which makes it especially problematic for serverless deployments and dynamic scaling, where new instances spin up on demand and each one pays the full load latency. Warm pools, which keep loaded model instances ready to serve, avoid cold starts at the cost of idle resources. This trade-off makes cold start latency a key consideration in deployment design, and some providers optimize for it specifically; for serverless LLM services it is a critical metric.
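The cold-start vs. warm-pool trade-off can be sketched in a few lines of Python. This is an illustrative simulation, not a real serving stack: the class names, the simulated load time, and the pool API are all assumptions made for the example.

```python
import time

# Illustrative only: a stand-in for the seconds spent reading weights
# into memory. Real 70B+ models can take minutes.
SIMULATED_LOAD_SECONDS = 0.05

class ModelServer:
    """Simulates an inference server that must load a model before serving."""
    def __init__(self, name):
        self.name = name
        time.sleep(SIMULATED_LOAD_SECONDS)  # the cold start: weight loading

    def infer(self, prompt):
        return f"{self.name} -> {prompt}"

class WarmPool:
    """Keeps pre-loaded servers so requests can skip the cold start."""
    def __init__(self, name, size):
        self.name = name
        # Pay the load cost up front, before any request arrives.
        self.pool = [ModelServer(name) for _ in range(size)]

    def acquire(self):
        # Fall back to a cold start only when the pool is empty.
        return self.pool.pop() if self.pool else ModelServer(self.name)

    def release(self, server):
        self.pool.append(server)

# Cold path: the first request pays the full load latency.
t0 = time.perf_counter()
cold_server = ModelServer("large-70b")
cold_latency = time.perf_counter() - t0

# Warm path: the pool already paid the load cost, so acquisition is fast.
pool = WarmPool("large-70b", size=1)
t0 = time.perf_counter()
warm_server = pool.acquire()
warm_latency = time.perf_counter() - t0
pool.release(warm_server)

print(cold_latency > warm_latency)
```

The pool's idle servers are exactly the "wasted resources" the trade-off refers to: each entry in `self.pool` holds loaded weights in memory whether or not any request arrives.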
Last verified: 2026-04-08