LLM API Caching Strategies: Cut Costs Up to 90% in 2026
Quick answer: There are four distinct types of LLM caching, and most teams use only one (if any). Prompt caching alone can cut input costs by 80-90% on workloads with repeated context. Semantic caching can eliminate 20-40% of API calls entirely. Both are available today with minimal engineering effort.
The four types of LLM caching
Before diving into implementation, it helps to understand that "caching" in the LLM context refers to four different things that operate at different layers.
Type 1: Prompt caching (provider-side)
OpenAI and Anthropic cache the KV (key-value) representations of your input tokens on their infrastructure. When you send a request that starts with the same prefix as a previous request, the model skips recomputing those tokens and serves the cached KV state.
Pricing impact:
- OpenAI: cached input tokens cost 50% less than standard
- Anthropic: cached input tokens cost 90% less (cache writes are 25% more, but amortize quickly)
Requirements:
- The cached prefix must be at least 1,024 tokens for OpenAI; Anthropic's minimum is also 1,024 tokens for most models (2,048 for Haiku-class models)
- Requests must arrive within the cache lifetime (5 minutes for Anthropic by default, refreshed on every hit; OpenAI's automatic cache typically persists 5-10 minutes, up to an hour during off-peak periods)
- The prefix must appear at the start of the prompt, before any variable content
Best use cases: RAG systems with a fixed document corpus, chatbots with a long system prompt, API wrappers that prepend the same instructions to every call.
Type 2: Response caching (application-side)
You cache the LLM's complete response at the application layer — in Redis, Memcached, or a database. When an identical or near-identical request comes in, serve the cached response without hitting the API.
Pricing impact: 100% cost savings for cache hits — you pay nothing for cached responses.
Requirements: Deterministic or near-deterministic request patterns. Works best with temperature=0 or very low temperature settings.
Best use cases: FAQ answering, product description generation from fixed templates, standard contract clause lookup.
Type 3: Semantic caching
Semantic caching extends response caching with vector similarity — instead of requiring exact input matches, it returns cached responses for semantically equivalent queries.
How it works: every incoming query is embedded and compared against the embeddings of previously cached queries. If the closest match exceeds a similarity threshold (typically 0.95 cosine similarity), the cached response is returned; otherwise, the API is called and the new response is cached.
Tools: GPTCache, Langchain's caching layer, Portkey's semantic cache, or a custom implementation with Pinecone/pgvector.
Pricing impact: Typically reduces API calls by 15-40% for user-facing applications with overlapping query patterns.
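The lookup logic can be sketched in a few lines. The bag-of-words "embedding" below is a stand-in so the example runs on its own — in production you would embed with a real model and store vectors in Pinecone or pgvector, as mentioned above:

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    # Stand-in for a real embedding model: bag-of-words token counts.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.entries = []  # list of (query embedding, cached response)

    def lookup(self, query: str):
        # Return the cached response for the most similar stored query,
        # or None if nothing clears the threshold (i.e., a cache miss).
        q = toy_embed(query)
        best_resp, best_sim = None, 0.0
        for emb, resp in self.entries:
            sim = cosine(q, emb)
            if sim > best_sim:
                best_resp, best_sim = resp, sim
        return best_resp if best_sim >= self.threshold else None

    def store(self, query: str, response: str):
        self.entries.append((toy_embed(query), response))
```

On a miss (`lookup` returns None), you call the API and `store` the result; on a hit, you skip the API call entirely — that skipped call is where the 15-40% savings comes from.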
Type 4: KV cache management (self-hosted)
For teams running their own models (on Ollama, vLLM, or similar), the KV cache is the GPU memory that stores attention keys and values. Tuning KV cache size and prefix caching settings at the vLLM level directly affects throughput and effective cost per query.
This is only relevant for self-hosted deployments. For managed APIs, the provider handles KV cache for you.
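For vLLM specifically, prefix caching is a launch-time setting. A sketch of a server launch, with an illustrative model name and memory settings you would tune for your own hardware:

```shell
# Enable automatic prefix caching so requests sharing a prompt prefix
# reuse KV-cache blocks instead of recomputing them.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192
```

A larger `--gpu-memory-utilization` leaves more room for cached KV blocks, at the cost of headroom for activation memory.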
Implementing prompt caching with Anthropic
Anthropic's cache_control mechanism is explicit — you mark which parts of your prompt to cache.
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a helpful assistant for analyzing financial documents.",
        },
        {
            "type": "text",
            "text": long_document_context,  # 10,000+ tokens
            "cache_control": {"type": "ephemeral"},  # Mark for caching
        },
    ],
    messages=[{"role": "user", "content": user_question}],
)
On the second call with the same long_document_context, the cached tokens cost 90% less. A 10,000-token context repeated 1,000 times costs $30 without caching and roughly $3 with it (at Claude Sonnet pricing: the first call pays the 25% cache-write premium, the remaining 999 read at the 90% discount).
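That arithmetic is easy to check directly, assuming Claude Sonnet's roughly $3/MTok input rate with cache writes billed at 1.25x and cache reads at 0.1x:

```python
CONTEXT_TOKENS = 10_000
CALLS = 1_000

INPUT = 3.00 / 1_000_000  # $/token, standard input
WRITE = 3.75 / 1_000_000  # $/token, first call writes the cache (1.25x)
READ = 0.30 / 1_000_000   # $/token, subsequent calls read it (0.1x)

# Without caching, every call pays full input price for the context.
uncached = CALLS * CONTEXT_TOKENS * INPUT

# With caching, one write plus 999 discounted reads.
cached = CONTEXT_TOKENS * WRITE + (CALLS - 1) * CONTEXT_TOKENS * READ

print(f"uncached: ${uncached:.2f}")  # uncached: $30.00
print(f"cached:   ${cached:.2f}")    # cached:   $3.03
```

Note the cache-write premium is negligible here: it adds under four cents across a thousand calls.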
Implementing prompt caching with OpenAI
OpenAI's caching is automatic — no configuration required. As long as your prompt prefix is identical across requests and long enough (≥1,024 tokens), caching activates automatically. You can verify cache usage by checking the usage.prompt_tokens_details.cached_tokens field in the response.
To maximize cache hits with OpenAI:
- Put all static content at the beginning of your messages array
- Put variable content (user query) at the end
- Use consistent system prompts across your application
- Don't randomize or timestamp static content
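To see what a cache hit is actually worth, a small helper (hypothetical, not part of the OpenAI SDK) can price a request from the usage fields mentioned above, assuming the 50% cached-token discount and an illustrative $2.50/MTok input rate:

```python
def cached_input_cost(prompt_tokens: int, cached_tokens: int,
                      price_per_mtok: float = 2.50) -> float:
    """Dollar cost of one request's input tokens.

    Pass in `usage.prompt_tokens` and
    `usage.prompt_tokens_details.cached_tokens` from a response;
    cached tokens bill at half the standard rate.
    """
    full_price_tokens = prompt_tokens - cached_tokens
    return (full_price_tokens + 0.5 * cached_tokens) * price_per_mtok / 1_000_000


# A 10,000-token prompt where a 9,000-token prefix hits the cache
# costs $0.01375 instead of $0.025 — a 45% saving on that request.
```

The saving scales with the cached fraction of the prompt, which is why the ordering rules above (static content first, variable content last) matter so much.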
Response caching in practice
A simple Redis-based response cache for LLM calls:
import hashlib
import json

import redis

r = redis.Redis()
CACHE_TTL = 86400  # 24 hours


def cached_llm_call(prompt: str, model: str, **kwargs) -> str:
    # Key on model + prompt so the same prompt sent to different
    # models doesn't collide.
    cache_key = hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()
    cached = r.get(cache_key)
    if cached:
        return json.loads(cached)
    response = call_llm_api(prompt, model, **kwargs)  # your API wrapper
    r.setex(cache_key, CACHE_TTL, json.dumps(response))
    return response
This is most effective at temperature=0. With temperature>0, responses are non-deterministic, so repeatedly serving one frozen response removes the variety users would otherwise see — use response caching only for tasks where output consistency is acceptable or desirable.
Which caching strategy to use when
| Use Case | Best Strategy | Expected Savings |
|---|---|---|
| RAG with fixed knowledge base | Prompt caching | 50-90% input cost |
| FAQ chatbot | Response caching + semantic cache | 40-80% total calls |
| Long system prompt bot | Prompt caching | 60-85% input cost |
| Document Q&A | Prompt caching (document as context) | Up to 90% |
| User analytics enrichment | Response caching | 20-60% depending on overlap |
| General chat | Prompt caching for system prompt | 30-50% input cost |
The bottom line
For most production LLM applications, combining prompt caching (for static context) with semantic caching (for recurring queries) is the highest-ROI optimization available. Teams that implement both typically cut API costs by 50-70% within a week of implementation.
Start with prompt caching — it requires zero infrastructure changes and delivers immediate savings. Add semantic caching once you have usage data showing recurring query patterns. Use the LLMversus cost calculator to model your specific savings before and after implementation.