
LLM API Caching Strategies: Cut Costs Up to 90% in 2026

Quick answer: There are four distinct types of LLM caching, and most teams use only one (if any). Prompt caching alone can cut input costs by 80-90% on workloads with repeated context. Semantic caching can eliminate 20-40% of API calls entirely. Both are available today with minimal engineering effort.


The four types of LLM caching

Before diving into implementation, it helps to understand that "caching" in the LLM context refers to four different things that operate at different layers.

Type 1: Prompt caching (provider-side)

OpenAI and Anthropic cache the KV (key-value) representations of your input tokens on their infrastructure. When you send a request that starts with the same prefix as a previous request, the model skips recomputing those tokens and serves the cached KV state.

Pricing impact:

  • OpenAI: cached input tokens cost 50% less than standard
  • Anthropic: cached input tokens cost 90% less (cache writes are 25% more, but amortize quickly)

Requirements:

  • The cached prefix must be at least 1,024 tokens on both OpenAI and Anthropic (Anthropic raises the minimum to 2,048 for Haiku models)
  • Requests must arrive within the cache TTL (5 minutes by default for Anthropic; OpenAI's automatic cache typically persists for 5-10 minutes)
  • The prefix must appear at the start of the message, before any variable content

Best use cases: RAG systems with a fixed document corpus, chatbots with a long system prompt, API wrappers that prepend the same instructions to every call.

Type 2: Response caching (application-side)

You cache the LLM's complete response at the application layer — in Redis, Memcached, or a database. When an identical or near-identical request comes in, serve the cached response without hitting the API.

Pricing impact: 100% cost savings for cache hits — you pay nothing for cached responses.

Requirements: Deterministic or near-deterministic request patterns. Works best with temperature=0 or very low temperature settings.

Best use cases: FAQ answering, product description generation from fixed templates, standard contract clause lookup.

Type 3: Semantic caching

Semantic caching extends response caching with vector similarity — instead of requiring exact input matches, it returns cached responses for semantically equivalent queries.

How it works: Every incoming query is embedded and compared against the embeddings of previously cached queries. If the best match exceeds a similarity threshold (typically around 0.95 cosine similarity), return its cached response. Otherwise, call the API and cache the new response.

Tools: GPTCache, LangChain's caching layer, Portkey's semantic cache, or a custom implementation with Pinecone/pgvector.

Pricing impact: Typically reduces API calls by 15-40% for user-facing applications with overlapping query patterns.
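
The matching loop described above can be sketched as a minimal in-memory cache. The `embed` function, threshold, and class name here are placeholders — a production system would use a real embedding model and a vector index such as pgvector or Pinecone rather than a linear scan:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[k] * b.get(k, 0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    """Toy semantic cache: `embed` is any function mapping text to a vector."""

    def __init__(self, embed, threshold=0.95):
        self.embed = embed
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def get(self, query: str):
        q = self.embed(query)
        best, best_sim = None, 0.0
        for emb, response in self.entries:  # linear scan; use a vector index at scale
            sim = cosine(q, emb)
            if sim > best_sim:
                best, best_sim = response, sim
        return best if best_sim >= self.threshold else None

    def put(self, query: str, response: str):
        self.entries.append((self.embed(query), response))
```

The character-count embedding used in testing is purely illustrative; swap in a sentence-embedding model to get genuinely semantic matches.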

Type 4: KV cache management (self-hosted)

For teams running their own models (on Ollama, vLLM, or similar), the KV cache is the GPU memory that stores attention keys and values. Tuning KV cache size and prefix caching settings at the vLLM level directly affects throughput and effective cost per query.

This is only relevant for self-hosted deployments. For managed APIs, the provider handles KV cache for you.
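
As one example, vLLM exposes prefix caching and the KV-cache memory budget as server flags (flag names current as of recent vLLM releases — check your version's docs):

```shell
# Serve a model with automatic prefix caching enabled;
# repeated prompt prefixes reuse cached KV blocks in GPU memory.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --enable-prefix-caching \
    --gpu-memory-utilization 0.90 \
    --max-model-len 8192
```

Raising --gpu-memory-utilization gives the KV cache more room (more concurrent sequences, more cacheable prefixes) at the cost of headroom for spikes.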


Implementing prompt caching with Anthropic

Anthropic's cache_control mechanism is explicit — you mark which parts of your prompt to cache.

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a helpful assistant for analyzing financial documents.",
        },
        {
            "type": "text",
            "text": long_document_context,  # 10,000+ tokens
            "cache_control": {"type": "ephemeral"},  # Mark for caching
        }
    ],
    messages=[{"role": "user", "content": user_question}]
)

On the second call with the same long_document_context, the cached tokens cost 90% less. A 10,000-token context repeated 1,000 times costs $30 without caching and $3 with caching (at Claude Sonnet pricing).
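
As a sanity check on those numbers, the amortization can be computed directly, including the 25% cache-write premium (assuming $3/MTok Sonnet input pricing and Anthropic's published 1.25× write / 0.10× read multipliers):

```python
def cached_context_cost(tokens, calls, price_per_mtok,
                        write_mult=1.25, read_mult=0.10):
    """Dollar cost of sending the same cached context `calls` times."""
    per_call = tokens * price_per_mtok / 1_000_000
    # First call writes the cache at a premium; the rest read at a discount.
    return per_call * write_mult + (calls - 1) * per_call * read_mult

uncached = 1000 * 10_000 * 3.0 / 1_000_000       # $30.00
cached = cached_context_cost(10_000, 1000, 3.0)  # ~$3.03
```

So the $30 → roughly $3 figure holds even after accounting for the write premium, which is amortized after the first read.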


Implementing prompt caching with OpenAI

OpenAI's caching is automatic — no configuration required. As long as your prompt prefix is identical across requests and long enough (≥1,024 tokens), caching activates automatically. You can verify cache usage by checking the usage.prompt_tokens_details.cached_tokens field in the response.

To maximize cache hits with OpenAI:

  1. Put all static content at the beginning of your messages array
  2. Put variable content (user query) at the end
  3. Use consistent system prompts across your application
  4. Don't randomize or timestamp static content
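
Putting the 50% discount into numbers — a small hypothetical helper that computes effective input cost from the cached_tokens figure reported in usage.prompt_tokens_details (the $2.50/MTok price below is an assumed example rate):

```python
def effective_input_cost(prompt_tokens, cached_tokens, price_per_mtok,
                         cached_discount=0.5):
    """Dollar cost of one request's input, given how many tokens hit the cache."""
    uncached = prompt_tokens - cached_tokens
    cached_price = price_per_mtok * (1 - cached_discount)
    return (uncached * price_per_mtok + cached_tokens * cached_price) / 1_000_000

# e.g. a 10,000-token prompt where 8,192 tokens were served from cache
cost = effective_input_cost(10_000, 8_192, price_per_mtok=2.50)
```

With those numbers the request's input costs about $0.0148 instead of $0.025 — a 41% saving on that call, because cache hits apply only to the static prefix.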


Response caching in practice

A simple Redis-based response cache for LLM calls:

import hashlib, json, redis

r = redis.Redis()
CACHE_TTL = 86400  # 24 hours

def cached_llm_call(prompt: str, model: str, **kwargs) -> str:
    # Key on everything that affects the output (temperature, max_tokens, ...),
    # not just the prompt — otherwise different settings collide on one entry
    key_material = json.dumps({"model": model, "prompt": prompt, **kwargs}, sort_keys=True)
    cache_key = hashlib.sha256(key_material.encode()).hexdigest()
    cached = r.get(cache_key)
    if cached:
        return json.loads(cached)

    response = call_llm_api(prompt, model, **kwargs)  # your provider wrapper
    r.setex(cache_key, CACHE_TTL, json.dumps(response))
    return response

This is most effective at temperature=0. With temperature>0, responses are non-deterministic, so serving a cached response removes the variation users might otherwise expect — use response caching only for tasks where output consistency is acceptable or desirable.


Which caching strategy to use when

Use Case                      | Best Strategy                        | Expected Savings
RAG with fixed knowledge base | Prompt caching                       | 50-90% input cost
FAQ chatbot                   | Response caching + semantic cache    | 40-80% total calls
Long system prompt bot        | Prompt caching                       | 60-85% input cost
Document Q&A                  | Prompt caching (document as context) | Up to 90%
User analytics enrichment     | Response caching                     | 20-60% depending on overlap
General chat                  | Prompt caching for system prompt     | 30-50% input cost


The bottom line

For most production LLM applications, combining prompt caching (for static context) with semantic caching (for recurring queries) is the highest-ROI optimization available. Teams that implement both typically cut API costs by 50-70% within a week of implementation.

Start with prompt caching — it requires zero infrastructure changes and delivers immediate savings. Add semantic caching once you have usage data showing recurring query patterns. Use the LLMversus cost calculator to model your specific savings before and after implementation.
