LLM API Rate Limits Explained: Tokens, Requests, and How to Scale
Quick answer: LLM API rate limits operate on three dimensions — requests per minute (RPM), tokens per minute (TPM), and requests per day (RPD). The most commonly hit limit in production is TPM, not RPM. The fastest fix is exponential backoff with jitter; the long-term fix is concurrency management and provider tier upgrades.
What rate limits actually measure
LLM API rate limits measure three things, and most developers only track one:
Requests per minute (RPM): The number of API calls you can make per minute regardless of token count. A common trap: you might be far below your RPM limit while hitting your TPM limit if you're sending large prompts.
Tokens per minute (TPM): The total number of tokens (input + output) processed per minute. This is the limit most production systems hit first. One large RAG query with 8,000 input tokens and 500 output tokens uses 8,500 tokens against your TPM limit — as much as 17 typical 500-token queries.
Requests per day (RPD): A daily cap on total requests, more commonly applied at free and low tiers. Enterprise tiers typically don't have RPD limits.
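To see which of these dimensions binds first for your workload, compare the request rate the TPM budget allows at your average request size against the raw RPM cap. A minimal sketch (the limit values are illustrative):

```python
def binding_limit(avg_tokens_per_request: int, rpm_limit: int, tpm_limit: int) -> str:
    """Return which limit a steady stream of requests exhausts first."""
    # Max requests/minute the token budget allows at this request size
    tpm_capped_rpm = tpm_limit // avg_tokens_per_request
    return "TPM" if tpm_capped_rpm < rpm_limit else "RPM"

# A 5,000-token RAG query against 500 RPM / 30,000 TPM:
# 30,000 / 5,000 = 6 requests/minute — TPM binds long before RPM.
print(binding_limit(5_000, rpm_limit=500, tpm_limit=30_000))  # TPM
# Tiny 50-token requests exhaust the request cap first instead.
print(binding_limit(50, rpm_limit=500, tpm_limit=30_000))     # RPM
```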
2026 rate limits by provider and tier
OpenAI (GPT-4o)
- Free/Tier 1: 500 RPM, 30,000 TPM, 200 RPD
- Tier 2: 5,000 RPM, 450,000 TPM
- Tier 3: 10,000 RPM, 800,000 TPM
- Tier 4: 20,000 RPM, 2M TPM
- Tier 5 / Enterprise: Custom
Anthropic (Claude Sonnet 4)
- Build tier: 50 RPM, 50,000 TPM
- Scale tier: 1,000 RPM, 400,000 TPM
- Enterprise: Custom, typically higher
Google (Gemini 2.0 Flash)
- Free tier: 15 RPM, 1M TPM, 1,500 RPD
- Pay-as-you-go: 2,000 RPM, no TPM limit stated
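The tables above can be encoded as data to check whether a planned workload fits a given tier before you commit to it. A sketch using the OpenAI numbers from the table (illustrative, not an official API):

```python
TIERS = {  # (RPM, TPM) pairs copied from the OpenAI table above
    "tier1": (500, 30_000),
    "tier2": (5_000, 450_000),
    "tier3": (10_000, 800_000),
    "tier4": (20_000, 2_000_000),
}

def fits(tier: str, rpm_needed: int, tokens_per_req: int) -> bool:
    """True if the workload stays under both the RPM and TPM caps."""
    rpm, tpm = TIERS[tier]
    return rpm_needed <= rpm and rpm_needed * tokens_per_req <= tpm

# 100 req/min at 5,000 tokens each = 500,000 TPM → needs Tier 3 or above
print([t for t in TIERS if fits(t, 100, 5_000)])  # ['tier3', 'tier4']
```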
Why you're hitting rate limits (and how to diagnose)
Before implementing fixes, diagnose which limit you're hitting:
- Check the error response — both OpenAI and Anthropic return error codes that specify which limit was exceeded (`rate_limit_exceeded` with detail about RPM vs TPM)
- Log your tokens per request — if your average request uses 5,000 tokens, you'll hit 200,000 TPM at just 40 RPM
- Check for request storms — multiple parallel features hitting the API simultaneously can spike TPM unexpectedly
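OpenAI also reports remaining quota in `x-ratelimit-*` response headers, which lets you diagnose the binding limit before you ever see a 429. A sketch that classifies which budget is closest to exhaustion (header names are OpenAI's; Anthropic uses its own `anthropic-ratelimit-*` scheme):

```python
def diagnose(headers: dict) -> str:
    """Return whether the token or request budget is nearer exhaustion,
    based on OpenAI-style x-ratelimit-* response headers."""
    rem_req = int(headers.get("x-ratelimit-remaining-requests", 0))
    lim_req = int(headers.get("x-ratelimit-limit-requests", 1))
    rem_tok = int(headers.get("x-ratelimit-remaining-tokens", 0))
    lim_tok = int(headers.get("x-ratelimit-limit-tokens", 1))
    # Compare the fraction of each budget still available
    return "TPM" if rem_tok / lim_tok < rem_req / lim_req else "RPM"

headers = {"x-ratelimit-limit-requests": "500",
           "x-ratelimit-remaining-requests": "480",
           "x-ratelimit-limit-tokens": "30000",
           "x-ratelimit-remaining-tokens": "1200"}
print(diagnose(headers))  # TPM — 4% of tokens left vs 96% of requests
```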
Handling rate limits in production
Exponential backoff with jitter (minimum viable solution)
When you receive a 429 error, wait and retry. The key is adding random jitter to prevent all instances from retrying simultaneously:
import time, random
from openai import RateLimitError  # Anthropic's SDK raises its own RateLimitError

def call_with_retry(func, max_retries=5):
    for attempt in range(max_retries):
        try:
            return func()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries — surface the error to the caller
            # Exponential backoff (1s, 2s, 4s, ...) plus random jitter
            wait = (2 ** attempt) + random.uniform(0, 1)
            time.sleep(wait)
Request queuing with concurrency limits
For sustained high volume, implement a queue with a configurable concurrency limit that keeps you below your TPM limit:
import asyncio
from asyncio import Semaphore

sem = Semaphore(10)  # Max 10 concurrent requests

async def throttled_call(prompt):
    # call_llm_api is your async client wrapper (e.g. an AsyncOpenAI call)
    async with sem:
        return await call_llm_api(prompt)
Token-aware throttling
Count tokens before sending and track your rolling TPM. Use tiktoken (OpenAI) or Anthropic's token counting endpoint to estimate token usage before requests.
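One way to sketch this: keep a sliding 60-second window of token usage and block until there is headroom for the next request's estimated tokens. The token estimate itself would come from tiktoken or the provider's counting endpoint; this class only enforces the budget:

```python
import time
from collections import deque

class TokenBudget:
    """Sliding-window TPM throttle: acquire() blocks until the last
    60 seconds of usage leave room for the next request's tokens."""
    def __init__(self, tpm_limit: int):
        self.tpm_limit = tpm_limit
        self.events = deque()  # (timestamp, tokens) pairs

    def _used(self, now: float) -> int:
        # Drop events older than the 60-second window, sum the rest
        while self.events and now - self.events[0][0] > 60:
            self.events.popleft()
        return sum(tok for _, tok in self.events)

    def acquire(self, tokens: int) -> None:
        while self._used(time.monotonic()) + tokens > self.tpm_limit:
            time.sleep(0.25)  # wait for old usage to age out of the window
        self.events.append((time.monotonic(), tokens))

budget = TokenBudget(tpm_limit=30_000)
budget.acquire(8_500)  # record the large RAG query from earlier
```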
Strategies for scaling beyond rate limits
1. Upgrade your tier: OpenAI's tiers auto-upgrade based on cumulative spend. Spend $500 to unlock Tier 2, $1,000 for Tier 3, etc. This is the simplest path.
2. Multi-provider routing: Use multiple API keys across providers. Route to Anthropic when OpenAI limits are hit, and vice versa. Tools like LiteLLM make this transparent.
3. Request a limit increase: Both OpenAI and Anthropic have limit increase request forms for enterprise customers. Response time is typically 3-5 business days.
4. Reduce token usage per request: Shorter prompts, output length constraints, and prompt caching reduce your effective TPM utilization without changing concurrency.
5. Use the Batch API: Batch API requests don't count against your synchronous rate limits. Migrate eligible workloads to batch processing to free up limits for realtime traffic.
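The multi-provider routing in strategy 2 can be sketched as a simple ordered fallback: on a 429-style error, fall through to the next provider instead of waiting out a backoff. `call_openai` and `call_anthropic` are hypothetical wrappers around the real SDKs, and `RateLimitError` here is a stand-in for the SDK exception; LiteLLM's router implements this pattern for you in practice.

```python
class RateLimitError(Exception):
    """Stand-in for the provider SDK's 429 exception."""

def route_with_fallback(prompt, providers):
    last_err = None
    for name, call in providers:
        try:
            return name, call(prompt)
        except RateLimitError as err:
            last_err = err  # this provider is throttled; try the next one
    raise last_err  # every provider was throttled

# Hypothetical wrappers: call_openai is throttled, call_anthropic is not
def call_openai(p): raise RateLimitError()
def call_anthropic(p): return f"answer to {p!r}"

print(route_with_fallback("hi", [("openai", call_openai),
                                 ("anthropic", call_anthropic)]))
# ('anthropic', "answer to 'hi'")
```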
Rate limits and the fastest LLM APIs
The fastest LLM APIs by throughput (Groq, Gemini 2.0 Flash Lite, GPT-4.1 Nano) also tend to have higher effective TPM limits because their lower per-token compute allows providers to offer more throughput per tier. If latency and throughput are critical, these models are worth evaluating.