Claude API Rate Limits: Complete Guide for 2026
If you are building applications with the Anthropic Claude API, understanding rate limits is critical. Hit a rate limit at the wrong time and your application stalls, users see errors, and your queue backs up. This guide covers every rate limit tier, how to detect when you are approaching limits, and proven strategies for handling them gracefully.
How Claude API Rate Limits Work
Anthropic enforces rate limits on the Claude API using three dimensions simultaneously:
| Dimension | What It Measures | How It Resets |
|---|---|---|
| Requests per minute (RPM) | Number of API calls | Rolling 1-minute window |
| Input tokens per minute (ITPM) | Tokens sent to the API | Rolling 1-minute window |
| Output tokens per minute (OTPM) | Tokens generated by Claude | Rolling 1-minute window |
You hit a rate limit when any one of these three dimensions is exceeded. This means even if you are well under your RPM limit, sending a few very long prompts can max out your input token limit.
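Because a single oversized prompt can exhaust your ITPM budget on its own, it helps to know roughly how many input tokens a request will consume before you send it. A minimal sketch, assuming the token counting endpoint available in recent versions of the Python SDK:

```python
import anthropic

client = anthropic.Anthropic()

# Count the input tokens a request would use without actually sending it
count = client.messages.count_tokens(
    model="claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": "A very long prompt..."}],
)
print(count.input_tokens)  # compare against your remaining ITPM budget
```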
Rate Limit Tiers
Anthropic uses a tiered system based on your account's usage history and spend. As of early 2026, the tiers are structured as follows:
Tier 1 (New Accounts)
| Model | RPM | Input TPM | Output TPM |
|---|---|---|---|
| Claude Opus 4 | 50 | 20,000 | 4,000 |
| Claude Sonnet 4 | 50 | 40,000 | 8,000 |
| Claude Haiku 3.5 | 50 | 50,000 | 10,000 |
Tier 2
| Model | RPM | Input TPM | Output TPM |
|---|---|---|---|
| Claude Opus 4 | 1,000 | 80,000 | 16,000 |
| Claude Sonnet 4 | 1,000 | 160,000 | 32,000 |
| Claude Haiku 3.5 | 2,000 | 200,000 | 40,000 |
Tier 3
| Model | RPM | Input TPM | Output TPM |
|---|---|---|---|
| Claude Opus 4 | 2,000 | 400,000 | 80,000 |
| Claude Sonnet 4 | 2,000 | 800,000 | 160,000 |
| Claude Haiku 3.5 | 4,000 | 1,000,000 | 200,000 |
Tier 4 (High Volume)
| Model | RPM | Input TPM | Output TPM |
|---|---|---|---|
| Claude Opus 4 | 4,000 | 2,000,000 | 400,000 |
| Claude Sonnet 4 | 4,000 | 4,000,000 | 800,000 |
| Claude Haiku 3.5 | 8,000 | 5,000,000 | 1,000,000 |
Note: Exact numbers may vary. Anthropic adjusts these limits periodically and may offer custom limits for enterprise accounts. Always check the official Anthropic documentation for the most current figures.
How to Check Your Current Tier
You can check your tier and current limits in the Anthropic Console under Settings > Limits. Your tier automatically upgrades as your account accumulates spend over time.
Rate Limit Response Headers
Every API response from Claude includes headers that tell you exactly where you stand relative to your limits:
anthropic-ratelimit-requests-limit: 1000
anthropic-ratelimit-requests-remaining: 998
anthropic-ratelimit-requests-reset: 2026-02-06T12:01:00Z
anthropic-ratelimit-tokens-limit: 160000
anthropic-ratelimit-tokens-remaining: 145230
anthropic-ratelimit-tokens-reset: 2026-02-06T12:01:00Z
| Header | Meaning |
|---|---|
| `anthropic-ratelimit-requests-limit` | Your RPM limit |
| `anthropic-ratelimit-requests-remaining` | Requests left in the current window |
| `anthropic-ratelimit-requests-reset` | When the request counter resets |
| `anthropic-ratelimit-tokens-limit` | Your token-per-minute limit |
| `anthropic-ratelimit-tokens-remaining` | Tokens remaining in the current window |
| `anthropic-ratelimit-tokens-reset` | When the token counter resets |
Reading Headers in Code
import anthropic

client = anthropic.Anthropic()

# Use with_raw_response so the HTTP headers are exposed alongside the parsed message
raw = client.messages.with_raw_response.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello, Claude!"}],
)
message = raw.parse()  # the usual Message object

# Access rate limit info from the response headers
print(f"Requests remaining: {raw.headers.get('anthropic-ratelimit-requests-remaining')}")
print(f"Tokens remaining: {raw.headers.get('anthropic-ratelimit-tokens-remaining')}")
print(f"Resets at: {raw.headers.get('anthropic-ratelimit-requests-reset')}")
What Happens When You Hit a Rate Limit
When you exceed any rate limit dimension, the API returns a 429 Too Many Requests response:
{
  "type": "error",
  "error": {
    "type": "rate_limit_error",
    "message": "Number of request tokens has exceeded your per-minute rate limit (https://docs.anthropic.com/en/api/rate-limits); see the response headers for current usage. Please reduce the prompt length or the number of messages, and try again. You may also contact sales at https://www.anthropic.com/contact-sales to discuss your options for a rate limit increase."
  }
}
The response also includes a retry-after header indicating how many seconds to wait before retrying.
Retry Strategies
Basic Exponential Backoff
The simplest approach is to retry with exponentially increasing delays:
import time
import anthropic
client = anthropic.Anthropic()
def call_claude_with_retry(messages, max_retries=5):
    for attempt in range(max_retries):
        try:
            response = client.messages.create(
                model="claude-sonnet-4-20250514",
                max_tokens=1024,
                messages=messages,
            )
            return response
        except anthropic.RateLimitError:
            if attempt == max_retries - 1:
                raise
            wait_time = (2 ** attempt) + 0.5  # 1.5s, 2.5s, 4.5s, 8.5s, 16.5s
            print(f"Rate limited. Retrying in {wait_time}s...")
            time.sleep(wait_time)
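One refinement worth considering: if many workers hit the limit together, plain exponential backoff makes them all retry at the same instant. Adding random jitter spreads the retries out. A small sketch of a jittered delay you could swap into the function above:

```python
import random

def backoff_with_jitter(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    # Full jitter: sleep for a random duration between 0 and min(cap, base * 2**attempt)
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```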
Using the `retry-after` Header
A better approach reads the retry-after header from the 429 response:
import time
import anthropic
client = anthropic.Anthropic()
def call_claude_with_retry_after(messages, max_retries=5):
    for attempt in range(max_retries):
        try:
            response = client.messages.create(
                model="claude-sonnet-4-20250514",
                max_tokens=1024,
                messages=messages,
            )
            return response
        except anthropic.RateLimitError as e:
            if attempt == max_retries - 1:
                raise
            # Use the retry-after header if available, otherwise exponential backoff
            err_response = getattr(e, "response", None)
            if err_response is not None and err_response.headers.get("retry-after"):
                wait_time = int(err_response.headers["retry-after"])
            else:
                wait_time = 2 ** attempt
            print(f"Rate limited. Waiting {wait_time}s before retry {attempt + 1}...")
            time.sleep(wait_time)
Token-Aware Request Queuing
For production systems handling many concurrent requests, implement a token-aware queue that tracks the rate limit headers and pauses when the budget runs low. The sketch below uses the async client and `with_raw_response` so the headers stay accessible:
import asyncio
import time
from dataclasses import dataclass
from datetime import datetime

import anthropic

@dataclass
class RateLimitState:
    requests_remaining: int = 1000
    tokens_remaining: int = 160000
    reset_time: float = 0.0  # epoch seconds when the current window resets

class TokenAwareQueue:
    def __init__(self, client: anthropic.AsyncAnthropic):
        self.client = client
        self.state = RateLimitState()
        self.lock = asyncio.Lock()

    async def call(self, messages, estimated_tokens=500):
        async with self.lock:
            # Wait for the window to reset if we are close to the token limit
            if self.state.tokens_remaining < estimated_tokens:
                wait_time = max(0, self.state.reset_time - time.time())
                if wait_time > 0:
                    await asyncio.sleep(wait_time)

            # with_raw_response exposes the HTTP headers alongside the parsed message
            raw = await self.client.messages.with_raw_response.create(
                model="claude-sonnet-4-20250514",
                max_tokens=1024,
                messages=messages,
            )

            # Update state from the response headers
            headers = raw.headers
            self.state.requests_remaining = int(
                headers.get("anthropic-ratelimit-requests-remaining", 0)
            )
            self.state.tokens_remaining = int(
                headers.get("anthropic-ratelimit-tokens-remaining", 0)
            )
            reset_at = headers.get("anthropic-ratelimit-tokens-reset")
            if reset_at:
                self.state.reset_time = datetime.fromisoformat(
                    reset_at.replace("Z", "+00:00")
                ).timestamp()

            return raw.parse()
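A minimal usage sketch, assuming the async client; the token estimate is a rough guess that you could refine with the pre-flight count shown earlier:

```python
import asyncio
import anthropic

async def main():
    queue = TokenAwareQueue(anthropic.AsyncAnthropic())
    message = await queue.call(
        [{"role": "user", "content": "Summarize this paragraph..."}],
        estimated_tokens=800,
    )
    print(message.content[0].text)

asyncio.run(main())
```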
Best Practices for Staying Under Rate Limits
1. Use the Right Model for the Job
Do not use Claude Opus for tasks that Claude Haiku can handle. Haiku has higher rate limits and is significantly cheaper:
| Task | Recommended Model |
|---|---|
| Simple classification | Haiku 3.5 |
| Summarization | Sonnet 4 |
| Code generation | Sonnet 4 |
| Complex reasoning | Opus 4 |
| Quick extraction | Haiku 3.5 |
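If your pipeline mixes task types, a small routing table keeps cheap tasks on Haiku automatically. A minimal sketch; the task labels and model ID strings are illustrative, so check your console for the exact model names available to your account:

```python
# Illustrative task-to-model routing; adjust task names and model IDs to your setup
MODEL_FOR_TASK = {
    "classification": "claude-3-5-haiku-latest",
    "extraction": "claude-3-5-haiku-latest",
    "summarization": "claude-sonnet-4-20250514",
    "code": "claude-sonnet-4-20250514",
    "reasoning": "claude-opus-4-20250514",
}

def pick_model(task: str) -> str:
    # Fall back to Sonnet when the task type is unknown
    return MODEL_FOR_TASK.get(task, "claude-sonnet-4-20250514")
```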
2. Reduce Input Token Usage
- Trim system prompts. Every request sends your system prompt. Cut unnecessary instructions.
- Use conversation summaries. Instead of sending entire conversation histories, summarize older messages (a sketch follows the example below).
- Limit context. Only include the context the model actually needs.
# Bad: Sending entire file content for a simple question
messages = [{"role": "user", "content": f"What language is this file? {entire_10000_line_file}"}]
# Good: Send only what's needed
messages = [{"role": "user", "content": f"What language is this file? First 20 lines:\n{first_20_lines}"}]
3. Batch Requests Strategically
If you need to process 100 items, do not fire 100 simultaneous requests. Instead, batch them with concurrency limits:
import asyncio

async def process_batch(items, max_concurrent=5):
    # Cap the number of requests in flight at any one time
    semaphore = asyncio.Semaphore(max_concurrent)

    async def process_one(item):
        async with semaphore:
            # call_claude is your async wrapper around client.messages.create
            return await call_claude(item)

    results = await asyncio.gather(*[process_one(item) for item in items])
    return results
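A usage sketch; here call_claude stands in for whatever async wrapper you use around the Messages API, and the prompts are illustrative:

```python
import asyncio

items = [f"Summarize document {i}" for i in range(100)]
results = asyncio.run(process_batch(items, max_concurrent=5))
```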
4. Use the Message Batches API
For non-time-sensitive workloads, Anthropic's Message Batches API lets you submit up to 10,000 requests in a single batch. Batch requests have separate, much higher limits and are processed within 24 hours at a 50% discount.
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"request-{i}",
            "params": {
                "model": "claude-sonnet-4-20250514",
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        for i, prompt in enumerate(prompts)
    ]
)
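Batches are processed asynchronously, so you poll until processing ends and then read the per-request results. A rough sketch assuming the batches methods in the current Python SDK; field names may differ slightly between versions, so verify against the SDK reference:

```python
import time

# Poll until the batch has finished processing
while True:
    status = client.messages.batches.retrieve(batch.id)
    if status.processing_status == "ended":
        break
    time.sleep(60)

# Stream per-request results; each entry carries the custom_id you supplied
for entry in client.messages.batches.results(batch.id):
    if entry.result.type == "succeeded":
        print(entry.custom_id, entry.result.message.content[0].text)
```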
5. Cache Repeated Requests
If multiple users ask similar questions, cache the responses:
import hashlib
import json
def get_cache_key(messages, model):
    content = json.dumps({"messages": messages, "model": model}, sort_keys=True)
    return hashlib.sha256(content.encode()).hexdigest()
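A minimal in-memory cache wrapped around the call, using the key function above; a production deployment would more likely use Redis or another shared store with a TTL:

```python
import anthropic

client = anthropic.Anthropic()
_cache = {}

def cached_call(messages, model="claude-sonnet-4-20250514"):
    # Reuse a previous response when the exact same request has been seen before
    key = get_cache_key(messages, model)
    if key not in _cache:
        _cache[key] = client.messages.create(
            model=model,
            max_tokens=1024,
            messages=messages,
        )
    return _cache[key]
```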
6. Use Prompt Caching
Anthropic supports prompt caching for system prompts and long context. Cached tokens do not count toward your input token rate limit on subsequent requests and cost 90% less:
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "Your very long system prompt here...",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{"role": "user", "content": "Your question"}]
)
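To confirm the cache is actually being hit, you can inspect the usage block on the response; on a warm request the cached portion shows up as cache read tokens. Field names as documented for prompt caching, but double-check against the current API reference:

```python
usage = response.usage
print(usage.cache_creation_input_tokens)  # tokens written to the cache (first request)
print(usage.cache_read_input_tokens)      # tokens served from the cache (later requests)
print(usage.input_tokens)                 # uncached input tokens
```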
Monitoring Rate Limit Usage
For production systems, log your rate limit headers and set up alerts:
- Alert at 80% usage to give yourself time to react
- Track patterns to identify peak hours
- Monitor by model since each model has independent limits
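A small helper that turns those headers into a warning once usage crosses the 80% threshold mentioned above; the logging setup is illustrative, so wire it into whatever alerting you already use:

```python
import logging

logger = logging.getLogger("claude.ratelimits")

def check_rate_limit_headers(headers, threshold=0.8):
    # Warn when more than `threshold` of the request or token budget is consumed
    for kind in ("requests", "tokens"):
        limit = int(headers.get(f"anthropic-ratelimit-{kind}-limit", 0))
        remaining = int(headers.get(f"anthropic-ratelimit-{kind}-remaining", 0))
        if limit and (limit - remaining) / limit >= threshold:
            usage_pct = 100 * (limit - remaining) / limit
            logger.warning("%s usage at %.0f%% of limit", kind, usage_pct)
```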
When to Request a Rate Limit Increase
If you consistently hit limits despite optimization, contact Anthropic sales for a custom plan. Be prepared with:
- Your current usage patterns (RPM, TPM)
- Expected growth over the next 3-6 months
- Your use case description
Building AI Applications at Scale
Rate limits are one piece of the puzzle when building production AI applications. If your project involves media generation (images, video, audio, avatars) alongside text generation, consider using a unified API platform like Hypereal AI that handles rate limiting, queuing, and retries across multiple AI models, so you can focus on your application logic instead of infrastructure.
Summary
Managing Claude API rate limits comes down to three principles: know your limits (check headers), use tokens efficiently (right model, minimal context), and handle 429 errors gracefully (exponential backoff with retry-after). Implement these strategies and your application will stay reliable even under heavy load.
