Claude API Rate Limits: Complete Guide (2026)

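If you are building an application on Anthropic's Claude API, understanding rate limits is essential. Hit one at the wrong moment and your application stalls, users see errors, and job queues back up. This guide covers every rate limit tier, how to tell when you are approaching a limit, and proven strategies for handling them.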

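How Claude API Rate Limits Work

Anthropic enforces rate limits on the Claude API across three dimensions at once:

Dimension	What it measures	Reset
Requests per minute (RPM)	Number of API calls	1-minute sliding window
Input tokens per minute (ITPM)	Tokens sent to the API	1-minute sliding window
Output tokens per minute (OTPM)	Tokens generated by Claude	1-minute sliding window

Exceeding any one of the three triggers a rate limit. So even when you are well under your RPM limit, a handful of very long prompts can exhaust your input token limit.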
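To make the interaction between the three dimensions concrete, here is a minimal sketch that works out how many requests per minute a workload can sustain and which dimension is the bottleneck. The function name `binding_limit` and the per-request token sizes are illustrative assumptions; the limits are the Tier 1 Claude Sonnet 4 figures quoted in the tier tables of this guide.

```python
def binding_limit(rpm, itpm, otpm, input_tokens, output_tokens):
    """Return (sustainable requests per minute, name of the limiting dimension).

    input_tokens and output_tokens are the assumed average sizes of one request.
    """
    candidates = {
        "RPM": rpm,
        "ITPM": itpm // input_tokens,   # requests that fit in the input budget
        "OTPM": otpm // output_tokens,  # requests that fit in the output budget
    }
    name = min(candidates, key=candidates.get)
    return candidates[name], name


# 50 RPM, 40,000 ITPM, 8,000 OTPM with 4,000-token prompts and 200-token replies:
# the input token budget runs out long before the request budget does.
print(binding_limit(50, 40_000, 8_000, input_tokens=4_000, output_tokens=200))
```

Running the same calculation with your own averages is a quick way to see which limit to optimize for first.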

Rate Limit Tiers

Anthropic uses a tiered system based on your account's usage history and spend. As of early 2026, the tiers are as follows:

Tier 1 (New Accounts)

Model	RPM	Input TPM	Output TPM
Claude Opus 4	50	20,000	4,000
Claude Sonnet 4	50	40,000	8,000
Claude Haiku 3.5	50	50,000	10,000

Tier 2

Model	RPM	Input TPM	Output TPM
Claude Opus 4	1,000	80,000	16,000
Claude Sonnet 4	1,000	160,000	32,000
Claude Haiku 3.5	2,000	200,000	40,000

Tier 3

Model	RPM	Input TPM	Output TPM
Claude Opus 4	2,000	400,000	80,000
Claude Sonnet 4	2,000	800,000	160,000
Claude Haiku 3.5	4,000	1,000,000	200,000

Tier 4 (High Concurrency)

Model	RPM	Input TPM	Output TPM
Claude Opus 4	4,000	2,000,000	400,000
Claude Sonnet 4	4,000	4,000,000	800,000
Claude Haiku 3.5	8,000	5,000,000	1,000,000

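Note: Exact values may differ. Anthropic adjusts these limits periodically and may offer custom limits to enterprise accounts. Always check Anthropic's official documentation for current figures.

How to Check Your Current Tier

You can see your tier and current limits in the Anthropic Console under Settings > Limits. Tiers upgrade automatically as your account accumulates spend.

Rate Limit Response Headers

Every API response from Claude includes headers that tell you where you stand relative to your limits: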

anthropic-ratelimit-requests-limit: 1000
anthropic-ratelimit-requests-remaining: 998
anthropic-ratelimit-requests-reset: 2026-02-06T12:01:00Z
anthropic-ratelimit-tokens-limit: 160000
anthropic-ratelimit-tokens-remaining: 145230
anthropic-ratelimit-tokens-reset: 2026-02-06T12:01:00Z

Header	Meaning
`anthropic-ratelimit-requests-limit`	Your RPM limit
`anthropic-ratelimit-requests-remaining`	Requests remaining in the current window
`anthropic-ratelimit-requests-reset`	When the request counter resets
`anthropic-ratelimit-tokens-limit`	Your tokens-per-minute limit
`anthropic-ratelimit-tokens-remaining`	Tokens remaining in the current window
`anthropic-ratelimit-tokens-reset`	When the token counter resets

Reading the Headers in Code

import anthropic

client = anthropic.Anthropic()

# with_raw_response returns the HTTP response, which exposes the headers
raw = client.messages.with_raw_response.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello, Claude!"}]
)
response = raw.parse()  # the usual Message object

# Read rate limit information from the response headers
print(f"Requests remaining: {raw.headers.get('anthropic-ratelimit-requests-remaining')}")
print(f"Tokens remaining: {raw.headers.get('anthropic-ratelimit-tokens-remaining')}")
print(f"Resets at: {raw.headers.get('anthropic-ratelimit-requests-reset')}")

What Happens When You Hit a Rate Limit

When you exceed the rate limit on any dimension, the API returns a 429 Too Many Requests response:

{
  "type": "error",
  "error": {
    "type": "rate_limit_error",
    "message": "Number of request tokens has exceeded your per-minute rate limit (https://docs.anthropic.com/en/api/rate-limits); see the response headers for current usage. Please reduce the prompt length or the number of messages, and try again. You may also contact sales at https://www.anthropic.com/contact-sales to discuss your options for a rate limit increase."
  }
}

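The response also includes a retry-after header indicating how many seconds to wait before retrying.

Retry Strategies

Basic Exponential Backoff

The simplest approach is to retry with exponentially increasing delays: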

import time
import anthropic

client = anthropic.Anthropic()

def call_claude_with_retry(messages, max_retries=5):
    for attempt in range(max_retries):
        try:
            response = client.messages.create(
                model="claude-sonnet-4-20250514",
                max_tokens=1024,
                messages=messages,
            )
            return response
        except anthropic.RateLimitError as e:
            if attempt == max_retries - 1:
                raise
            wait_time = (2 ** attempt) + 0.5  # 1.5s, 2.5s, 4.5s, 8.5s; the fifth failure re-raises
            print(f"Rate limited. Retrying in {wait_time}s...")
            time.sleep(wait_time)
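A fixed schedule like the one above means that many clients rate-limited at the same moment will all retry at the same moment too. A common refinement, not something this guide's original code does, is to randomize each delay ("full jitter") and cap it. The sketch below only computes the delays; the names `backoff_delays`, `base`, and `cap` are illustrative assumptions.

```python
import random


def backoff_delays(max_retries=5, base=1.0, cap=30.0, rng=random):
    """Yield one randomized delay per retry using capped full-jitter backoff."""
    for attempt in range(max_retries):
        # The ceiling doubles on each attempt but never exceeds the cap
        ceiling = min(cap, base * (2 ** attempt))
        # Pick a random delay between 0 and the ceiling
        yield rng.uniform(0, ceiling)


# A seeded generator makes the schedule reproducible for testing
for delay in backoff_delays(rng=random.Random(42)):
    print(f"would sleep {delay:.2f}s")
```

In a retry loop, each yielded value would replace the fixed `wait_time` before calling `time.sleep`.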

Using the `retry-after` Header

A better approach is to read the retry-after header from the 429 response:

import time
import anthropic

client = anthropic.Anthropic()

def call_claude_with_retry_after(messages, max_retries=5):
    for attempt in range(max_retries):
        try:
            response = client.messages.create(
                model="claude-sonnet-4-20250514",
                max_tokens=1024,
                messages=messages,
            )
            return response
        except anthropic.RateLimitError as e:
            if attempt == max_retries - 1:
                raise
            # Use the retry-after header when present; otherwise back off exponentially
            retry_after = e.response.headers.get('retry-after')
            if retry_after:
                wait_time = float(retry_after)
            else:
                wait_time = 2 ** attempt
            print(f"Rate limited. Waiting {wait_time}s before retry {attempt + 1}...")
            time.sleep(wait_time)

Token-Aware Request Queuing

For production systems handling many concurrent requests, implement a token-aware queue:

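import asyncio
import time
from dataclasses import dataclass
from datetime import datetime

@dataclass
class RateLimitState:
    requests_remaining: int = 1000
    tokens_remaining: int = 160000
    reset_time: float = 0.0  # Unix timestamp at which the token budget resets

class TokenAwareQueue:
    def __init__(self, client):
        # client is expected to be an anthropic.AsyncAnthropic instance
        self.client = client
        self.state = RateLimitState()
        self.lock = asyncio.Lock()

    async def call(self, messages, estimated_tokens=500):
        # Holding the lock across the API call serializes requests
        async with self.lock:
            # Wait for the reset if the remaining budget is too small
            if self.state.tokens_remaining < estimated_tokens:
                wait_time = max(0, self.state.reset_time - time.time())
                if wait_time > 0:
                    await asyncio.sleep(wait_time)

            # with_raw_response exposes the HTTP headers alongside the body
            raw = await self.client.messages.with_raw_response.create(
                model="claude-sonnet-4-20250514",
                max_tokens=1024,
                messages=messages,
            )

            # Update state from the response headers
            headers = raw.headers
            self.state.requests_remaining = int(
                headers.get('anthropic-ratelimit-requests-remaining', 0)
            )
            self.state.tokens_remaining = int(
                headers.get('anthropic-ratelimit-tokens-remaining', 0)
            )
            # The reset header is an RFC 3339 timestamp such as 2026-02-06T12:01:00Z
            reset = headers.get('anthropic-ratelimit-tokens-reset')
            if reset:
                self.state.reset_time = datetime.fromisoformat(
                    reset.replace('Z', '+00:00')
                ).timestamp()

            return raw.parse()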
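The queue above learns about the remaining budget only after each response comes back. A complementary approach is to keep a local budget that refills continuously and to check it before sending anything. The following is a minimal sketch of such a client-side token bucket; the class name `TokenBucket`, its methods, and the injectable `clock` are assumptions made for illustration, and nothing here is part of the Anthropic SDK.

```python
import time


class TokenBucket:
    """Client-side budget that refills at a steady rate up to a fixed capacity."""

    def __init__(self, capacity, refill_per_second, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock  # injectable so the bucket can be tested without sleeping
        self.tokens = float(capacity)
        self.updated = clock()

    def _refill(self):
        # Credit the tokens earned since the last update, never exceeding capacity
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated = now

    def try_consume(self, amount):
        """Take `amount` tokens if available; return True on success."""
        self._refill()
        if amount <= self.tokens:
            self.tokens -= amount
            return True
        return False

    def seconds_until(self, amount):
        """Return how long to wait before `amount` tokens will be available."""
        self._refill()
        deficit = max(0.0, amount - self.tokens)
        return deficit / self.refill_per_second


# Example: a 160,000 tokens-per-minute budget spread evenly over the minute
bucket = TokenBucket(capacity=160_000, refill_per_second=160_000 / 60)
print(bucket.try_consume(5_000))
```

A caller would sleep for `seconds_until(estimated_tokens)` when `try_consume` returns False, then try again.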

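Best Practices for Staying Within Rate Limits

1. Choose the Right Model for the Task

Don't send a task to Claude Opus that Claude Haiku can handle. Haiku has higher rate limits and is much cheaper:

Task	Recommended model
Simple classification	Haiku 3.5
Summarization	Sonnet 4
Code generation	Sonnet 4
Complex reasoning	Opus 4
Quick extraction	Haiku 3.5

2. Reduce Input Token Usage

Trim system prompts. Your system prompt is sent with every request. Cut unnecessary instructions.
Use conversation summaries. Instead of sending the full conversation history, summarize older messages.
Limit context. Include only the context the model actually needs.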

# Bad: sending the entire file for a simple question
messages = [{"role": "user", "content": f"What language is this file? {entire_10000_line_file}"}]

# Good: send only what is needed
messages = [{"role": "user", "content": f"What language is this file? First 20 lines:\n{first_20_lines}"}]
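The same idea applies to conversation history. As a rough sketch, the function below keeps only the most recent messages that fit a token budget. The helper names are hypothetical, message content is assumed to be a plain string, and the four-characters-per-token estimate is a crude assumption rather than a real tokenizer.

```python
def estimate_tokens(text):
    # Rough heuristic (assumption): about 4 characters per token for English text
    return max(1, len(text) // 4)


def trim_history(messages, budget):
    """Keep the most recent messages whose estimated tokens fit within `budget`."""
    kept, used = [], 0
    # Walk backwards from the newest message and stop at the first one that does not fit
    for message in reversed(messages):
        cost = estimate_tokens(message["content"])
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))


history = [
    {"role": "user", "content": "a" * 400},
    {"role": "assistant", "content": "b" * 400},
    {"role": "user", "content": "c" * 40},
]
print(len(trim_history(history, budget=120)))
```

A real implementation would also need to make sure the trimmed list still begins with a user turn, and could replace the dropped messages with a short summary instead of discarding them.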

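3. Batch Requests Strategically

If you need to process 100 items, don't fire off 100 requests at once. Instead, process them in batches with a concurrency limit: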

import asyncio

async def process_batch(items, max_concurrent=5):
    semaphore = asyncio.Semaphore(max_concurrent)

    async def process_one(item):
        async with semaphore:
            # call_claude is a placeholder for your own async wrapper around the API
            return await call_claude(item)

    results = await asyncio.gather(*[process_one(item) for item in items])
    return results

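4. Use the Message Batches API

For workloads that are not time-sensitive, Anthropic's Message Batches API lets you submit up to 10,000 requests in a single batch. Batch requests have separate, much higher limits, are processed within 24 hours, and come with a 50% discount.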

batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"request-{i}",
            "params": {
                "model": "claude-sonnet-4-20250514",
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": prompt}],
            }
        }
        for i, prompt in enumerate(prompts)
    ]
)
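Because a batch is processed asynchronously, the caller has to check back for the results. The sketch below polls until processing has ended and then returns the results. It assumes the SDK's `client.messages.batches.retrieve` and `client.messages.batches.results` methods and the `processing_status` value `"ended"`; the wrapper name `wait_for_batch`, the polling interval, and the injectable `sleep` are illustrative choices.

```python
import time


def wait_for_batch(client, batch_id, poll_seconds=60, sleep=time.sleep):
    """Poll until the batch has finished processing, then return its results."""
    while True:
        batch = client.messages.batches.retrieve(batch_id)
        # Processing is complete once the status reaches "ended"
        if batch.processing_status == "ended":
            return client.messages.batches.results(batch_id)
        sleep(poll_seconds)
```

With the `batch` object created earlier, usage would look like `results = wait_for_batch(client, batch.id)`, after which each result can be matched back to its request through `custom_id`.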

5. Cache Repeated Requests

If multiple users ask similar questions, cache the responses:

import hashlib
import json

def get_cache_key(messages, model):
    content = json.dumps({"messages": messages, "model": model}, sort_keys=True)
    return hashlib.sha256(content.encode()).hexdigest()
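The key function on its own does not store anything. As a minimal sketch of how it might be used, here is a small in-memory cache with a time-to-live. The class name `ResponseCache`, the `fetch` callback, and the 300-second default are assumptions for illustration; `get_cache_key` is repeated so the block runs on its own.

```python
import hashlib
import json
import time


def get_cache_key(messages, model):
    content = json.dumps({"messages": messages, "model": model}, sort_keys=True)
    return hashlib.sha256(content.encode()).hexdigest()


class ResponseCache:
    """In-memory cache keyed on the exact messages and model, with expiry."""

    def __init__(self, ttl_seconds=300, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._entries = {}  # key -> (expiry time, cached value)

    def get_or_call(self, messages, model, fetch):
        key = get_cache_key(messages, model)
        entry = self._entries.get(key)
        # Serve from the cache while the entry has not expired
        if entry and entry[0] > self.clock():
            return entry[1]
        # Otherwise call the API (via fetch) and remember the result
        value = fetch(messages, model)
        self._entries[key] = (self.clock() + self.ttl_seconds, value)
        return value
```

Note that a hash key only matches byte-identical requests. Questions that are merely similar would need normalization or semantic matching before the lookup, which is outside the scope of this sketch.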

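6. Use Prompt Caching

Anthropic supports prompt caching for system prompts and long contexts. Cached tokens do not count toward the input token rate limit on subsequent requests, and they cost 90% less: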

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "Your very long system prompt here...",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{"role": "user", "content": "Your question"}]
)

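Monitoring Rate Limit Usage

For production systems, log your rate limit headers and set up alerts:

Alert at 80% utilization so you have time to react
Track patterns to identify peak hours
Monitor per model, since each model has its own limits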
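As an illustration of the 80% alert, the sketch below turns the rate limit headers into a utilization fraction and returns a warning for each dimension over the threshold. The function names are hypothetical, and the sample values are the example headers shown earlier in this guide.

```python
def utilization(headers, kind):
    """Fraction of the per-minute budget already used; kind is 'requests' or 'tokens'."""
    limit = int(headers[f"anthropic-ratelimit-{kind}-limit"])
    remaining = int(headers[f"anthropic-ratelimit-{kind}-remaining"])
    return (limit - remaining) / limit


def check_alerts(headers, threshold=0.8):
    """Return a warning string for every dimension at or above the threshold."""
    warnings = []
    for kind in ("requests", "tokens"):
        used = utilization(headers, kind)
        if used >= threshold:
            warnings.append(f"{kind} usage at {used:.0%} of limit")
    return warnings


sample = {
    "anthropic-ratelimit-requests-limit": "1000",
    "anthropic-ratelimit-requests-remaining": "998",
    "anthropic-ratelimit-tokens-limit": "160000",
    "anthropic-ratelimit-tokens-remaining": "145230",
}
print(check_alerts(sample))  # no warnings: both dimensions are far below 80%
```

In production the returned warnings would go to your logging or alerting system, tagged with the model name so that each model's limits are tracked separately.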

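When to Request a Rate Limit Increase

If you keep hitting limits after optimizing, contact Anthropic sales for a custom plan. Have the following ready:

Your current usage pattern (RPM, TPM)
Expected growth over the next 3-6 months
A description of your use case

Building AI Applications at Scale

When building production-grade AI applications, rate limits are only one piece of the puzzle. If your project involves multimedia generation (images, video, audio, digital avatars) alongside text generation, consider a unified API platform such as Hypereal AI. It handles rate limiting, queuing, and retries across multiple AI models, so you can focus on business logic rather than infrastructure.

Summary

Managing Claude API rate limits comes down to three principles: know your limits (check the headers), use tokens efficiently (the right model, minimal context), and handle 429 errors gracefully (exponential backoff that honors retry-after). Apply these strategies and your application will stay reliable even under heavy load.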
