Grok 3 API Rate Limits: Complete Guide (2026)
Everything you need to know about xAI Grok 3 API rate limits and quotas
xAI's Grok 3 is a strong contender in the frontier model space, and their API provides OpenAI-compatible access to the full Grok model family. Whether you are building applications, running batch processing, or integrating Grok into your workflow, understanding the rate limits is essential to avoid disruptions and design reliable systems.
This guide covers every aspect of Grok 3 API rate limits, including tier breakdowns, headers, error handling, and practical strategies for working within the limits.
Grok 3 API Model Overview
Before diving into limits, here are the models available through the xAI API:
| Model | Context Window | Strengths | Use Case |
|---|---|---|---|
| `grok-3` | 131,072 tokens | Frontier reasoning, analysis | Complex tasks, research |
| `grok-3-fast` | 131,072 tokens | Faster responses, slightly lower quality | Real-time apps, chat |
| `grok-3-mini` | 131,072 tokens | Efficient, cost-effective | Simple tasks, high volume |
| `grok-3-mini-fast` | 131,072 tokens | Fastest, most affordable | Latency-sensitive apps |
All models share the same 131K context window, which is competitive with GPT-4o (128K) and Claude (200K).
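Before sending large prompts, it can help to sanity-check that they fit within that window. xAI's tokenizer is not assumed here, so this sketch uses a rough characters-per-token heuristic rather than an exact count:

```python
# Rough context-window check. The ~4 characters per token ratio is a
# heuristic assumption for English text, not xAI's actual tokenizer.
GROK_CONTEXT_WINDOW = 131_072

def fits_in_context(prompt: str, max_output_tokens: int = 2048) -> bool:
    """Estimate whether prompt plus expected output fits in the context window."""
    estimated_input_tokens = len(prompt) / 4
    return estimated_input_tokens + max_output_tokens <= GROK_CONTEXT_WINDOW
```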
Rate Limit Tiers
xAI uses a tiered rate limit system based on your account spending history. As you spend more, your limits automatically increase.
Free Tier
New accounts with an API key get a free monthly credit to explore the API:
| Limit Type | Free Tier |
|---|---|
| Monthly credit | $25 |
| Requests per minute | 5 |
| Requests per hour | 60 |
| Tokens per minute (input) | 15,000 |
| Tokens per minute (output) | 5,000 |
| Concurrent requests | 2 |
Tier 1 ($0+ spent)
Once you add a payment method and make any purchase:
| Limit Type | Tier 1 |
|---|---|
| Requests per minute | 60 |
| Requests per hour | 1,000 |
| Tokens per minute (input) | 100,000 |
| Tokens per minute (output) | 25,000 |
| Concurrent requests | 10 |
Tier 2 ($100+ spent)
| Limit Type | Tier 2 |
|---|---|
| Requests per minute | 200 |
| Requests per hour | 5,000 |
| Tokens per minute (input) | 500,000 |
| Tokens per minute (output) | 100,000 |
| Concurrent requests | 25 |
Tier 3 ($500+ spent)
| Limit Type | Tier 3 |
|---|---|
| Requests per minute | 500 |
| Requests per hour | 20,000 |
| Tokens per minute (input) | 1,000,000 |
| Tokens per minute (output) | 250,000 |
| Concurrent requests | 50 |
Tier 4 ($1,000+ spent)
| Limit Type | Tier 4 |
|---|---|
| Requests per minute | 1,000 |
| Requests per hour | 50,000 |
| Tokens per minute (input) | 2,000,000 |
| Tokens per minute (output) | 500,000 |
| Concurrent requests | 100 |
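If you throttle client-side, it helps to encode the tier tables above in one place. The sketch below mirrors the values in this guide and assumes you track your own cumulative spend; always confirm your live limits against the response headers described in the next section:

```python
# Client-side view of the tier limits from the tables above.
# These mirror this guide; trust the live x-ratelimit-* headers over them.
TIER_LIMITS = {
    "free":   {"rpm": 5,     "rph": 60,     "input_tpm": 15_000,    "output_tpm": 5_000,   "concurrent": 2},
    "tier_1": {"rpm": 60,    "rph": 1_000,  "input_tpm": 100_000,   "output_tpm": 25_000,  "concurrent": 10},
    "tier_2": {"rpm": 200,   "rph": 5_000,  "input_tpm": 500_000,   "output_tpm": 100_000, "concurrent": 25},
    "tier_3": {"rpm": 500,   "rph": 20_000, "input_tpm": 1_000_000, "output_tpm": 250_000, "concurrent": 50},
    "tier_4": {"rpm": 1_000, "rph": 50_000, "input_tpm": 2_000_000, "output_tpm": 500_000, "concurrent": 100},
}

def limits_for_account(total_spent_usd: float, has_payment_method: bool = True) -> dict:
    """Map account status to the tier limits in the tables above."""
    if not has_payment_method:
        return TIER_LIMITS["free"]
    if total_spent_usd >= 1_000:
        return TIER_LIMITS["tier_4"]
    if total_spent_usd >= 500:
        return TIER_LIMITS["tier_3"]
    if total_spent_usd >= 100:
        return TIER_LIMITS["tier_2"]
    return TIER_LIMITS["tier_1"]
```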
Rate Limit Headers
Every API response includes headers that tell you your current rate limit status:
```
x-ratelimit-limit-requests: 60
x-ratelimit-limit-tokens: 100000
x-ratelimit-remaining-requests: 58
x-ratelimit-remaining-tokens: 95234
x-ratelimit-reset-requests: 2026-02-06T12:00:32Z
x-ratelimit-reset-tokens: 2026-02-06T12:00:15Z
```
Header reference:
| Header | Description |
|---|---|
| `x-ratelimit-limit-requests` | Max requests per minute for your tier |
| `x-ratelimit-limit-tokens` | Max tokens per minute for your tier |
| `x-ratelimit-remaining-requests` | Requests remaining in the current window |
| `x-ratelimit-remaining-tokens` | Tokens remaining in the current window |
| `x-ratelimit-reset-requests` | When the request limit resets |
| `x-ratelimit-reset-tokens` | When the token limit resets |
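The reset headers are ISO 8601 timestamps, so you can compute exactly how long to pause when a window is exhausted. A minimal sketch using the header names documented above:

```python
from datetime import datetime, timezone

def seconds_until_reset(headers: dict) -> float:
    """Seconds until the request-rate window resets, from response headers."""
    reset_at = headers.get("x-ratelimit-reset-requests")
    if reset_at is None:
        return 0.0
    # fromisoformat() accepts the trailing "Z" on Python 3.11+;
    # replace it explicitly for older versions.
    reset_time = datetime.fromisoformat(reset_at.replace("Z", "+00:00"))
    return max((reset_time - datetime.now(timezone.utc)).total_seconds(), 0.0)
```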
Making Your First API Call
The Grok API uses the OpenAI-compatible format, making it easy to integrate with existing code:
```python
import openai

client = openai.OpenAI(
    api_key="YOUR_XAI_API_KEY",
    base_url="https://api.x.ai/v1"
)

response = client.chat.completions.create(
    model="grok-3",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain the PageRank algorithm."}
    ],
    temperature=0.7,
    max_tokens=2048
)

print(response.choices[0].message.content)
```
Handling Rate Limit Errors
When you exceed rate limits, the API returns a 429 Too Many Requests status code. Here is how to handle it properly:
```python
import openai
import time
from openai import RateLimitError

client = openai.OpenAI(
    api_key="YOUR_XAI_API_KEY",
    base_url="https://api.x.ai/v1"
)

def call_grok_with_retry(messages, model="grok-3", max_retries=5):
    """Call Grok API with exponential backoff on rate limits."""
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=messages,
                temperature=0.7,
                max_tokens=2048
            )
            return response.choices[0].message.content
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff: 1s, 2s, 4s, 8s (the final attempt re-raises)
            wait_time = 2 ** attempt
            print(f"Rate limited. Waiting {wait_time}s before retry...")
            time.sleep(wait_time)
        except Exception as e:
            print(f"Unexpected error: {e}")
            raise

# Usage
result = call_grok_with_retry([
    {"role": "user", "content": "Summarize the history of the internet."}
])
print(result)
```
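One refinement worth noting: when many workers retry in parallel, fixed exponential delays can synchronize, so all workers hit the API again at the same instants. Adding random jitter spreads retries out. A minimal sketch of an alternative wait calculation, a drop-in replacement for `wait_time` above:

```python
import random

def backoff_with_jitter(attempt, base=1.0, cap=30.0):
    """Full-jitter backoff: a random wait up to the exponential cap,
    so parallel workers do not retry in lockstep."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```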
Advanced: Rate Limiter Class
For production applications, use a rate limiter that tracks your usage proactively:
```python
import time
import threading
from collections import deque

class GrokRateLimiter:
    """Proactive sliding-window rate limiter for the Grok API."""

    def __init__(self, max_requests_per_minute=60, max_tokens_per_minute=100000):
        self.max_rpm = max_requests_per_minute
        self.max_tpm = max_tokens_per_minute
        self.request_timestamps = deque()
        self.token_counts = deque()  # parallel to request_timestamps
        self.lock = threading.Lock()

    def wait_if_needed(self, estimated_tokens=1000):
        """Block until the request can be made within rate limits."""
        with self.lock:
            while True:
                now = time.time()
                window_start = now - 60
                # Drop entries that have aged out of the 60-second window
                while self.request_timestamps and self.request_timestamps[0] < window_start:
                    self.request_timestamps.popleft()
                    self.token_counts.popleft()
                over_requests = len(self.request_timestamps) >= self.max_rpm
                over_tokens = sum(self.token_counts) + estimated_tokens > self.max_tpm
                if not (over_requests or over_tokens):
                    break
                # Sleep until the oldest entry leaves the window, then re-check
                sleep_time = self.request_timestamps[0] - window_start if self.request_timestamps else 1
                time.sleep(max(sleep_time, 0.1))
            # Record this request
            self.request_timestamps.append(time.time())
            self.token_counts.append(estimated_tokens)

# Usage
limiter = GrokRateLimiter(max_requests_per_minute=60, max_tokens_per_minute=100000)

prompts = ["Summarize RFC 2616.", "Explain the CAP theorem."]  # example inputs
for prompt in prompts:
    limiter.wait_if_needed(estimated_tokens=500)
    result = call_grok_with_retry([{"role": "user", "content": prompt}])
```
Grok 3 API Pricing
Rate limits are not the only consideration. Here is the pricing for each model:
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| `grok-3` | $3.00 | $15.00 |
| `grok-3-fast` | $5.00 | $25.00 |
| `grok-3-mini` | $0.30 | $0.50 |
| `grok-3-mini-fast` | $0.10 | $0.50 |
Note that grok-3-fast is more expensive than grok-3 despite being a faster, slightly lower-quality variant. For cost-sensitive applications, grok-3-mini and grok-3-mini-fast offer significant savings.
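To budget requests before sending them, you can fold the pricing table into a small estimator. The per-million-token prices below mirror the table above; verify them against xAI's current pricing page before relying on them, as rates change:

```python
# Per-million-token prices from the table above; confirm against
# xAI's pricing page before depending on these numbers.
PRICING = {
    "grok-3":           {"input": 3.00, "output": 15.00},
    "grok-3-fast":      {"input": 5.00, "output": 25.00},
    "grok-3-mini":      {"input": 0.30, "output": 0.50},
    "grok-3-mini-fast": {"input": 0.10, "output": 0.50},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated cost in USD for a single request."""
    prices = PRICING[model]
    return (input_tokens * prices["input"] + output_tokens * prices["output"]) / 1_000_000

# Example: 10,000 input + 2,000 output tokens on grok-3
# = (10_000 * 3.00 + 2_000 * 15.00) / 1e6 = $0.06
```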
Strategies for Staying Within Limits
1. Use the Right Model for the Task
Do not use grok-3 for tasks that grok-3-mini-fast can handle:
```python
def choose_model(task_complexity):
    """Select the appropriate model based on task complexity."""
    if task_complexity == "simple":
        return "grok-3-mini-fast"   # Classification, extraction, formatting
    elif task_complexity == "medium":
        return "grok-3-mini"        # Summarization, Q&A, translation
    elif task_complexity == "complex":
        return "grok-3-fast"        # Analysis, coding, creative writing
    else:
        return "grok-3"             # Research, complex reasoning, multi-step
```
2. Batch Requests When Possible
Combine multiple small prompts into a single request:
```python
# Instead of 5 separate requests:
# "Translate X to French", "Translate Y to French", ...
# Use one request:
prompt = """Translate each of the following to French:
1. Hello, how are you?
2. Where is the nearest train station?
3. I would like to order coffee.
4. What time does the museum close?
5. Thank you very much."""
# 1 request instead of 5
```
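Batching only pays off if you can split the combined answer back into its parts. One hedged approach is to ask for numbered output and parse the numbering; this assumes the model follows the format, so validate the count:

```python
import re

def split_numbered_response(text: str, expected: int) -> list[str]:
    """Split a '1. ... 2. ...' style response back into individual answers."""
    # Leading newline lets the pattern match the first "1." as well
    parts = re.split(r"\n\s*\d+\.\s*", "\n" + text.strip())
    answers = [p.strip() for p in parts if p.strip()]
    if len(answers) != expected:
        raise ValueError(f"Expected {expected} answers, got {len(answers)}")
    return answers
```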
3. Cache Responses
Store responses locally to avoid repeating identical queries:
```python
import hashlib
import json
import os

CACHE_DIR = ".grok_cache"
os.makedirs(CACHE_DIR, exist_ok=True)

def cached_grok_call(messages, model="grok-3"):
    """Cache API responses to avoid duplicate calls."""
    # sort_keys=True keeps the cache key stable regardless of dict key order
    cache_key = hashlib.md5(
        json.dumps({"messages": messages, "model": model}, sort_keys=True).encode()
    ).hexdigest()
    cache_path = os.path.join(CACHE_DIR, f"{cache_key}.json")
    if os.path.exists(cache_path):
        with open(cache_path, "r") as f:
            return json.load(f)["response"]
    response = call_grok_with_retry(messages, model=model)
    with open(cache_path, "w") as f:
        json.dump({"response": response}, f)
    return response
```
4. Monitor Usage with Headers
Parse rate limit headers to make informed decisions:
```python
def call_with_monitoring(messages, model="grok-3"):
    """Make an API call and report rate limit status."""
    response = client.chat.completions.with_raw_response.create(
        model=model,
        messages=messages,
        max_tokens=2048
    )
    headers = response.headers
    remaining_requests = int(headers.get("x-ratelimit-remaining-requests", 0))
    remaining_tokens = int(headers.get("x-ratelimit-remaining-tokens", 0))
    if remaining_requests < 5:
        print(f"Warning: Only {remaining_requests} requests remaining this minute")
    if remaining_tokens < 5000:
        print(f"Warning: Only {remaining_tokens} tokens remaining this minute")
    return response.parse().choices[0].message.content
```
Grok 3 vs Competitors: Rate Limit Comparison
| Provider | Free Tier RPM | Paid Tier RPM | Free Credits |
|---|---|---|---|
| xAI (Grok 3) | 5 | 60-1,000+ | $25/month |
| OpenAI (GPT-4o) | 3 | 500-10,000 | $5 one-time |
| Anthropic (Claude) | 5 | 50-4,000 | $5 one-time |
| Google (Gemini 2.5 Pro) | 5 | 360-1,000 | $0 (free tier) |
Conclusion
The Grok 3 API offers competitive rate limits that scale with your spending, an OpenAI-compatible interface for easy integration, and a generous free tier for getting started. By choosing the right model tier, implementing proper retry logic, and caching responses, you can build reliable applications that work well within the limits.
If your project requires AI-generated media alongside Grok-powered text and reasoning, such as AI avatars, text-to-video, or voice cloning, Hypereal AI offers straightforward pay-as-you-go API access to production-grade generative media models that integrate with your existing AI stack.