Why Rate Limiting Exists
Rate limiting is not just a defensive measure — it’s a first-class systems design concern that touches availability, fairness, cost control, and security simultaneously.
The four reasons you rate limit:
- DDoS and abuse prevention — A single client sending 100,000 requests/second can saturate your backend. Even well-intentioned clients with buggy retry logic can create an accidental DDoS.
- Cost control — Every request costs compute, database reads, third-party API calls (OpenAI, payment processors). A single runaway script can generate thousands of dollars in cloud bills overnight.
- Fairness — In a multi-tenant system, one power user consuming 90% of capacity degrades service for everyone else. Rate limits enforce equitable resource sharing.
- SLA protection — Downstream services (payment gateways, SMS providers) have their own rate limits. Your rate limiter keeps you from tripping those limits and propagating their failures back through your system.
Real numbers:
- GitHub: 5,000 requests/hour per authenticated user, 60/hour unauthenticated
- Stripe: 100 requests/second per secret key (burst up to 200)
- Twitter API v2: varies by tier, 500k tweets/month at basic
- Cloudflare: rate limiting rules configurable to any threshold, enforced at edge
Algorithm 1: Fixed Window Counter
The simplest algorithm. Divide time into fixed windows (1 minute, 1 hour). Count requests in the current window. Reject if count exceeds limit.
Timeline (limit: 10 req/min):
Minute 1 [00:00 - 00:59] |██████████░░| 10 requests — full
Minute 2 [01:00 - 01:59] |░░░░░░░░░░░░| 0 requests — reset
async function isAllowed(clientId: string, limit: number): Promise<boolean> {
  const windowKey = `ratelimit:${clientId}:${Math.floor(Date.now() / 60000)}`;
  const count = await redis.incr(windowKey);
  if (count === 1) {
    // Set expiry only on first increment to avoid race condition
    await redis.expire(windowKey, 60);
  }
  return count <= limit;
}
The boundary burst problem (critical flaw):
Limit: 10 req/min
00:59 — client sends 10 requests (window 1: count = 10, full)
01:00 — window resets
01:01 — client sends 10 more requests (window 2: count = 10)
In 2 seconds, the client sent 20 requests — 2x the limit
with zero rate limiting penalty.
This is the boundary burst: at every window boundary, an attacker can send 2× the limit. This is exploitable and means fixed window does not guarantee your stated limit.
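The flaw is easy to reproduce with a minimal in-memory sketch (hypothetical class; the clock is passed in explicitly so the boundary is deterministic):

```typescript
// Hypothetical in-memory fixed window limiter. The clock is injected so
// the window boundary can be crossed deterministically in a test.
class FixedWindowLimiter {
  private counts = new Map<number, number>();

  constructor(private limit: number, private windowMs: number) {}

  isAllowed(nowMs: number): boolean {
    const window = Math.floor(nowMs / this.windowMs);
    const count = (this.counts.get(window) ?? 0) + 1;
    this.counts.set(window, count);
    return count <= this.limit;
  }
}

// Reproduce the boundary burst: limit 10/min, 10 requests at 00:59,
// 10 more at 01:01 — all 20 are allowed within 2 seconds.
const limiter = new FixedWindowLimiter(10, 60_000);
let allowed = 0;
for (let i = 0; i < 10; i++) if (limiter.isAllowed(59_000)) allowed++;
for (let i = 0; i < 10; i++) if (limiter.isAllowed(61_000)) allowed++;
// allowed === 20: twice the stated limit, with no penalty
```

The two bursts land in different windows (59,000 ms is window 0; 61,000 ms is window 1), so each sees a fresh counter.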
When to use: Internal admin tools, low-stakes APIs, or when simplicity outweighs precision. Never use for security-critical rate limiting.
Space: O(1) per client (single counter). Time: O(1). Redis ops: 2 (INCR + EXPIRE on first request).
Algorithm 2: Sliding Window Log
Maintain a timestamped log of every request. On each request, remove entries older than the window, count remaining entries.
async function isAllowed(
  clientId: string,
  limit: number,
  windowMs: number
): Promise<boolean> {
  const now = Date.now();
  const windowStart = now - windowMs;
  const key = `ratelimit:log:${clientId}`;
  // Atomic Lua script: remove old entries, add current, count
  const count = await redis.eval(`
    local key = KEYS[1]
    local window_start = ARGV[1]
    local now = ARGV[2]
    local window_ms = tonumber(ARGV[3])
    -- Remove all entries older than the window
    redis.call('ZREMRANGEBYSCORE', key, '-inf', window_start)
    -- Add current request with score = timestamp
    redis.call('ZADD', key, now, now)
    -- Expire the whole key once the window has fully passed
    redis.call('PEXPIRE', key, window_ms)
    -- Count requests in window
    return redis.call('ZCARD', key)
  `, 1, key, windowStart, now, windowMs) as number;
  return count <= limit;
}
Accurate but memory-expensive:
1 client, 1000 req/min limit, each entry = 16 bytes:
Per client: 1000 × 16 bytes = 16 KB
At 1M clients: 16 GB RAM just for rate limit state
Compare to fixed window:
Per client: 8 bytes (single integer)
At 1M clients: 8 MB
When to use: When accuracy is paramount and client count is bounded. Suitable for per-user limits where you have <10K active users. Not viable for high-cardinality public APIs.
Algorithm 3: Sliding Window Counter (Approximation)
The clever middle ground. Uses two fixed windows and weights them by overlap with the current sliding window.
Window size: 60 seconds
Current time: 01:15 (i.e., 15 seconds into minute 1)
Previous window (00:00–00:59): 8 requests
Current window (01:00–01:59): 3 requests
Weight of previous window = (60 - 15) / 60 = 0.75
Approximate count = 8 × 0.75 + 3 × 1.0
= 6 + 3 = 9 requests in sliding window
async function isAllowed(
  clientId: string,
  limit: number,
  windowMs: number
): Promise<boolean> {
  const now = Date.now();
  const windowStart = Math.floor(now / windowMs) * windowMs;
  const prevWindowStart = windowStart - windowMs;
  const currentKey = `ratelimit:${clientId}:${windowStart}`;
  const prevKey = `ratelimit:${clientId}:${prevWindowStart}`;
  const [currentCount, prevCount] = await redis.mget(currentKey, prevKey);
  const current = parseInt(currentCount ?? '0');
  const prev = parseInt(prevCount ?? '0');
  // How far into the current window are we?
  const elapsed = now - windowStart;
  const prevWeight = (windowMs - elapsed) / windowMs;
  const approximateCount = prev * prevWeight + current;
  if (approximateCount >= limit) {
    return false;
  }
  // Increment current window
  await redis.multi()
    .incr(currentKey)
    .expire(currentKey, Math.ceil(windowMs / 1000) * 2)
    .exec();
  return true;
}
Error bound: This approximation assumes requests in the previous window were uniformly distributed — which is usually true at scale. The maximum error is <1% of the limit in practice (Cloudflare found this acceptable for their use case). Memory cost: O(1) per client — just two counters.
Used by: Cloudflare’s rate limiting, Redis’s own rate limiting documentation.
Algorithm 4: Token Bucket
The most widely used algorithm in production systems. Conceptually: a bucket holds tokens. Each request consumes one token. Tokens refill at a fixed rate. If the bucket is empty, reject the request.
Bucket capacity: 100 tokens
Refill rate: 10 tokens/second
t=0: bucket = 100 (full)
t=0: burst of 100 requests → bucket = 0
t=1: 10 tokens added → bucket = 10
t=1: 10 requests → bucket = 0
t=5: 50 tokens added → bucket = 50
Key insight: Token bucket allows bursting up to the bucket capacity. A client can send 100 requests instantly, then is throttled to 10/second. This is ideal for real-world clients that have bursty-but-bounded traffic.
// Shape of the per-client state stored in a Redis hash
interface TokenBucket {
  tokens: number;
  lastRefillTime: number;
}

class TokenBucketRateLimiter {
  constructor(
    private readonly redis: Redis,
    private readonly capacity: number,
    private readonly refillRatePerSecond: number
  ) {}

  async isAllowed(clientId: string): Promise<{ allowed: boolean; tokensRemaining: number }> {
    const key = `tokenbucket:${clientId}`;
    const now = Date.now();
    // Use Lua script for atomicity — read-modify-write must be atomic
    const result = await this.redis.eval(`
      local key = KEYS[1]
      local now = tonumber(ARGV[1])
      local capacity = tonumber(ARGV[2])
      local refill_rate = tonumber(ARGV[3])
      local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
      local tokens = tonumber(bucket[1]) or capacity
      local last_refill = tonumber(bucket[2]) or now
      -- Calculate tokens to add since last refill
      local elapsed_seconds = (now - last_refill) / 1000
      local tokens_to_add = elapsed_seconds * refill_rate
      tokens = math.min(capacity, tokens + tokens_to_add)
      local allowed = 0
      if tokens >= 1 then
        tokens = tokens - 1
        allowed = 1
      end
      redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
      redis.call('EXPIRE', key, 3600)
      return {allowed, math.floor(tokens)}
    `, 1, key, now, this.capacity, this.refillRatePerSecond) as [number, number];
    return {
      allowed: result[0] === 1,
      tokensRemaining: result[1]
    };
  }
}
Properties:
- Handles bursts (up to bucket capacity)
- Smooth long-term rate (limited by refill rate)
- O(1) space per client, O(1) time
- 1 Redis round-trip (Lua script atomic)
Used by: Stripe (they’ve written publicly about using token bucket), AWS API Gateway, many service meshes.
Algorithm 5: Leaky Bucket
A queue with a fixed-rate drain. Requests enter the queue; they exit (are processed) at a constant rate. If the queue is full, reject.
Queue (capacity 10): [req1][req2][req3]...
Drain rate: 5 requests/second
Requests processed at exactly 5/sec, regardless of arrival pattern.
Key difference from token bucket: Token bucket allows variable processing rate (bursting). Leaky bucket enforces a perfectly constant output rate. No bursting.
Use case: Rate limiting outgoing calls to a downstream service with strict rate limits (SMS gateway, payment processor). You want to smooth out your request pattern, not just cap it.
Tradeoff: Queued requests add latency, and once the queue is full, new arrivals are dropped outright.
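This section is prose-only; a minimal in-memory sketch of the idea (names and the "water level" bookkeeping are illustrative — production versions usually live in Redis or the proxy):

```typescript
// Hypothetical leaky bucket as a meter: "water" is queued work that drains
// at a constant rate; arrivals that would overflow capacity are rejected.
// The clock is injected for deterministic behavior.
class LeakyBucket {
  private water = 0;       // currently queued requests
  private lastDrain = 0;   // ms timestamp of the last drain calculation

  constructor(private capacity: number, private drainPerSecond: number) {}

  tryEnqueue(nowMs: number): boolean {
    // Drain at the constant rate for the time elapsed since the last check
    const elapsedSec = (nowMs - this.lastDrain) / 1000;
    this.water = Math.max(0, this.water - elapsedSec * this.drainPerSecond);
    this.lastDrain = nowMs;
    if (this.water + 1 > this.capacity) return false; // queue full — reject
    this.water += 1;
    return true;
  }
}
```

Note the contrast with token bucket: here an accepted request waits roughly `queueLength / drainRate` before being processed, which is exactly the latency the trade-off above describes, and output never exceeds `drainPerSecond`.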
Algorithm Comparison
| Algorithm | Burst Handling | Memory | Accuracy | Complexity |
|---|---|---|---|---|
| Fixed Window | Allows 2× at boundary | O(1) | Low | Very low |
| Sliding Window Log | Not allowed | O(n) requests | Exact | Medium |
| Sliding Window Counter | Approximate smooth | O(1) | ~99% | Low |
| Token Bucket | Allows bursts up to capacity | O(1) | Exact | Medium |
| Leaky Bucket | No burst, constant output | O(n) queue | Exact | Medium |
Real company choices:
- Stripe: Token bucket (public reference in their rate limiting blog post)
- Cloudflare: Sliding window counter (efficient at trillions of requests/day)
- GitHub: Fixed window (simple, documented in API docs)
- Nginx: Leaky bucket (the limit_req module uses leaky bucket)
- AWS API Gateway: Token bucket
- Shopify: Leaky bucket (they’ve written about this publicly)
Distributed Rate Limiting with Redis
Single-server rate limiting is trivial. The hard problem is rate limiting across a fleet of API servers.
Naive Approach: Centralized Redis
All API servers talk to one Redis instance for rate limit state:
API Server 1 ──┐
API Server 2 ──┼──► Redis ──► rate limit state
API Server 3 ──┘
Works well for most use cases. Redis handles ~100K operations/second on a single node. At higher throughput, Redis can become a bottleneck.
Lua Scripts for Atomicity
Every rate limiting operation must be read-modify-write atomic. Non-atomic implementation:
// WRONG: Race condition
const count = await redis.get(key); // Read
if (count < limit) {
  await redis.incr(key); // Write
  return true; // Two API servers can both pass this check
}
Two API servers executing concurrently both read count = 9 for a limit = 10, both pass, both increment → actual count becomes 11. Rate limit violated.
Fix: Redis Lua scripts run atomically (single-threaded Redis execution):
-- Atomic INCR + check
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local count = redis.call('INCR', key)
if count == 1 then
  redis.call('EXPIRE', key, window)
end
if count > limit then
  return 0 -- rejected
else
  return 1 -- allowed
end
Sliding Window with Redis Sorted Sets
For exact sliding window, use a Redis sorted set with timestamp as score:
async function slidingWindowCheck(
  clientId: string,
  limit: number,
  windowMs: number
): Promise<boolean> {
  const key = `sw:${clientId}`;
  const now = Date.now();
  const windowStart = now - windowMs;
  // Pipeline: remove old, add current, count — sent in one round-trip
  const pipeline = redis.pipeline();
  pipeline.zremrangebyscore(key, '-inf', windowStart);
  pipeline.zadd(key, now, `${now}-${Math.random()}`); // unique member
  pipeline.zcard(key);
  pipeline.expire(key, Math.ceil(windowMs / 1000));
  const results = await pipeline.exec();
  const count = results[2][1] as number;
  return count <= limit;
}
Note: Redis pipeline is not atomic (commands can be interleaved with other clients). For exact correctness, use Lua script. Pipelines reduce round-trips but don’t guarantee atomicity.
Local Cache + Periodic Sync (High Throughput)
At extreme scale (millions of RPS per service), even Redis becomes a bottleneck. Solution: synchronize rate limit state periodically rather than per-request.
class DistributedRateLimiter {
  private localTokens = new Map<string, number>();
  private syncInterval: ReturnType<typeof setInterval>;

  constructor(
    private redis: Redis,
    private capacity: number,
    private syncIntervalMs: number = 100
  ) {
    // Sync local state with Redis every 100ms
    this.syncInterval = setInterval(() => this.sync(), syncIntervalMs);
  }

  isAllowed(clientId: string): boolean {
    const tokens = this.localTokens.get(clientId) ?? this.capacity;
    if (tokens < 1) return false;
    this.localTokens.set(clientId, tokens - 1);
    return true;
  }

  private async sync(): Promise<void> {
    // For each client tracked locally, subtract consumed tokens from Redis
    for (const [clientId, localTokens] of this.localTokens.entries()) {
      const consumed = this.capacity - localTokens;
      if (consumed > 0) {
        // Push local consumption to the shared counter, then reset locally
        await this.redis.decrby(`tokens:${clientId}`, consumed);
        this.localTokens.delete(clientId);
      }
    }
  }
}
Trade-off: Allows small overages within the sync window (100ms × rate). For a 1000 req/s limit with 10 API servers and 100ms sync, worst case overage = 10 servers × 100ms × 1000/s = 1000 extra requests. Acceptable for most use cases.
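That worst-case arithmetic can be captured in a helper (the function name is illustrative):

```typescript
// Worst-case extra requests admitted during one sync window: each server
// can locally admit up to (syncMs / 1000) * ratePerSec requests before it
// learns about the other servers' consumption.
function worstCaseOverage(
  servers: number,
  syncMs: number,
  ratePerSec: number
): number {
  return servers * (syncMs / 1000) * ratePerSec;
}

// The example from the text: 10 servers, 100ms sync, 1000 req/s limit
const extra = worstCaseOverage(10, 100, 1000); // 1000 extra requests
```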
Used by: Cloudflare’s internal rate limiting at edge (they do local counting + periodic sync to reduce inter-datacenter coordination).
Rate Limit Headers
Communicate limit state to clients so they can self-throttle:
HTTP/1.1 200 OK
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 487
X-RateLimit-Reset: 1705315200 # Unix timestamp: when window resets
X-RateLimit-Policy: 1000;w=3600 # IETF draft: 1000 per 3600s window
# On 429:
HTTP/1.1 429 Too Many Requests
Retry-After: 47 # Seconds until client should retry
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1705315200
IETF RateLimit Headers (draft-ietf-httpapi-ratelimit-headers):
RateLimit-Limit: 100
RateLimit-Remaining: 42
RateLimit-Reset: 30 # Seconds until reset (relative, not absolute)
This draft spec is gaining adoption across API frameworks and gateways. Prefer it for new APIs.
429 response body should include machine-readable detail:
{
"type": "https://api.example.com/errors/rate-limit-exceeded",
"title": "Rate Limit Exceeded",
"status": 429,
"detail": "You have exceeded the 1000 requests/hour limit.",
"retryAfter": 47,
"limit": 1000,
"window": "1 hour"
}
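A sketch of a helper that derives both header sets and the 429 body from limiter state (field and function names are assumptions, not a standard API):

```typescript
interface LimitState {
  limit: number;
  remaining: number;
  resetEpochSec: number; // absolute Unix timestamp when the window resets
}

// Build the legacy X- headers, the IETF draft headers, and — when the
// limit is exhausted — a problem-details style 429 body.
function rateLimitResponse(state: LimitState, nowEpochSec: number) {
  const resetIn = Math.max(0, state.resetEpochSec - nowEpochSec);
  const headers: Record<string, string> = {
    'X-RateLimit-Limit': String(state.limit),
    'X-RateLimit-Remaining': String(state.remaining),
    'X-RateLimit-Reset': String(state.resetEpochSec), // absolute
    'RateLimit-Limit': String(state.limit),
    'RateLimit-Remaining': String(state.remaining),
    'RateLimit-Reset': String(resetIn),               // draft: relative seconds
  };
  if (state.remaining > 0) {
    return { status: 200, headers, body: null };
  }
  headers['Retry-After'] = String(resetIn);
  return {
    status: 429,
    headers,
    body: {
      type: 'https://api.example.com/errors/rate-limit-exceeded',
      title: 'Rate Limit Exceeded',
      status: 429,
      retryAfter: resetIn,
      limit: state.limit,
    },
  };
}
```

Computing the relative reset from a single absolute timestamp keeps the two header families consistent with each other.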
Where to Enforce Rate Limits
Internet ──► [Load Balancer] ──► [API Gateway] ──► [App Service] ──► [Database]
L4 L7 App layer
Layer 4: Load Balancer (L4 — TCP/UDP)
Can enforce connection rate limits and bandwidth limits but cannot inspect HTTP content. Useful for blocking abusive IPs at the TCP level (SYN flood protection).
# nginx: limit concurrent connections from a single IP
limit_conn_zone $binary_remote_addr zone=conn_limit:10m;
limit_conn conn_limit 20; # max 20 concurrent connections per IP
Layer 7: API Gateway (L7 — HTTP)
The right layer for request rate limiting. Can inspect HTTP headers, paths, and authentication context. Apply per-path, per-method, per-API-key limits here.
Benefits: Centralised enforcement; app services don’t need to implement it. Enforced before request reaches app logic (fail fast).
Application Layer
Necessary when rate limiting depends on business logic — e.g., “limit to 3 payment attempts per order” or “limit failed login attempts to 5 per account per 15 minutes.” The API gateway doesn’t have the business context to enforce these.
Rule: Enforce at the highest applicable layer. Only push to app layer when business logic is required.
Rate Limiting Granularity
| Granularity | Pros | Cons | Use Case |
|---|---|---|---|
| IP Address | No auth required, stops unauthenticated abuse | Shared IPs (NAT, corporate) can block many users | Unauthenticated endpoints, login rate limiting |
| API Key | Identifies client application | Single key shared among users | Public APIs, partner integrations |
| User Account | Fairest for multi-user clients | Requires authentication | Authenticated user actions |
| Endpoint | Protects expensive operations specifically | Complex to manage many rules | /search, /export, AI inference endpoints |
| Organisation/Tenant | Enforces business plan limits | Doesn’t protect against single rogue user | SaaS tier limits |
Layered approach (recommended):
Request comes in:
1. IP-level: max 10,000 req/min per IP (DDoS protection)
2. API key-level: max 1,000 req/min per key (abuse protection)
3. User-level: max 100 req/min per user account (fairness)
4. Endpoint-level: /search max 20 req/min (expensive resource)
All four layers can fire. Client gets the most restrictive limit that applies.
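A sketch of how the layering composes (the `check` callback, key formats, and limits are illustrative — plug in any of the algorithms above):

```typescript
// Each layer is a (name, key, limit) triple checked against its own
// counter. A request is allowed only when every applicable layer still
// has budget; the first layer to deny wins.
type Check = (key: string, limit: number) => boolean;

function layeredCheck(
  check: Check,
  req: { ip: string; apiKey: string; userId: string; path: string }
): { allowed: boolean; deniedBy?: string } {
  const layers: Array<[string, string, number]> = [
    ['ip', `ip:${req.ip}`, 10_000],
    ['api-key', `key:${req.apiKey}`, 1_000],
    ['user', `user:${req.userId}`, 100],
  ];
  // Expensive endpoints get their own, tighter limit
  if (req.path === '/search') {
    layers.push(['endpoint', `ep:${req.userId}:/search`, 20]);
  }
  for (const [name, key, limit] of layers) {
    if (!check(key, limit)) {
      return { allowed: false, deniedBy: name }; // most restrictive layer fired
    }
  }
  return { allowed: true };
}
```

Checking broad layers (IP) before narrow ones (endpoint) means cheap DDoS rejection happens before any per-user bookkeeping.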
Handling Distributed State: The Race Condition Problem
Two API servers, one Redis, limit = 10 req/min:
Server A reads count = 9 ──────────────────────────────┐
Server B reads count = 9 ──────────────────┐           │
Server B increments to 10, returns allowed │           │
                                           ├──► Both passed!
Server A increments to 11, returns allowed │    Count is now 11
Solutions:
- Lua scripts (shown above) — atomic at Redis level
- Redis transactions (MULTI/EXEC with WATCH) — optimistic locking, retry on conflict
- Redis modules (RedisCell) — implements GCRA (Generic Cell Rate Algorithm) natively in a Redis C module, extremely fast
# RedisCell: CL.THROTTLE key max_burst count_per_period period [quantity]
CL.THROTTLE clientId 100 10 1 1
#                    │   │  │ └ quantity to consume
#                    │   │  └ period (seconds)
#                    │   └ rate (requests/period)
#                    └ max burst (capacity)
# Returns: [limited(0=allowed,1=rejected), total, remaining, retry_after_s, reset_after_s]
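For intuition, GCRA can be sketched in-process — per client it stores a single timestamp, the "theoretical arrival time" (TAT), instead of a token count (an illustrative implementation, not RedisCell's actual code; the clock is injected for testability):

```typescript
// Minimal GCRA sketch. State per client is one number: the theoretical
// arrival time (TAT). Behaves like a token bucket of size `burst` refilling
// at `ratePerSec`, but needs no refill loop and no floating token count.
class GCRALimiter {
  private tat = new Map<string, number>();

  constructor(
    private burst: number,     // max burst size (like bucket capacity)
    private ratePerSec: number // sustained rate
  ) {}

  isAllowed(clientId: string, nowMs: number): boolean {
    const interval = 1000 / this.ratePerSec;       // emission interval
    const tolerance = (this.burst - 1) * interval; // allowed "earliness"
    // TAT never sits in the past: idle clients don't bank extra burst
    const tat = Math.max(this.tat.get(clientId) ?? nowMs, nowMs);
    if (tat - nowMs > tolerance) return false;     // arrived too early — reject
    this.tat.set(clientId, tat + interval);        // schedule next ideal arrival
    return true;
  }
}
```

With `burst = 3` and `ratePerSec = 1`, three requests at t=0 are allowed, the fourth is rejected, and one more slot opens after each further second — the same shape as a token bucket, from one stored number.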
Rate Limiting Patterns for Specific Use Cases
Login Rate Limiting
// Different limits for different failure scenarios
async function checkLoginRateLimit(ip: string, username: string): Promise<void> {
  // Limit by IP: prevents credential stuffing
  const ipAllowed = await limiter.check(`login:ip:${ip}`, { limit: 20, window: '15m' });
  // Limit by username: prevents targeted brute force
  const userAllowed = await limiter.check(`login:user:${username}`, { limit: 5, window: '15m' });
  if (!ipAllowed || !userAllowed) {
    throw new RateLimitError('Too many login attempts. Try again in 15 minutes.');
  }
}

// On successful login: reset the per-user counter
await redis.del(`login:user:${username}`);
Exponential Backoff with Jitter (Client Side)
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxRetries = 5
): Promise<T> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (error instanceof RateLimitError && attempt < maxRetries) {
        const retryAfter = error.retryAfter ?? 0;
        // Exponential backoff, capped at 30 seconds
        const backoff = Math.min(Math.pow(2, attempt) * 1000, 30000);
        // Full jitter on the backoff portion — but never sleep less than
        // the server's Retry-After
        const delay = retryAfter * 1000 + Math.random() * backoff;
        await sleep(delay);
        continue;
      }
      throw error;
    }
  }
  throw new Error('Max retries exceeded');
}
Full jitter (multiply by random) is preferred over additive jitter — it avoids thundering herd when many clients retry simultaneously after a rate limit reset.
Interview Checklist
Algorithm Fundamentals
- Explain the fixed window counter. What is the boundary burst problem?
- How does the sliding window log algorithm work? What is its memory complexity?
- How does the sliding window counter approximate the sliding window? What is the max error?
- Explain the token bucket algorithm. How does it handle bursts differently from leaky bucket?
- Which algorithm does Stripe use? Which does Cloudflare use?
Distributed Implementation
- Why must rate limit operations be atomic? Show the race condition
- How do Redis Lua scripts solve the atomicity problem?
- Design a rate limiter using Redis sorted sets for exact sliding window
- How would you implement rate limiting for 10M RPS without Redis becoming a bottleneck?
Architecture
- At what layer should you enforce rate limits? LB vs API gateway vs app?
- Compare rate limiting by IP vs API key vs user account
- How do you rate limit unauthenticated endpoints without API keys?
- Design rate limiting for a login endpoint to prevent both credential stuffing and brute force
Client-Side
- What headers should a rate-limited response include?
- Implement exponential backoff with jitter. Why is jitter necessary?
- A client receives Retry-After: 60. What should it do?
Senior-Level Design
- Design the rate limiting system for Stripe’s public API (100+ req/s per key, millions of keys)
- How would you implement rate limiting at Cloudflare’s scale (25M requests/second)?
- How do you handle rate limiting when your service is behind a shared NAT?
- Design a rate limiter that supports different limits per pricing tier (Free: 100/h, Pro: 10k/h, Enterprise: unlimited)