Rate Limiting

Why Rate Limiting Exists

Rate limiting is not just a defensive measure — it’s a first-class systems design concern that touches availability, fairness, cost control, and security simultaneously.

The four reasons you rate limit:

  1. DDoS and abuse prevention — A single client sending 100,000 requests/second can saturate your backend. Even well-intentioned clients with buggy retry logic can create accidental DDoS.

  2. Cost control — Every request costs compute, database reads, third-party API calls (OpenAI, payment processors). A single runaway script can generate thousands of dollars in cloud bills overnight.

  3. Fairness — In a multi-tenant system, one power user consuming 90% of capacity degrades service for everyone else. Rate limits enforce equitable resource sharing.

  4. SLA protection — Downstream services (payment gateways, SMS providers) have their own rate limits. Your rate limiter protects you from propagating failures upstream.

Real numbers:

  • GitHub: 5,000 requests/hour per authenticated user, 60/hour unauthenticated
  • Stripe: 100 requests/second per secret key (burst up to 200)
  • Twitter API v2: varies by tier, 500k tweets/month at basic
  • Cloudflare: rate limiting rules configurable to any threshold, enforced at edge

Algorithm 1: Fixed Window Counter

The simplest algorithm. Divide time into fixed windows (1 minute, 1 hour). Count requests in the current window. Reject if count exceeds limit.

Timeline (limit: 10 req/min):

Minute 1 [00:00 - 00:59]   |██████████░░| 10 requests — full
Minute 2 [01:00 - 01:59]   |░░░░░░░░░░░░| 0 requests — reset

async function isAllowed(clientId: string, limit: number): Promise<boolean> {
  const windowKey = `ratelimit:${clientId}:${Math.floor(Date.now() / 60000)}`;

  const count = await redis.incr(windowKey);

  if (count === 1) {
    // First request in this window: set the TTL exactly once.
    // (A crash between INCR and EXPIRE would leave the key without a TTL;
    // wrap both in a Lua script if that matters.)
    await redis.expire(windowKey, 60);
  }

  return count <= limit;
}

The boundary burst problem (critical flaw):

Limit: 10 req/min

00:59 — client sends 10 requests  (window 1: count = 10, full)
01:00 — window resets
01:01 — client sends 10 more requests (window 2: count = 10)

In 2 seconds, the client sent 20 requests — 2x the limit
with zero rate limiting penalty.

This is the boundary burst: at every window boundary, an attacker can send 2× the limit. This is exploitable and means fixed window does not guarantee your stated limit.
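The burst is easy to reproduce with an in-memory version of the counter (a sketch with an injected clock, no Redis; the class name is illustrative):

```typescript
// Fixed window counter with an injectable clock, for demonstration only.
class FixedWindowLimiter {
  private counts = new Map<number, number>();

  constructor(private limit: number, private windowMs: number) {}

  isAllowed(nowMs: number): boolean {
    const window = Math.floor(nowMs / this.windowMs);
    const count = (this.counts.get(window) ?? 0) + 1;
    this.counts.set(window, count);
    return count <= this.limit;
  }
}

// Reproduce the boundary burst: 10 req/min limit.
const limiter = new FixedWindowLimiter(10, 60_000);

let allowed = 0;
for (let i = 0; i < 10; i++) if (limiter.isAllowed(59_000)) allowed++; // t = 00:59
for (let i = 0; i < 10; i++) if (limiter.isAllowed(61_000)) allowed++; // t = 01:01

// All 20 requests pass within a 2-second span: double the stated limit.
```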

When to use: Internal admin tools, low-stakes APIs, or when simplicity outweighs precision. Never use for security-critical rate limiting.

Space: O(1) per client (single counter). Time: O(1). Redis ops: 2 (INCR + EXPIRE on first request).


Algorithm 2: Sliding Window Log

Maintain a timestamped log of every request. On each request, remove entries older than the window, count remaining entries.

async function isAllowed(
  clientId: string, 
  limit: number, 
  windowMs: number
): Promise<boolean> {
  const now = Date.now();
  const windowStart = now - windowMs;
  const key = `ratelimit:log:${clientId}`;

  // Atomic Lua script: remove old entries, add current, count
  const count = await redis.eval(`
    local key = KEYS[1]
    local window_start = ARGV[1]
    local now = ARGV[2]
    local limit = tonumber(ARGV[3])
    local window_ms = tonumber(ARGV[4])
    
    -- Remove all entries older than window
    redis.call('ZREMRANGEBYSCORE', key, '-inf', window_start)
    
    -- Add current request (score = timestamp). Note: two requests in the
    -- same millisecond collide on the member; append a random suffix to the
    -- member if exactness matters
    redis.call('ZADD', key, now, now)
    
    -- Set expiry
    redis.call('PEXPIRE', key, window_ms)
    
    -- Count requests in window
    return redis.call('ZCARD', key)
  `, 1, key, windowStart, now, limit, windowMs) as number;

  return count <= limit;
}

Accurate but memory-expensive:

1 client, 1000 req/min limit, at ~16 bytes per entry (optimistic — real
Redis sorted-set members carry additional overhead):
  Per client: 1000 × 16 bytes = 16 KB
  At 1M clients: 16 GB of RAM just for rate limit state

Compare to fixed window:
  Per client: 8 bytes (single integer)
  At 1M clients: 8 MB

When to use: When accuracy is paramount and client count is bounded. Suitable for per-user limits where you have <10K active users. Not viable for high-cardinality public APIs.
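Stripped of Redis, the log-and-prune logic is just an array of timestamps. A minimal in-memory sketch (note: unlike the Lua version above, this one does not record rejected requests):

```typescript
// Sliding window log in memory: keep a timestamp per request, prune old ones.
class SlidingWindowLog {
  private log: number[] = [];

  constructor(private limit: number, private windowMs: number) {}

  isAllowed(nowMs: number): boolean {
    const windowStart = nowMs - this.windowMs;
    // Drop entries that fell out of the window.
    this.log = this.log.filter((t) => t > windowStart);
    if (this.log.length >= this.limit) return false;
    this.log.push(nowMs);
    return true;
  }
}

// Limit 3 per 1000ms: three allowed, fourth rejected, allowed again
// once the window slides past the earlier entries.
const log = new SlidingWindowLog(3, 1_000);
const t0Results = [log.isAllowed(0), log.isAllowed(10), log.isAllowed(20), log.isAllowed(30)];
const later = log.isAllowed(1_500);
```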


Algorithm 3: Sliding Window Counter (Approximation)

The clever middle ground. Uses two fixed windows and weights them by overlap with the current sliding window.

Window size: 60 seconds
Current time: 01:15 (i.e., 15 seconds into the current window)
Previous window (00:00–00:59): 8 requests
Current window (01:00–01:59): 3 requests

Weight of previous window = (60 - 15) / 60 = 0.75

Approximate count = 8 × 0.75 + 3 × 1.0
                  = 6 + 3 = 9 requests in sliding window

async function isAllowed(
  clientId: string,
  limit: number,
  windowMs: number
): Promise<boolean> {
  const now = Date.now();
  const windowStart = Math.floor(now / windowMs) * windowMs;
  const prevWindowStart = windowStart - windowMs;
  
  const currentKey = `ratelimit:${clientId}:${windowStart}`;
  const prevKey = `ratelimit:${clientId}:${prevWindowStart}`;

  const [currentCount, prevCount] = await redis.mget(currentKey, prevKey);
  
  const current = parseInt(currentCount ?? '0', 10);
  const prev = parseInt(prevCount ?? '0', 10);
  
  // How far into the current window are we?
  const elapsed = now - windowStart;
  const prevWeight = (windowMs - elapsed) / windowMs;
  
  const approximateCount = prev * prevWeight + current;
  
  // Check-then-increment is not atomic: two servers can race past the limit.
  // Acceptable here because the algorithm is approximate anyway.
  if (approximateCount >= limit) {
    return false;
  }

  // Increment current window
  await redis.multi()
    .incr(currentKey)
    .expire(currentKey, Math.ceil(windowMs / 1000) * 2)
    .exec();

  return true;
}

Error bound: This approximation assumes requests in the previous window were uniformly distributed, which is usually a good assumption at scale. The practical error is small (Cloudflare measured it and found it acceptable for their edge rate limiting). Memory cost: O(1) per client, just two counters.
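The weighting step is pure arithmetic and easy to sanity-check in isolation (a small helper mirroring the worked example above):

```typescript
// Approximate request count in the sliding window from two fixed-window counters.
function approximateCount(
  prevCount: number,    // requests in the previous fixed window
  currentCount: number, // requests so far in the current fixed window
  elapsedMs: number,    // time elapsed in the current window
  windowMs: number
): number {
  const prevWeight = (windowMs - elapsedMs) / windowMs;
  return prevCount * prevWeight + currentCount;
}

// The worked example: 8 previous, 3 current, 15s into a 60s window.
const n = approximateCount(8, 3, 15_000, 60_000); // 9
```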

Used by: Cloudflare’s rate limiting, Redis’s own rate limiting documentation.


Algorithm 4: Token Bucket

The most widely used algorithm in production systems. Conceptually: a bucket holds tokens. Each request consumes one token. Tokens refill at a fixed rate. If the bucket is empty, reject the request.

Bucket capacity: 100 tokens
Refill rate:     10 tokens/second

t=0:    bucket = 100  (full)
t=0:    burst of 100 requests → bucket = 0
t=1:    10 tokens added → bucket = 10
t=1:    10 requests → bucket = 0
t=5:    50 tokens added → bucket = 50

Key insight: Token bucket allows bursting up to the bucket capacity. A client can send 100 requests instantly, then is throttled to 10/second. This is ideal for real-world clients that have bursty-but-bounded traffic.

interface TokenBucket {
  tokens: number;
  lastRefillTime: number;
}

class TokenBucketRateLimiter {
  constructor(
    private readonly redis: Redis,
    private readonly capacity: number,
    private readonly refillRatePerSecond: number
  ) {}

  async isAllowed(clientId: string): Promise<{ allowed: boolean; tokensRemaining: number }> {
    const key = `tokenbucket:${clientId}`;
    const now = Date.now();

    // Use Lua script for atomicity — read-modify-write must be atomic
    const result = await this.redis.eval(`
      local key = KEYS[1]
      local now = tonumber(ARGV[1])
      local capacity = tonumber(ARGV[2])
      local refill_rate = tonumber(ARGV[3])
      
      local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
      local tokens = tonumber(bucket[1]) or capacity
      local last_refill = tonumber(bucket[2]) or now
      
      -- Calculate tokens to add since last refill
      local elapsed_seconds = (now - last_refill) / 1000
      local tokens_to_add = elapsed_seconds * refill_rate
      tokens = math.min(capacity, tokens + tokens_to_add)
      
      local allowed = 0
      if tokens >= 1 then
        tokens = tokens - 1
        allowed = 1
      end
      
      redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
      redis.call('EXPIRE', key, 3600)
      
      return {allowed, math.floor(tokens)}
    `, 1, key, now, this.capacity, this.refillRatePerSecond) as [number, number];

    return {
      allowed: result[0] === 1,
      tokensRemaining: result[1]
    };
  }
}

Properties:

  • Handles bursts (up to bucket capacity)
  • Smooth long-term rate (limited by refill rate)
  • O(1) space per client, O(1) time
  • 1 Redis round-trip (Lua script atomic)

Used by: Stripe (they’ve written publicly about using token bucket), AWS API Gateway, many service meshes.
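The refill arithmetic in the Lua script above can be mirrored in memory for single-process services or unit tests (a sketch with an injected clock; the class name is illustrative):

```typescript
// Token bucket: refill continuously at a fixed rate, cap at capacity.
class InMemoryTokenBucket {
  private tokens: number;
  private lastRefillMs: number;

  constructor(
    private capacity: number,
    private refillRatePerSecond: number,
    nowMs = 0
  ) {
    this.tokens = capacity;
    this.lastRefillMs = nowMs;
  }

  isAllowed(nowMs: number): boolean {
    // Refill based on elapsed time, capped at capacity.
    const elapsedSeconds = (nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillRatePerSecond);
    this.lastRefillMs = nowMs;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// Replay the timeline: capacity 100, refill 10 tokens/second.
const bucket = new InMemoryTokenBucket(100, 10);
let burst = 0;
for (let i = 0; i < 100; i++) if (bucket.isAllowed(0)) burst++; // t=0 burst of 100
const overBurst = bucket.isAllowed(0);          // bucket empty, rejected
const afterOneSecond = bucket.isAllowed(1_000); // 10 tokens refilled, allowed
```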


Algorithm 5: Leaky Bucket

A queue with a fixed-rate drain. Requests enter the queue; they exit (are processed) at a constant rate. If the queue is full, reject.

Queue (capacity 10):   [req1][req2][req3]...
Drain rate:            5 requests/second

Requests processed at exactly 5/sec, regardless of arrival pattern.

Key difference from token bucket: Token bucket allows variable processing rate (bursting). Leaky bucket enforces a perfectly constant output rate. No bursting.

Use case: Rate limiting outgoing calls to a downstream service with strict rate limits (SMS gateway, payment processor). You want to smooth out your request pattern, not just cap it.

Tradeoff: Queued requests wait, which adds latency; once the queue is full, new arrivals are dropped outright.
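A leaky bucket can be modeled without a literal queue by tracking the current "water level", which drains at the fixed rate (an illustrative sketch; this meter formulation caps the backlog rather than actually delaying requests the way nginx's limit_req does):

```typescript
// Leaky bucket as a meter: level rises by 1 per request, drains at a constant rate.
class LeakyBucket {
  private level = 0;
  private lastDrainMs = 0;

  constructor(private capacity: number, private drainRatePerSecond: number) {}

  isAllowed(nowMs: number): boolean {
    // Drain the bucket according to elapsed time.
    const elapsedSeconds = (nowMs - this.lastDrainMs) / 1000;
    this.level = Math.max(0, this.level - elapsedSeconds * this.drainRatePerSecond);
    this.lastDrainMs = nowMs;
    if (this.level + 1 > this.capacity) return false; // bucket would overflow
    this.level += 1;
    return true;
  }
}

// Capacity 10, drain 5/sec: a burst of 15 simultaneous requests.
const lb = new LeakyBucket(10, 5);
let accepted = 0;
for (let i = 0; i < 15; i++) if (lb.isAllowed(0)) accepted++; // 10 fit, 5 overflow
const afterTwoSeconds = lb.isAllowed(2_000); // 10 units drained, accepted again
```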


Algorithm Comparison

Algorithm               | Burst Handling               | Memory        | Accuracy | Complexity
------------------------|------------------------------|---------------|----------|-----------
Fixed Window            | Allows 2× at boundary        | O(1)          | Low      | Very low
Sliding Window Log      | Not allowed                  | O(n) requests | Exact    | Medium
Sliding Window Counter  | Approximate smoothing        | O(1)          | ~99%     | Low
Token Bucket            | Allows bursts up to capacity | O(1)          | Exact    | Medium
Leaky Bucket            | No burst, constant output    | O(n) queue    | Exact    | Medium

Real company choices:

  • Stripe: Token bucket (public reference in their rate limiting blog post)
  • Cloudflare: Sliding window counter (efficient at 20+ trillion requests/day)
  • GitHub: Fixed window (simple, documented in API docs)
  • Nginx: Leaky bucket (limit_req module uses leaky bucket)
  • AWS API Gateway: Token bucket
  • Shopify: Leaky bucket (they’ve written about this publicly)

Distributed Rate Limiting with Redis

Single-server rate limiting is trivial. The hard problem is rate limiting across a fleet of API servers.

Naive Approach: Centralized Redis

All API servers talk to one Redis instance for rate limit state:

API Server 1 ──┐
API Server 2 ──┼──► Redis ──► rate limit state
API Server 3 ──┘

Works well for most use cases. Redis handles ~100K operations/second on a single node. At higher throughput, Redis can become a bottleneck.

Lua Scripts for Atomicity

Every rate limiting operation must be read-modify-write atomic. Non-atomic implementation:

// WRONG: race condition between read and write
const count = parseInt((await redis.get(key)) ?? '0', 10);  // Read
if (count < limit) {
  await redis.incr(key);                     // Write
  return true;                               // Two API servers can both pass this check
}

Two API servers executing concurrently both read count = 9 for a limit = 10, both pass, both increment → actual count becomes 11. Rate limit violated.

Fix: Redis Lua scripts run atomically (single-threaded Redis execution):

-- Atomic INCR + check
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])

local count = redis.call('INCR', key)
if count == 1 then
  redis.call('EXPIRE', key, window)
end

if count > limit then
  return 0  -- rejected
else
  return 1  -- allowed
end

Sliding Window with Redis Sorted Sets

For exact sliding window, use a Redis sorted set with timestamp as score:

async function slidingWindowCheck(
  clientId: string,
  limit: number,
  windowMs: number
): Promise<boolean> {
  const key = `sw:${clientId}`;
  const now = Date.now();
  const windowStart = now - windowMs;
  
  // Pipeline: remove old, add current, count — executed server-side
  const pipeline = redis.pipeline();
  pipeline.zremrangebyscore(key, '-inf', windowStart);
  pipeline.zadd(key, now, `${now}-${Math.random()}`);  // unique member
  pipeline.zcard(key);
  pipeline.expire(key, Math.ceil(windowMs / 1000));
  
  const results = await pipeline.exec();
  const count = results[2][1] as number;
  
  return count <= limit;
}

Note: Redis pipeline is not atomic (commands can be interleaved with other clients). For exact correctness, use Lua script. Pipelines reduce round-trips but don’t guarantee atomicity.

Local Cache + Periodic Sync (High Throughput)

At extreme scale (millions of RPS per service), even Redis becomes a bottleneck. Solution: synchronize rate limit state periodically rather than per-request.

class DistributedRateLimiter {
  private localTokens = new Map<string, number>();
  private syncInterval: ReturnType<typeof setInterval>;

  constructor(
    private redis: Redis,
    private capacity: number,
    private syncIntervalMs: number = 100
  ) {
    // Sync local state with Redis every 100ms
    this.syncInterval = setInterval(() => this.sync(), syncIntervalMs);
  }

  isAllowed(clientId: string): boolean {
    const tokens = this.localTokens.get(clientId) ?? this.capacity;
    if (tokens < 1) return false;
    this.localTokens.set(clientId, tokens - 1);
    return true;
  }

  private async sync(): Promise<void> {
    // For each client tracked locally, subtract consumed tokens from Redis
    for (const [clientId, localTokens] of this.localTokens.entries()) {
      const consumed = this.capacity - localTokens;
      if (consumed > 0) {
        // DECRBY is atomic, so concurrent syncs from other servers are safe
        await this.redis.decrby(`tokens:${clientId}`, consumed);
        this.localTokens.delete(clientId);
      }
    }
  }
}

Trade-off: Allows small overages within the sync window (100ms × rate). For a 1000 req/s limit with 10 API servers and 100ms sync, worst case overage = 10 servers × 100ms × 1000/s = 1000 extra requests. Acceptable for most use cases.
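The worst-case overage is simple arithmetic; encoding it makes the trade-off explicit when tuning the sync interval (a small illustrative helper):

```typescript
// Worst-case extra requests admitted across the fleet within one sync window:
// each server can independently serve at the full rate until the next sync.
function worstCaseOverage(
  servers: number,
  syncIntervalMs: number,
  limitPerSecond: number
): number {
  return (servers * syncIntervalMs * limitPerSecond) / 1000;
}

// The example above: 10 servers, 100ms sync, 1000 req/s limit.
const overage = worstCaseOverage(10, 100, 1000); // 1000 extra requests
```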

Used by: Cloudflare’s internal rate limiting at edge (they do local counting + periodic sync to reduce inter-datacenter coordination).


Rate Limit Headers

Communicate limit state to clients so they can self-throttle:

HTTP/1.1 200 OK
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 487
X-RateLimit-Reset: 1705315200     # Unix timestamp: when window resets
X-RateLimit-Policy: 1000;w=3600  # IETF draft: 1000 per 3600s window

# On 429:
HTTP/1.1 429 Too Many Requests
Retry-After: 47                   # Seconds until client should retry
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1705315200

IETF RateLimit Headers (draft-ietf-httpapi-ratelimit-headers):

RateLimit-Limit: 100
RateLimit-Remaining: 42
RateLimit-Reset: 30       # Seconds until reset (relative, not absolute)

This draft spec is gaining adoption across API gateways and frameworks. Prefer it for new APIs.

429 response body should include machine-readable detail:

{
  "type": "https://api.example.com/errors/rate-limit-exceeded",
  "title": "Rate Limit Exceeded",
  "status": 429,
  "detail": "You have exceeded the 1000 requests/hour limit.",
  "retryAfter": 47,
  "limit": 1000,
  "window": "1 hour"
}
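A small helper keeps the two header conventions consistent (a hypothetical function; the header names and semantics follow the examples above):

```typescript
// Build both legacy X-RateLimit-* and IETF draft RateLimit-* headers.
function rateLimitHeaders(
  limit: number,
  remaining: number,
  resetEpochSeconds: number,
  nowEpochSeconds: number
): Record<string, string> {
  const secondsUntilReset = Math.max(0, resetEpochSeconds - nowEpochSeconds);
  const headers: Record<string, string> = {
    "X-RateLimit-Limit": String(limit),
    "X-RateLimit-Remaining": String(remaining),
    "X-RateLimit-Reset": String(resetEpochSeconds), // absolute Unix timestamp
    "RateLimit-Limit": String(limit),
    "RateLimit-Remaining": String(remaining),
    "RateLimit-Reset": String(secondsUntilReset),   // relative seconds (draft spec)
  };
  // On exhaustion, tell the client when to come back.
  if (remaining === 0) headers["Retry-After"] = String(secondsUntilReset);
  return headers;
}

const h = rateLimitHeaders(1000, 0, 1705315200, 1705315153);
```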

Where to Enforce Rate Limits

Internet ──► [Load Balancer] ──► [API Gateway] ──► [App Service] ──► [Database]
                  L4                  L7               App layer

Layer 4: Load Balancer (L4 — TCP/UDP)

Can enforce connection rate limits and bandwidth limits but cannot inspect HTTP content. Useful for blocking abusive IPs at the TCP level (SYN flood protection).

# nginx: limit connection rate from single IP
limit_conn_zone $binary_remote_addr zone=conn_limit:10m;
limit_conn conn_limit 20;  # max 20 concurrent connections per IP

Layer 7: API Gateway (L7 — HTTP)

The right layer for request rate limiting. Can inspect HTTP headers, paths, and authentication context. Apply per-path, per-method, per-API-key limits here.

Benefits: Centralised enforcement; app services don’t need to implement it. Enforced before request reaches app logic (fail fast).

Application Layer

Necessary when rate limiting depends on business logic — e.g., “limit to 3 payment attempts per order” or “limit failed login attempts to 5 per account per 15 minutes.” The API gateway doesn’t have the business context to enforce these.

Rule: Enforce at the highest applicable layer. Only push to app layer when business logic is required.


Rate Limiting Granularity

Granularity         | Pros                                          | Cons                                         | Use Case
--------------------|-----------------------------------------------|----------------------------------------------|-----------------------------------------------
IP Address          | No auth required, stops unauthenticated abuse | Shared IPs (NAT, corporate) block many users | Unauthenticated endpoints, login rate limiting
API Key             | Identifies client application                 | Single key shared among users                | Public APIs, partner integrations
User Account        | Fairest for multi-user clients                | Requires authentication                      | Authenticated user actions
Endpoint            | Protects expensive operations specifically    | Complex to manage many rules                 | /search, /export, AI inference endpoints
Organisation/Tenant | Enforces business plan limits                 | Doesn't protect against single rogue user    | SaaS tier limits

Layered approach (recommended):

Request comes in:
1. IP-level: max 10,000 req/min per IP           (DDoS protection)
2. API key-level: max 1,000 req/min per key      (abuse protection)
3. User-level: max 100 req/min per user account  (fairness)
4. Endpoint-level: /search max 20 req/min        (expensive resource)

All four layers can fire. Client gets the most restrictive limit that applies.
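Layering is just running every applicable check and failing on the first rejection. A sketch using in-memory fixed-window counters (keys and limits mirror the example; not a production design):

```typescript
interface Rule {
  layer: string; // e.g. "ip", "user", "endpoint"
  limit: number;
  windowMs: number;
}

class LayeredLimiter {
  private counts = new Map<string, number>();

  constructor(private rules: Rule[]) {}

  isAllowed(ids: Record<string, string>, nowMs: number): boolean {
    const keys: string[] = [];
    // Check every layer first so a rejection at one layer
    // doesn't consume quota at the others.
    for (const rule of this.rules) {
      const id = ids[rule.layer];
      if (id === undefined) continue; // layer not applicable to this request
      const window = Math.floor(nowMs / rule.windowMs);
      const key = `${rule.layer}:${id}:${window}`;
      if ((this.counts.get(key) ?? 0) >= rule.limit) return false;
      keys.push(key);
    }
    // All layers passed: consume one unit at each.
    for (const key of keys) this.counts.set(key, (this.counts.get(key) ?? 0) + 1);
    return true;
  }
}

const layered = new LayeredLimiter([
  { layer: "ip", limit: 10_000, windowMs: 60_000 },
  { layer: "user", limit: 100, windowMs: 60_000 },
  { layer: "endpoint", limit: 20, windowMs: 60_000 },
]);

// The endpoint limit (20/min) bites first for /search,
// even though the IP and user layers would allow more.
let ok = 0;
for (let i = 0; i < 25; i++) {
  if (layered.isAllowed({ ip: "1.2.3.4", user: "u1", endpoint: "/search" }, 0)) ok++;
}
```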


Handling Distributed State: The Race Condition Problem

Two API servers, one Redis, limit = 10 req/min:

Server A reads count = 9 ──────────────────────────────┐
Server B reads count = 9 ──────────────────┐           │
Server B increments to 10, returns allowed │           │
                                           ├──► Both passed!
Server A increments to 11, returns allowed │    Count is now 11

Solutions:

  1. Lua scripts (shown above) — atomic at Redis level
  2. Redis transactions (MULTI/EXEC) — optimistic locking, retry on conflict
  3. Redis modules (RedisCell) — implements GCRA (Generic Cell Rate Algorithm) natively in Redis C module, extremely fast

# RedisCell exposes one command:
#   CL.THROTTLE <key> <max_burst> <count_per_period> <period> [<quantity>]
CL.THROTTLE clientId 100 10 1 1
# key = clientId, max burst = 100, rate = 10 requests per 1-second period,
# quantity = 1 token consumed by this call

# Returns: [limited (0 = allowed, 1 = rejected), total limit, remaining,
#           retry_after (seconds; -1 if allowed), reset_after (seconds)]
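GCRA itself is compact: instead of counters or logs, it tracks a single "theoretical arrival time" (TAT) per client. A minimal in-memory sketch of the idea (not RedisCell's actual implementation):

```typescript
// GCRA: allow if now >= TAT - (burst - 1) * emissionInterval;
// advance TAT by one emission interval on success.
class GCRA {
  private tatMs = 0; // theoretical arrival time of the next conforming request

  constructor(
    private emissionIntervalMs: number, // period / rate, e.g. 100ms for 10 req/s
    private burst: number               // max burst size
  ) {}

  isAllowed(nowMs: number): boolean {
    const tat = Math.max(this.tatMs, nowMs);
    const allowAtMs = tat - this.emissionIntervalMs * (this.burst - 1);
    if (nowMs < allowAtMs) return false;
    this.tatMs = tat + this.emissionIntervalMs;
    return true;
  }
}

// 10 req/s with burst 100 — the same shape as the CL.THROTTLE example above.
const gcra = new GCRA(100, 100);
let gcraBurst = 0;
for (let i = 0; i < 100; i++) if (gcra.isAllowed(0)) gcraBurst++;
const rejected = gcra.isAllowed(0);     // burst exhausted
const recovered = gcra.isAllowed(100);  // one emission interval later
```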

Rate Limiting Patterns for Specific Use Cases

Login Rate Limiting

// Different limits for different failure scenarios
async function checkLoginRateLimit(ip: string, username: string): Promise<void> {
  // Limit by IP: prevents credential stuffing
  const ipAllowed = await limiter.check(`login:ip:${ip}`, { limit: 20, window: '15m' });
  
  // Limit by username: prevents targeted brute force
  const userAllowed = await limiter.check(`login:user:${username}`, { limit: 5, window: '15m' });
  
  if (!ipAllowed || !userAllowed) {
    throw new RateLimitError('Too many login attempts. Try again in 15 minutes.');
  }
}

// On successful login: reset the per-user counter
await redis.del(`login:user:${username}`);

Exponential Backoff with Jitter (Client Side)

async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxRetries = 5
): Promise<T> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (error instanceof RateLimitError && attempt < maxRetries) {
        // Honor the server's Retry-After as a floor, then add jittered
        // exponential backoff on top
        const retryAfterMs = (error.retryAfter ?? 0) * 1000;
        const backoff = Math.min(
          Math.pow(2, attempt) * 1000,
          30000 // cap the exponential component at 30 seconds
        );
        const jitter = Math.random() * backoff; // full jitter
        await sleep(retryAfterMs + jitter);
        continue;
      }
      throw error;
    }
  }
  throw new Error('Max retries exceeded');
}

Full jitter (sleeping a uniformly random duration up to the computed backoff) is preferred over a fixed or merely additive delay — it avoids a thundering herd when many clients retry simultaneously after a rate limit reset.


Interview Checklist

Algorithm Fundamentals

  • Explain the fixed window counter. What is the boundary burst problem?
  • How does the sliding window log algorithm work? What is its memory complexity?
  • How does the sliding window counter approximate the sliding window? What is the max error?
  • Explain the token bucket algorithm. How does it handle bursts differently from leaky bucket?
  • Which algorithm does Stripe use? Which does Cloudflare use?

Distributed Implementation

  • Why must rate limit operations be atomic? Show the race condition
  • How do Redis Lua scripts solve the atomicity problem?
  • Design a rate limiter using Redis sorted sets for exact sliding window
  • How would you implement rate limiting for 10M RPS without Redis becoming a bottleneck?

Architecture

  • At what layer should you enforce rate limits? LB vs API gateway vs app?
  • Compare rate limiting by IP vs API key vs user account
  • How do you rate limit unauthenticated endpoints without API keys?
  • Design rate limiting for a login endpoint to prevent both credential stuffing and brute force

Client-Side

  • What headers should a rate-limited response include?
  • Implement exponential backoff with jitter. Why is jitter necessary?
  • A client receives Retry-After: 60. What should it do?

Senior-Level Design

  • Design the rate limiting system for Stripe’s public API (100+ req/s per key, millions of keys)
  • How would you implement rate limiting at Cloudflare’s scale (25M requests/second)?
  • How do you handle rate limiting when your service is behind a shared NAT?
  • Design a rate limiter that supports different limits per pricing tier (Free: 100/h, Pro: 10k/h, Enterprise: unlimited)