System Design · Topic 8 of 18

Rate Limiting

125 XP

Why Rate Limiting Exists

Rate limiting is not just a defensive measure — it's a first-class systems design concern that touches availability, fairness, cost control, and security simultaneously.

The four reasons you rate limit:

  1. DDoS and abuse prevention — A single client sending 100,000 requests/second can saturate your backend. Even well-intentioned clients with buggy retry logic can create accidental DDoS.

  2. Cost control — Every request costs compute, database reads, third-party API calls (OpenAI, payment processors). A single runaway script can generate thousands of dollars in cloud bills overnight.

  3. Fairness — In a multi-tenant system, one power user consuming 90% of capacity degrades service for everyone else. Rate limits enforce equitable resource sharing.

  4. SLA protection — Downstream services (payment gateways, SMS providers) have their own rate limits. Your rate limiter protects you from propagating failures upstream.

Real numbers:

  • GitHub: 5,000 requests/hour per authenticated user, 60/hour unauthenticated
  • Stripe: 100 requests/second per secret key (burst up to 200)
  • Twitter API v2: varies by tier, 500k tweets/month at basic
  • Cloudflare: rate limiting rules configurable to any threshold, enforced at edge

Algorithm 1: Fixed Window Counter

The simplest algorithm. Divide time into fixed windows (1 minute, 1 hour). Count requests in the current window. Reject if count exceeds limit.

Timeline (limit: 10 req/min):

Minute 1 [00:00 - 00:59]   |██████████░░| 10 requests — full
Minute 2 [01:00 - 01:59]   |░░░░░░░░░░░░| 0 requests — reset
async function isAllowed(clientId: string, limit: number): Promise<boolean> {
  const windowKey = `ratelimit:${clientId}:${Math.floor(Date.now() / 60000)}`;
  
  const count = await redis.incr(windowKey);
  
  if (count === 1) {
    // Set expiry only on first increment to avoid race condition
    await redis.expire(windowKey, 60);
  }
  
  return count <= limit;
}

The boundary burst problem (critical flaw):

Limit: 10 req/min

00:59 — client sends 10 requests  (window 1: count = 10, full)
01:00 — window resets
01:01 — client sends 10 more requests (window 2: count = 10)

In 2 seconds, the client sent 20 requests — 2x the limit
with zero rate limiting penalty.

This is the boundary burst: at every window boundary, an attacker can send 2× the limit. This is exploitable and means fixed window does not guarantee your stated limit.

When to use: Internal admin tools, low-stakes APIs, or when simplicity outweighs precision. Never use for security-critical rate limiting.

Space: O(1) per client (single counter). Time: O(1). Redis ops: 2 (INCR + EXPIRE on first request).


Algorithm 2: Sliding Window Log

Maintain a timestamped log of every request. On each request, remove entries older than the window, count remaining entries.

async function isAllowed(
  clientId: string, 
  limit: number, 
  windowMs: number
): Promise<boolean> {
  const now = Date.now();
  const windowStart = now - windowMs;
  const key = `ratelimit:log:${clientId}`;

  // Atomic Lua script: remove old entries, add current, count
  const count = await redis.eval(`
    local key = KEYS[1]
    local window_start = ARGV[1]
    local now = ARGV[2]
    local limit = tonumber(ARGV[3])
    local window_ms = tonumber(ARGV[4])
    
    -- Remove all entries older than window
    redis.call('ZREMRANGEBYSCORE', key, '-inf', window_start)
    
    -- Add current request with score = timestamp
    redis.call('ZADD', key, now, now)
    
    -- Set expiry
    redis.call('PEXPIRE', key, window_ms)
    
    -- Count requests in window
    return redis.call('ZCARD', key)
  `, 1, key, windowStart, now, limit, windowMs) as number;

  return count <= limit;
}

Accurate but memory-expensive:

1 client, 1000 req/min limit, each entry = 16 bytes:
  Per client: 1000 × 16 bytes = 16 KB
  At 1M clients: 16 GB RAM just for rate limit state

Compare to fixed window:
  Per client: 8 bytes (single integer)
  At 1M clients: 8 MB

When to use: When accuracy is paramount and client count is bounded. Suitable for per-user limits where you have <10K active users. Not viable for high-cardinality public APIs.


Algorithm 3: Sliding Window Counter (Approximation)

The clever middle ground. Uses two fixed windows and weights them by overlap with the current sliding window.

Window size: 60 seconds
Current time: 01:15 (i.e., 15 seconds into minute 1)
Previous window (00:00–00:59): 8 requests
Current window (01:00–01:59): 3 requests

Weight of previous window = (60 - 15) / 60 = 0.75

Approximate count = 8 × 0.75 + 3 × 1.0
                  = 6 + 3 = 9 requests in sliding window
async function isAllowed(
  clientId: string,
  limit: number,
  windowMs: number
): Promise<boolean> {
  const now = Date.now();
  const windowStart = Math.floor(now / windowMs) * windowMs;
  const prevWindowStart = windowStart - windowMs;
  
  const currentKey = `ratelimit:${clientId}:${windowStart}`;
  const prevKey = `ratelimit:${clientId}:${prevWindowStart}`;

  const [currentCount, prevCount] = await redis.mget(currentKey, prevKey);
  
  const current = parseInt(currentCount ?? '0');
  const prev = parseInt(prevCount ?? '0');
  
  // How far into the current window are we?
  const elapsed = now - windowStart;
  const prevWeight = (windowMs - elapsed) / windowMs;
  
  const approximateCount = prev * prevWeight + current;
  
  if (approximateCount >= limit) {
    return false;
  }

  // Increment current window
  await redis.multi()
    .incr(currentKey)
    .expire(currentKey, Math.ceil(windowMs / 1000) * 2)
    .exec();

  return true;
}

Error bound: This approximation assumes requests in the previous window were uniformly distributed — which is usually true at scale. The maximum error is <1% of the limit in practice (Cloudflare found this acceptable for their use case). Memory cost: O(2) per client — just two counters.

Used by: Cloudflare's rate limiting, Redis's own rate limiting documentation.


Algorithm 4: Token Bucket

The most widely used algorithm in production systems. Conceptually: a bucket holds tokens. Each request consumes one token. Tokens refill at a fixed rate. If the bucket is empty, reject the request.

Bucket capacity: 100 tokens
Refill rate:     10 tokens/second

t=0:    bucket = 100  (full)
t=0:    burst of 100 requests → bucket = 0
t=1:    10 tokens added → bucket = 10
t=1:    10 requests → bucket = 0
t=5:    50 tokens added → bucket = 50

Key insight: Token bucket allows bursting up to the bucket capacity. A client can send 100 requests instantly, then is throttled to 10/second. This is ideal for real-world clients that have bursty-but-bounded traffic.

interface TokenBucket {
  tokens: number;
  lastRefillTime: number;
}

class TokenBucketRateLimiter {
  constructor(
    private readonly redis: Redis,
    private readonly capacity: number,
    private readonly refillRatePerSecond: number
  ) {}

  async isAllowed(clientId: string): Promise<{ allowed: boolean; tokensRemaining: number }> {
    const key = `tokenbucket:${clientId}`;
    const now = Date.now();

    // Use Lua script for atomicity — read-modify-write must be atomic
    const result = await this.redis.eval(`
      local key = KEYS[1]
      local now = tonumber(ARGV[1])
      local capacity = tonumber(ARGV[2])
      local refill_rate = tonumber(ARGV[3])
      
      local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
      local tokens = tonumber(bucket[1]) or capacity
      local last_refill = tonumber(bucket[2]) or now
      
      -- Calculate tokens to add since last refill
      local elapsed_seconds = (now - last_refill) / 1000
      local tokens_to_add = elapsed_seconds * refill_rate
      tokens = math.min(capacity, tokens + tokens_to_add)
      
      local allowed = 0
      if tokens >= 1 then
        tokens = tokens - 1
        allowed = 1
      end
      
      redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
      redis.call('EXPIRE', key, 3600)
      
      return {allowed, math.floor(tokens)}
    `, 1, key, now, this.capacity, this.refillRatePerSecond) as [number, number];

    return {
      allowed: result[0] === 1,
      tokensRemaining: result[1]
    };
  }
}

Properties:

  • Handles bursts (up to bucket capacity)
  • Smooth long-term rate (limited by refill rate)
  • O(1) space per client, O(1) time
  • 1 Redis round-trip (Lua script atomic)

Used by: Stripe (they've written publicly about using token bucket), AWS API Gateway, many service meshes.


Algorithm 5: Leaky Bucket

A queue with a fixed-rate drain. Requests enter the queue; they exit (are processed) at a constant rate. If the queue is full, reject.

Queue (capacity 10):   [req1][req2][req3]...
Drain rate:            5 requests/second

Requests processed at exactly 5/sec, regardless of arrival pattern.

Key difference from token bucket: Token bucket allows variable processing rate (bursting). Leaky bucket enforces a perfectly constant output rate. No bursting.

Use case: Rate limiting outgoing calls to a downstream service with strict rate limits (SMS gateway, payment processor). You want to smooth out your request pattern, not just cap it.

Tradeoff: Queued requests add latency. Full queue = dropped requests without retrying from queue drain.


Algorithm Comparison

| Algorithm | Burst Handling | Memory | Accuracy | Complexity | |---|---|---|---|---| | Fixed Window | Allows 2× at boundary | O(1) | Low | Very low | | Sliding Window Log | Not allowed | O(n) requests | Exact | Medium | | Sliding Window Counter | Approximate smooth | O(1) | ~99% | Low | | Token Bucket | Allows bursts up to capacity | O(1) | Exact | Medium | | Leaky Bucket | No burst, constant output | O(n) queue | Exact | Medium |

Real company choices:

  • Stripe: Token bucket (public reference in their rate limiting blog post)
  • Cloudflare: Sliding window counter (efficient at 20+ trillion requests/day)
  • GitHub: Fixed window (simple, documented in API docs)
  • Nginx: Leaky bucket (limit_req module uses leaky bucket)
  • AWS API Gateway: Token bucket
  • Shopify: Leaky bucket (they've written about this publicly)

Distributed Rate Limiting with Redis

Single-server rate limiting is trivial. The hard problem is rate limiting across a fleet of API servers.

Naive Approach: Centralized Redis

All API servers talk to one Redis instance for rate limit state:

API Server 1 ──┐
API Server 2 ──┼──► Redis ──► rate limit state
API Server 3 ──┘

Works well for most use cases. Redis handles ~100K operations/second on a single node. At higher throughput, Redis can become a bottleneck.

Lua Scripts for Atomicity

Every rate limiting operation must be read-modify-write atomic. Non-atomic implementation:

// WRONG: Race condition
const count = await redis.get(key);          // Read
if (count < limit) {
  await redis.incr(key);                     // Write
  return true;                               // Two API servers can both pass this check
}

Two API servers executing concurrently both read count = 9 for a limit = 10, both pass, both increment → actual count becomes 11. Rate limit violated.

Fix: Redis Lua scripts run atomically (single-threaded Redis execution):

-- Atomic INCR + check
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])

local count = redis.call('INCR', key)
if count == 1 then
  redis.call('EXPIRE', key, window)
end

if count > limit then
  return 0  -- rejected
else
  return 1  -- allowed
end

Sliding Window with Redis Sorted Sets

For exact sliding window, use a Redis sorted set with timestamp as score:

async function slidingWindowCheck(
  clientId: string,
  limit: number,
  windowMs: number
): Promise<boolean> {
  const key = `sw:${clientId}`;
  const now = Date.now();
  const windowStart = now - windowMs;
  
  // Pipeline: remove old, add current, count — executed server-side
  const pipeline = redis.pipeline();
  pipeline.zremrangebyscore(key, '-inf', windowStart);
  pipeline.zadd(key, now, `${now}-${Math.random()}`);  // unique member
  pipeline.zcard(key);
  pipeline.expire(key, Math.ceil(windowMs / 1000));
  
  const results = await pipeline.exec();
  const count = results[2][1] as number;
  
  return count <= limit;
}

Note: Redis pipeline is not atomic (commands can be interleaved with other clients). For exact correctness, use Lua script. Pipelines reduce round-trips but don't guarantee atomicity.

Local Cache + Periodic Sync (High Throughput)

At extreme scale (millions of RPS per service), even Redis becomes a bottleneck. Solution: synchronize rate limit state periodically rather than per-request.

class DistributedRateLimiter {
  private localTokens = new Map<string, number>();
  private syncInterval: ReturnType<typeof setInterval>;

  constructor(
    private redis: Redis,
    private capacity: number,
    private syncIntervalMs: number = 100
  ) {
    // Sync local state with Redis every 100ms
    this.syncInterval = setInterval(() => this.sync(), syncIntervalMs);
  }

  isAllowed(clientId: string): boolean {
    const tokens = this.localTokens.get(clientId) ?? this.capacity;
    if (tokens < 1) return false;
    this.localTokens.set(clientId, tokens - 1);
    return true;
  }

  private async sync(): Promise<void> {
    // For each client tracked locally, subtract consumed tokens from Redis
    for (const [clientId, localTokens] of this.localTokens.entries()) {
      const consumed = this.capacity - localTokens;
      if (consumed > 0) {
        // Atomic: subtract consumed, get remaining
        await redis.decrby(`tokens:${clientId}`, consumed);
        this.localTokens.delete(clientId);
      }
    }
  }
}

Trade-off: Allows small overages within the sync window (100ms × rate). For a 1000 req/s limit with 10 API servers and 100ms sync, worst case overage = 10 servers × 100ms × 1000/s = 1000 extra requests. Acceptable for most use cases.

Used by: Cloudflare's internal rate limiting at edge (they do local counting + periodic sync to reduce inter-datacenter coordination).


Rate Limit Headers

Communicate limit state to clients so they can self-throttle:

HTTP/1.1 200 OK
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 487
X-RateLimit-Reset: 1705315200     # Unix timestamp: when window resets
X-RateLimit-Policy: 1000;w=3600  # IETF draft: 1000 per 3600s window

# On 429:
HTTP/1.1 429 Too Many Requests
Retry-After: 47                   # Seconds until client should retry
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1705315200

IETF RateLimit Headers (draft-ietf-httpapi-ratelimit-headers):

RateLimit-Limit: 100
RateLimit-Remaining: 42
RateLimit-Reset: 30       # Seconds until reset (relative, not absolute)

This draft spec is gaining adoption (FastAPI has built-in support). Prefer this for new APIs.

429 response body should include machine-readable detail:

{
  "type": "https://api.example.com/errors/rate-limit-exceeded",
  "title": "Rate Limit Exceeded",
  "status": 429,
  "detail": "You have exceeded the 1000 requests/hour limit.",
  "retryAfter": 47,
  "limit": 1000,
  "window": "1 hour"
}

Where to Enforce Rate Limits

Internet ──► [Load Balancer] ──► [API Gateway] ──► [App Service] ──► [Database]
                  L4                  L7               App layer

Layer 4: Load Balancer (L4 — TCP/UDP)

Can enforce connection rate limits and bandwidth limits but cannot inspect HTTP content. Useful for blocking abusive IPs at the TCP level (SYN flood protection).

# nginx: limit connection rate from single IP
limit_conn_zone $binary_remote_addr zone=conn_limit:10m;
limit_conn conn_limit 20;  # max 20 concurrent connections per IP

Layer 7: API Gateway (L7 — HTTP)

The right layer for request rate limiting. Can inspect HTTP headers, paths, and authentication context. Apply per-path, per-method, per-API-key limits here.

Benefits: Centralised enforcement; app services don't need to implement it. Enforced before request reaches app logic (fail fast).

Application Layer

Necessary when rate limiting depends on business logic — e.g., "limit to 3 payment attempts per order" or "limit failed login attempts to 5 per account per 15 minutes." The API gateway doesn't have the business context to enforce these.

Rule: Enforce at the highest applicable layer. Only push to app layer when business logic is required.


Rate Limiting Granularity

| Granularity | Pros | Cons | Use Case | |---|---|---|---| | IP Address | No auth required, stops unauthenticated abuse | Shared IPs (NAT, corporate) can block many users | Unauthenticated endpoints, login rate limiting | | API Key | Identifies client application | Single key shared among users | Public APIs, partner integrations | | User Account | Fairest for multi-user clients | Requires authentication | Authenticated user actions | | Endpoint | Protects expensive operations specifically | Complex to manage many rules | /search, /export, AI inference endpoints | | Organisation/Tenant | Enforces business plan limits | Doesn't protect against single rogue user | SaaS tier limits |

Layered approach (recommended):

Request comes in:
1. IP-level: max 10,000 req/min per IP           (DDoS protection)
2. API key-level: max 1,000 req/min per key      (abuse protection)
3. User-level: max 100 req/min per user account  (fairness)
4. Endpoint-level: /search max 20 req/min        (expensive resource)

All four layers can fire. Client gets the most restrictive limit that applies.


Handling Distributed State: The Race Condition Problem

Two API servers, one Redis, limit = 10 req/min:

Server A reads count = 9 ──────────────────────────────┐
Server B reads count = 9 ──────────────────┐            │
Server B increments to 10, returns allowed  │            │
                                            ├──►  Both passed!
Server A increments to 10, returns allowed  │    Count is now 11

Solutions:

  1. Lua scripts (shown above) — atomic at Redis level
  2. Redis transactions (MULTI/EXEC) — optimistic locking, retry on conflict
  3. Redis modules (RedisCell) — implements GCRA (Generic Cell Rate Algorithm) natively in Redis C module, extremely fast
# RedisCell: CL.THROTTLE key max_burst count_per_period period [quantity]
CL.THROTTLE clientId 100 10 1 1
#                      ^   ^  ^ ^
#              capacity|   |  | quantity to consume
#                   rate|  period (seconds)
#              (requests/period)

# Returns: [allowed(0=yes,1=no), total, remaining, retry_after_ms, reset_after_ms]

Rate Limiting Patterns for Specific Use Cases

Login Rate Limiting

// Different limits for different failure scenarios
async function checkLoginRateLimit(ip: string, username: string): Promise<void> {
  // Limit by IP: prevents credential stuffing
  const ipAllowed = await limiter.check(`login:ip:${ip}`, { limit: 20, window: '15m' });
  
  // Limit by username: prevents targeted brute force
  const userAllowed = await limiter.check(`login:user:${username}`, { limit: 5, window: '15m' });
  
  if (!ipAllowed || !userAllowed) {
    throw new RateLimitError('Too many login attempts. Try again in 15 minutes.');
  }
}

// On successful login: reset the per-user counter
await redis.del(`login:user:${username}`);

Exponential Backoff with Jitter (Client Side)

async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxRetries = 5
): Promise<T> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (error instanceof RateLimitError && attempt < maxRetries) {
        const retryAfter = error.retryAfter ?? 0;
        
        // Exponential backoff with full jitter
        const backoff = Math.min(
          retryAfter * 1000 + Math.pow(2, attempt) * 1000,
          30000 // max 30 seconds
        );
        const jitter = Math.random() * backoff;
        
        await sleep(jitter);
        continue;
      }
      throw error;
    }
  }
  throw new Error('Max retries exceeded');
}

Full jitter (multiply by random) is preferred over additive jitter — it avoids thundering herd when many clients retry simultaneously after a rate limit reset.


Interview Checklist

Algorithm Fundamentals

  • [ ] Explain the fixed window counter. What is the boundary burst problem?
  • [ ] How does the sliding window log algorithm work? What is its memory complexity?
  • [ ] How does the sliding window counter approximate the sliding window? What is the max error?
  • [ ] Explain the token bucket algorithm. How does it handle bursts differently from leaky bucket?
  • [ ] Which algorithm does Stripe use? Which does Cloudflare use?

Distributed Implementation

  • [ ] Why must rate limit operations be atomic? Show the race condition
  • [ ] How do Redis Lua scripts solve the atomicity problem?
  • [ ] Design a rate limiter using Redis sorted sets for exact sliding window
  • [ ] How would you implement rate limiting for 10M RPS without Redis becoming a bottleneck?

Architecture

  • [ ] At what layer should you enforce rate limits? LB vs API gateway vs app?
  • [ ] Compare rate limiting by IP vs API key vs user account
  • [ ] How do you rate limit unauthenticated endpoints without API keys?
  • [ ] Design rate limiting for a login endpoint to prevent both credential stuffing and brute force

Client-Side

  • [ ] What headers should a rate-limited response include?
  • [ ] Implement exponential backoff with jitter. Why is jitter necessary?
  • [ ] A client receives Retry-After: 60. What should it do?

Senior-Level Design

  • [ ] Design the rate limiting system for Stripe's public API (100+ req/s per key, millions of keys)
  • [ ] How would you implement rate limiting at Cloudflare's scale (25M requests/second)?
  • [ ] How do you handle rate limiting when your service is behind a shared NAT?
  • [ ] Design a rate limiter that supports different limits per pricing tier (Free: 100/h, Pro: 10k/h, Enterprise: unlimited)