System Design · Topic 17 of 18

Design: Real-Time Chess Platform

200 XP

You're designing a real-time multiplayer chess platform at lichess.org / chess.com scale. The site already has a single-player browser game (minimax AI). This design covers scaling that into a full multiplayer platform: matchmaking, live games, spectators, ELO rating, game history, and engine integration — all at 1M concurrent games.

This is a classic FAANG system design interview question. The hard parts are stateful WebSocket routing, server-authoritative game clocks, Stockfish process pooling, and consistent game recovery after crashes.


1. Requirements Clarification

Nail down scope before drawing boxes. Every interviewer has a different idea of what "chess platform" means.

Functional requirements

| Feature | Notes | |---|---| | 1v1 live games | Full chess rules, all time controls | | AI opponent | Minimax (browser) or Stockfish (server), multiple difficulty levels | | Matchmaking | Pair by ELO ± window, by time control | | Spectators | Watch any live game, low latency | | Game history & replay | Move-by-move replay, opening classification | | ELO rating | Per time control (bullet/blitz/rapid/classical), K-factor by rating band | | Tournaments | Swiss/round-robin brackets, arena (stretch) |

Non-functional requirements

| Metric | Target | |---|---| | Concurrent games | 1M | | Move latency | <50ms p99 globally | | DAU | 10M | | Uptime for active games | 99.99% (52 min downtime/year) | | Game history durability | 99.9999% — never lose a completed game |

Out of scope (say this explicitly in interviews)

  • Video/voice chat
  • Puzzles, lessons, analysis board
  • Mobile apps (assume web clients)
  • Payment / subscriptions

2. Capacity Estimation

Do the math on the whiteboard. It justifies every architectural decision that follows.

Connections & traffic

Active games:         1,000,000 concurrent
WebSocket connections: 2M players + ~500K spectators = ~2.5M WS connections

Move rate:
  1M games × avg 1 move per 5s = 200,000 moves/sec

Peak (tournament burst):
  3× baseline = 600,000 moves/sec

Storage

Moves per game:    ~40 moves average (80 half-moves / plies)
Games per day:     1M games/day (10M DAU × ~0.1 games/session)

Raw move storage:
  40 moves × 100 bytes/move × 1M games = 4 GB/day

PGN storage (compressed):
  1M games × 5 KB/game = 5 GB/day → ~15 GB/day with metadata

After 1 year:        ~5 TB raw game data
With 3× replication: ~15 TB
Cold storage (S3):   Compress 10:1 → 1.5 TB/year

ELO updates

1M games/day × 2 players = 2M ELO updates/day
= ~23 updates/sec — trivially handled by any RDBMS

Game servers

Each game server: 32 vCPUs, 128 GB RAM
In-memory game state: ~50 KB per active game (FEN + move history + clocks)
Games per server: 128 GB × 0.6 / 50 KB ≈ 1,536,000 → cap at 5,000 (CPU bound)
Servers needed: 1M / 5K = 200 game servers
  + 50% headroom = 300 game servers

3. Chess Engine Integration

The existing Wizard's Chess uses client-side minimax. Here's the full engine strategy at platform scale.

Three-tier engine architecture

Tier 1: Browser minimax (existing)
  - TypeScript, alpha-beta pruning, depth 4-6
  - Zero server cost, works offline
  - Strength: ~1000 ELO equivalent
  - Use for: "play vs computer" casual mode

Tier 2: WASM Stockfish (browser)
  - stockfish.wasm bundled with the app (~6 MB)
  - Runs in Web Worker, SharedArrayBuffer for comms
  - Depth 12-15, strength ~2200 ELO
  - Use for: analysis board, "easy/medium" AI games
  - How lichess does it: stockfish.js + WASM SIMD where available

Tier 3: Server-side Stockfish (C++ subprocess)
  - Depth 18-22, strength 3200+ ELO
  - UCI protocol over stdin/stdout
  - Use for: "hard AI" games, anti-cheat analysis

Server-side Stockfish pool (Rust/tokio)

The critical insight: never spawn a new Stockfish process per request. Stockfish startup is ~500ms. Instead, maintain a warm pool and reuse processes across games.

// Engine pool — one pool per difficulty level
struct EnginePool {
    engines: Vec<Arc<Mutex<StockfishProcess>>>,
    queue: Arc<Mutex<VecDeque<MoveRequest>>>,
}

impl EnginePool {
    async fn request_move(&self, fen: &str, depth: u8, timeout_ms: u64) -> String {
        // Acquire an idle engine from the pool
        let engine = self.acquire_engine().await;
        engine.send_command(&format!("position fen {}", fen)).await;
        engine.send_command(&format!("go depth {}", depth)).await;

        // Parse "bestmove e2e4 ponder e7e5" from UCI output
        let bestmove = engine.read_until_bestmove(timeout_ms).await;
        self.release_engine(engine).await;
        bestmove
    }
}
Pool sizing:
  "Hard AI" games: 1 Stockfish per 50 concurrent AI games (Stockfish thinks ~2s/move)
  1M AI games × 10% = 100K AI games
  Pool size: 100K / 50 = 2,000 Stockfish processes across all game servers
  Per server (300 servers): ~7 Stockfish processes
  RAM: 7 × 200 MB = 1.4 GB/server — acceptable

WASM Stockfish setup (browser)

// Spawn Stockfish in a Web Worker
const worker = new Worker('/stockfish.js')

function requestBestMove(fen: string, depth: number): Promise<string> {
  return new Promise((resolve) => {
    worker.postMessage(`position fen ${fen}`)
    worker.postMessage(`go depth ${depth}`)

    worker.onmessage = (e: MessageEvent<string>) => {
      if (e.data.startsWith('bestmove')) {
        // "bestmove e2e4 ponder e7e5"
        resolve(e.data.split(' ')[1])
      }
    }
  })
}

4. Real-Time Move Protocol

Transport choice: WebSocket wins

| Protocol | Latency | Overhead | Bidirectional | Verdict | |---|---|---|---|---| | HTTP Polling | High (poll interval) | Very high (full HTTP headers) | No | ❌ | | SSE | Low | Low | Server → client only | ❌ | | WebSocket | Lowest | Minimal (2-14 byte frames) | Yes | ✅ | | WebRTC | Lowest | High (STUN/TURN setup) | Yes | ❌ (P2P unreliable) |

At 200K moves/sec, HTTP polling would require ~10× more servers just for connection overhead.

Move message format

// Client → Server
interface MoveMessage {
  type: 'move'
  gameId: string
  seq: number        // monotonically increasing per game, prevents ghost moves
  from: string       // "e2"
  to: string         // "e4"
  promotion: 'q' | 'r' | 'b' | 'n' | null
  clientTs: number   // client timestamp for clock sync diagnostics
}

// Server → Client (broadcast to both players + spectators)
interface MoveBroadcast {
  type: 'move'
  gameId: string
  seq: number
  from: string
  to: string
  promotion: 'q' | 'r' | 'b' | 'n' | null
  fen: string        // authoritative FEN after move
  whiteClock: number // ms remaining
  blackClock: number // ms remaining
  serverTs: number   // server timestamp (authoritative)
}

// Server → Client (on illegal move)
interface MoveRejected {
  type: 'move_rejected'
  gameId: string
  seq: number
  reason: 'illegal_move' | 'not_your_turn' | 'game_over' | 'clock_expired'
  fen: string        // reset client to this FEN
}

// Server → Client (game over)
interface GameOver {
  type: 'game_over'
  gameId: string
  result: '1-0' | '0-1' | '1/2-1/2'
  reason: 'checkmate' | 'resignation' | 'timeout' | 'draw_agreement' | 'stalemate' | 'insufficient_material'
  eloChange: { white: number; black: number }
}

Server-side move validation — never trust the client

import { Chess } from 'chess.js'  // or equivalent Rust chess library

function validateAndApplyMove(
  game: GameState,
  msg: MoveMessage
): MoveBroadcast | MoveRejected {
  // 1. Sequence check — reject duplicate or out-of-order moves
  if (msg.seq !== game.expectedSeq) {
    return { type: 'move_rejected', reason: 'illegal_move', ...}
  }

  // 2. Turn check
  const isWhiteTurn = game.chess.turn() === 'w'
  if (isWhiteTurn && msg.playerId !== game.whitePlayerId) {
    return { type: 'move_rejected', reason: 'not_your_turn', ...}
  }

  // 3. Clock check — server is authoritative
  const elapsed = Date.now() - game.lastMoveTs
  if (elapsed > game.activeClock + CLOCK_GRACE_MS) {
    return { type: 'move_rejected', reason: 'clock_expired', ...}
  }

  // 4. Legality check via chess rules engine
  const result = game.chess.move({ from: msg.from, to: msg.to, promotion: msg.promotion })
  if (!result) {
    return { type: 'move_rejected', reason: 'illegal_move', ...}
  }

  // 5. Update clocks
  game.activeClock -= elapsed
  game.activeClock += game.increment
  game.lastMoveTs = Date.now()
  game.expectedSeq++

  return {
    type: 'move',
    fen: game.chess.fen(),
    whiteClock: game.whiteClock,
    blackClock: game.blackClock,
    ...
  }
}

Optimistic UI with rollback

// Client-side: render move immediately, revert if rejected
function handleLocalMove(from: string, to: string) {
  const savedFen = chessEngine.fen()

  // Optimistic: show move immediately
  chessEngine.move({ from, to })
  renderBoard(chessEngine.fen())

  // Send to server
  ws.send(JSON.stringify({ type: 'move', from, to, seq: localSeq++ }))

  // Set rollback timer — if no server ACK in 3s, something is wrong
  const rollbackTimer = setTimeout(() => {
    chessEngine.load(savedFen)
    renderBoard(savedFen)
  }, 3000)

  pendingMoves.set(localSeq, { rollbackTimer, savedFen })
}

// Server confirms or rejects
ws.onmessage = (event) => {
  const msg = JSON.parse(event.data)

  if (msg.type === 'move') {
    clearTimeout(pendingMoves.get(msg.seq)?.rollbackTimer)
    // Load authoritative FEN — corrects any drift
    chessEngine.load(msg.fen)
    updateClocks(msg.whiteClock, msg.blackClock)
  }

  if (msg.type === 'move_rejected') {
    const pending = pendingMoves.get(msg.seq)
    clearTimeout(pending?.rollbackTimer)
    chessEngine.load(msg.fen)  // reset to server state
    renderBoard(msg.fen)
  }
}

5. Stateful Game Server Architecture

The core problem: WebSockets are stateful

HTTP is stateless — any server can handle any request. WebSocket connections are long-lived and hold game state in memory. You can't freely round-robin them. A move from player A and player B for the same game must hit the same server process.

Solution: Sticky routing via consistent hashing

gameId → hash → game server index

gameId "abc123" → SHA256 → mod 300 = server #47
All WebSocket connections for game "abc123" route to server #47

The WebSocket gateway (nginx/Envoy) reads the gameId from the URL path (/ws/game/abc123) and uses a consistent hash upstream to select the game server. This is stateless at the gateway layer — no session affinity cookies needed.

In-memory game state

interface GameState {
  gameId: string
  whitePlayerId: string
  blackPlayerId: string
  chess: Chess                   // chess.js instance — current position
  moveHistory: MoveRecord[]
  whiteClock: number             // ms remaining
  blackClock: number             // ms remaining
  increment: number              // ms added per move
  lastMoveTs: number             // epoch ms of last move (for clock deduction)
  expectedSeq: number
  spectatorSockets: Set<WebSocket>
  whiteSocket: WebSocket | null  // null if disconnected
  blackSocket: WebSocket | null
  status: 'waiting' | 'active' | 'finished'
}

// 300 servers × 5,000 games × 50 KB = 75 GB total in-memory game state
// Fits comfortably in 128 GB RAM per server with headroom

Write-ahead log: every move goes to Kafka

Player move arrives
  → Server validates move
  → Update in-memory GameState
  → Broadcast to players + spectators (fast path, ~1ms)
  → Async write to Kafka topic "game-moves" (non-blocking)
     Key: gameId (guarantees ordering per game)
     Value: { gameId, seq, from, to, fen_after, clocks, ts }

The game server never reads from the database during active play. The hot path is 100% in-memory. Kafka is the durable record.

Game server crash recovery

1. Game server #47 crashes
2. Load balancer detects health check failure
3. Consistent hash re-routes "abc123" to server #48
4. Both players reconnect (WS reconnect logic in client)
5. Server #48 has no state for game "abc123"
6. Server #48 reads Kafka partition for gameId "abc123"
   → Replays all moves in order → reconstructs GameState
7. Players receive "reconnected" event, game resumes
8. Total downtime: ~3-5 seconds (Kafka replay time for ~40 moves)
async function recoverGame(gameId: string): Promise<GameState> {
  // Read all moves for this game from Kafka (low-watermark to high-watermark)
  const moves = await kafkaConsumer.getMessagesForKey('game-moves', gameId)

  const chess = new Chess()
  let whiteClock = initialTime
  let blackClock = initialTime

  for (const move of moves) {
    chess.move({ from: move.from, to: move.to, promotion: move.promotion })
    whiteClock = move.whiteClock
    blackClock = move.blackClock
  }

  return { gameId, chess, whiteClock, blackClock, expectedSeq: moves.length, ... }
}

6. Matchmaking System

Requirements

  • Match players within ±100 ELO (expand window every 5s to prevent waiting forever)
  • Match by time control: bullet (1+0), blitz (3+2), rapid (10+0), classical (30+0)
  • Prevent extreme mismatches (cap at ±400 ELO regardless of wait time)

Redis sorted set per queue

Key:   "queue:blitz"
Score: ELO rating (numeric — enables ZRANGEBYSCORE)
Value: "playerId:userId123:ts:1718000000:elo:1450"
// Player joins matchmaking
async function joinQueue(userId: string, timeControl: string, elo: number) {
  const member = `${userId}:${Date.now()}:${elo}`
  await redis.zadd(`queue:${timeControl}`, elo, member)
  await redis.expire(`queue:${timeControl}:${userId}`, 120)  // auto-expire after 2 min
}

// Matchmaking worker (runs every 500ms)
async function matchmakingTick(timeControl: string) {
  // Pull all queued players sorted by ELO
  const players = await redis.zrange(`queue:${timeControl}`, 0, -1, 'WITHSCORES')

  for (let i = 0; i < players.length - 1; i++) {
    const [playerA, eloA] = players[i]
    const [playerB, eloB] = players[i + 1]

    const eloDiff = Math.abs(eloA - eloB)
    const waitSeconds = (Date.now() - parseTs(playerA)) / 1000

    // ELO window expands: ±100 base + 20 per 5s waited, capped at ±400
    const maxDiff = Math.min(100 + 20 * Math.floor(waitSeconds / 5), 400)

    if (eloDiff <= maxDiff) {
      // Create game
      await redis.zrem(`queue:${timeControl}`, playerA, playerB)
      await createGame(playerA, playerB, timeControl)
    }
  }
}

Game creation flow

1. Matchmaking worker pairs A + B
2. Calls Game Service: POST /internal/games { whiteId, blackId, timeControl }
3. Game Service:
   a. Generates gameId (UUID v4)
   b. Determines target game server (consistent hash of gameId)
   c. Sends game-created event to that server via internal gRPC
   d. Returns { gameId, wsUrl: "wss://gs47.chess.internal/game/abc123" }
4. Matchmaking worker pushes notification to both players via Redis pub/sub
   → Players receive { type: "game_ready", gameId, wsUrl }
5. Clients connect to the provided wsUrl

7. ELO Rating System

Formula

The standard Elo formula. No magic, just math.

Expected score:
  Ea = 1 / (1 + 10^((Rb - Ra) / 400))

Rating update:
  Ra_new = Ra + K * (Sa - Ea)

Where:
  Sa = actual score (1 = win, 0.5 = draw, 0 = loss)
  Ea = expected score based on rating difference
  K  = K-factor (sensitivity)

K-factor by rating:
  Rating < 2100  → K = 32   (fast convergence for developing players)
  2100 ≤ R < 2400 → K = 16
  Rating ≥ 2400  → K = 10   (protect top ratings from wild swings)
  First 30 games → K = 40   (provisional rating, faster convergence)
function calculateEloChange(
  ratingA: number,
  ratingB: number,
  resultA: 1 | 0.5 | 0,
  gamesPlayedA: number
): { newRatingA: number; change: number } {
  const K = gamesPlayedA < 30 ? 40 : ratingA < 2100 ? 32 : ratingA < 2400 ? 16 : 10
  const expectedA = 1 / (1 + Math.pow(10, (ratingB - ratingA) / 400))
  const change = Math.round(K * (resultA - expectedA))
  return { newRatingA: ratingA + change, change }
}

Atomicity — run in a single DB transaction

BEGIN;
  -- Lock both player rows to prevent concurrent ELO races
  SELECT id, elo_blitz, games_played FROM players
  WHERE id IN ('playerA', 'playerB') FOR UPDATE;

  -- Compute and apply ELO changes (done in application code)
  UPDATE players SET elo_blitz = 1487, games_played = games_played + 1
  WHERE id = 'playerA';

  UPDATE players SET elo_blitz = 1413, games_played = games_played + 1
  WHERE id = 'playerB';

  -- Mark game as complete with ELO changes
  UPDATE games SET result = '1-0', elo_change_white = 17, elo_change_black = -17,
    ended_at = NOW() WHERE game_id = 'abc123';
COMMIT;

Leaderboard — Redis sorted set

ZADD leaderboard:blitz 1487 "playerA"
ZADD leaderboard:blitz 1413 "playerB"

# Top 100 blitz players
ZREVRANGE leaderboard:blitz 0 99 WITHSCORES

# Player rank
ZREVRANK leaderboard:blitz "playerA"

8. Database Design

Schema

CREATE TABLE players (
  id            UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  username      VARCHAR(32) UNIQUE NOT NULL,
  email         VARCHAR(255) UNIQUE NOT NULL,
  elo_bullet    SMALLINT NOT NULL DEFAULT 1200,
  elo_blitz     SMALLINT NOT NULL DEFAULT 1200,
  elo_rapid     SMALLINT NOT NULL DEFAULT 1200,
  elo_classical SMALLINT NOT NULL DEFAULT 1200,
  games_played  INTEGER NOT NULL DEFAULT 0,
  country       CHAR(2),
  created_at    TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE TABLE games (
  game_id         UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  white_player_id UUID NOT NULL REFERENCES players(id),
  black_player_id UUID NOT NULL REFERENCES players(id),
  time_control    VARCHAR(16) NOT NULL,   -- "blitz_3_2", "bullet_1_0"
  result          CHAR(7),                -- '1-0', '0-1', '1/2-1/2'
  end_reason      VARCHAR(32),           -- 'checkmate', 'timeout', 'resign', etc.
  elo_change_white SMALLINT,
  elo_change_black SMALLINT,
  pgn             TEXT,                  -- full game in PGN format (~5KB)
  opening_eco     CHAR(3),               -- "B90" (Sicilian Najdorf)
  opening_name    VARCHAR(64),
  started_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  ended_at        TIMESTAMPTZ,
  server_id       VARCHAR(16)            -- which game server handled this game
)
PARTITION BY RANGE (started_at);

-- Monthly partitions — drop old partitions instead of slow DELETE
CREATE TABLE games_2025_01 PARTITION OF games
  FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
CREATE TABLE games_2025_02 PARTITION OF games
  FOR VALUES FROM ('2025-02-01') TO ('2025-03-01');
-- ... auto-create via pg_partman

-- Moves stored separately — only needed for deep replay/analysis
-- Hot games: moves are in Kafka. Cold games: moves flushed here on game end.
CREATE TABLE moves (
  game_id     UUID NOT NULL,
  seq         SMALLINT NOT NULL,
  from_sq     CHAR(2) NOT NULL,          -- "e2"
  to_sq       CHAR(2) NOT NULL,          -- "e4"
  promotion   CHAR(1),                   -- 'q', 'r', 'b', 'n' or NULL
  fen_after   VARCHAR(100) NOT NULL,
  think_time  INTEGER NOT NULL,          -- ms player spent thinking
  created_at  TIMESTAMPTZ NOT NULL,
  PRIMARY KEY (game_id, seq)
);
-- Co-locate with games: moves partition matches games partition
-- by including started_at or using declarative partitioning

CREATE INDEX idx_games_white ON games(white_player_id, started_at DESC);
CREATE INDEX idx_games_black ON games(black_player_id, started_at DESC);
CREATE INDEX idx_games_time_control ON games(time_control, started_at DESC);

Read/Write split

Write path (game end, ELO update):
  → Primary PostgreSQL (single-region writes)

Read path (game history, player profile):
  → Read replicas (3× replicated)
  → Redis cache: player profile TTL 60s, game list TTL 30s
  → Cache-aside pattern: miss → replica → populate cache

Never read from DB during active game (hot path is 100% in-memory)

9. Game Replay System

Storing PGN on game end

PGN (Portable Game Notation) encodes a complete game in ~2-5 KB. Store it in the games table directly — no need for a separate file store at this scale.

[Event "Rated Blitz game"]
[Site "chess.beyondcodekarma.in"]
[Date "2025.06.15"]
[White "deveshwar"]
[Black "kasparov_fan"]
[Result "1-0"]
[WhiteElo "1487"]
[BlackElo "1413"]
[TimeControl "180+2"]
[ECO "B90"]

1. e4 c5 2. Nf3 d6 3. d4 cxd4 4. Nxd4 Nf6 5. Nc3 a6 ...

Replay API

// GET /api/games/:gameId/replay
interface ReplayResponse {
  pgn: string
  positions: string[]  // FEN at each move (precomputed on game end)
  openingName: string
  openingEco: string
}

// Client-side replay: load PGN, step through with chess.js
function loadReplay(pgn: string) {
  const chess = new Chess()
  chess.loadPgn(pgn)
  const history = chess.history({ verbose: true })

  // Reconstruct position at move N: replay from start (or checkpoint)
  function getFenAtMove(n: number): string {
    const replay = new Chess()
    for (let i = 0; i < n; i++) {
      replay.move(history[i])
    }
    return replay.fen()
  }
}

Checkpointing for fast seeking

Replaying from move 0 to seek to move 80 is O(n). Store FEN snapshots every 20 moves in the moves table (fen_after column). Seeking to move 75 = load checkpoint at move 60, replay 15 moves from there.

Opening classification

// ECO (Encyclopedia of Chess Openings) database: ~2,000 openings
// Static JSON file, ~200 KB, loaded at startup

function classifyOpening(moves: string[]): { eco: string; name: string } | null {
  const key = moves.slice(0, 10).join(' ')  // first 5 moves each side
  return ECO_DATABASE.get(key) ?? null
}

10. Anti-Cheat / Engine Detection

The problem

A human playing with Stockfish assistance would have:

  • Unnaturally consistent move quality (85-95% Stockfish agreement on every move)
  • Very low think time variance (engines don't hesitate)
  • No blunders (humans always blunder, engines never do)

Detection signals

interface GameAnalysis {
  stockfishAgreement: number    // % moves matching Stockfish top choice
  thinkTimeVariance: number     // coefficient of variation of move times
  blunderCount: number          // moves that drop >200 centipawns
  acpl: number                  // average centipawn loss (lower = stronger)
  suspicionScore: number        // weighted composite 0-100
}

function analyzeGame(moves: MoveRecord[], stockfishLines: StockfishLine[]): GameAnalysis {
  const agreements = moves.map((m, i) =>
    m.uci === stockfishLines[i].bestMove ? 1 : 0
  )
  const agreement = agreements.reduce((a, b) => a + b, 0) / moves.length

  // Thresholds for flagging (not banning — humans review)
  // > 90% agreement over 30+ moves = strong signal
  // ACPL < 5 over full game = superhuman (top GMs average ~10-15)

  return { stockfishAgreement: agreement, ... }
}

Architecture: fully async, never block the game

Game ends
  → Game server publishes to Kafka topic "completed-games"
  → Anti-cheat worker (Kafka consumer, separate fleet) processes async
  → Runs Stockfish at depth 20 on every move of the game
  → Computes suspicion score
  → If score > 80: flag account, queue for human review
  → If score > 95 AND 3+ flagged games: auto-suspend + notify support

Never run anti-cheat during a live game — it would reveal analysis in real time.

11. Full Architecture Diagram

                             ┌─────────────────────────────────────────┐
                             │                 CLIENTS                  │
                             │  Browser (WS)  ·  Mobile (WS)           │
                             └────────────┬────────────────┬────────────┘
                                          │                │
                                    WebSocket           REST/HTTP
                                          │                │
                             ┌────────────▼────────────────▼────────────┐
                             │               CDN / Edge                  │
                             │    (Cloudflare — static assets, DDoS)     │
                             └────────────┬────────────────┬────────────┘
                                          │                │
                             ┌────────────▼──────┐  ┌──────▼─────────────┐
                             │  WebSocket Gateway │  │    API Gateway      │
                             │  nginx / Envoy     │  │  (REST endpoints)   │
                             │  50K conns/process │  │  matchmaking, auth  │
                             └────────────┬───────┘  └──────┬─────────────┘
                                          │                  │
                          Consistent hash │                  │
                          on gameId       │                  │
                             ┌────────────▼──────────────────▼─────────┐
                             │          Game Server Ring                 │
                             │   ┌──────┐  ┌──────┐  ┌──────┐          │
                             │   │ GS-1 │  │ GS-2 │  │ GS-N │  × 300   │
                             │   │5K    │  │5K    │  │5K    │          │
                             │   │games │  │games │  │games │          │
                             │   └──┬───┘  └──┬───┘  └──┬───┘          │
                             │      │  Stockfish Pool     │              │
                             │      │  (sidecar, 7 procs) │              │
                             └──────┼─────────┼───────────┼─────────────┘
                                    │         │           │
                          Move WAL  │         │           │
                             ┌──────▼─────────▼───────────▼─────────────┐
                             │              Apache Kafka                  │
                             │   Topic: game-moves   (20 partitions)      │
                             │   Topic: completed-games                   │
                             │   200K moves/sec · 7-day retention         │
                             └──────┬──────────────────────┬─────────────┘
                                    │                      │
                          ┌─────────▼────────┐   ┌────────▼───────────────┐
                          │   Move Persister  │   │  Anti-Cheat Worker      │
                          │ (Kafka consumer)  │   │  (Kafka consumer)       │
                          │ Batch INSERT      │   │  Stockfish depth 20     │
                          │ to moves table    │   │  Flags suspicious games  │
                          └─────────┬─────────┘   └─────────────────────────┘

                             ┌──────▼─────────────────────────────────────┐
                             │           PostgreSQL (primary + replicas)   │
                             │   games (partitioned by month)              │
                             │   moves  ·  players                         │
                             │   Primary (writes) + 3× Read Replicas       │
                             └────────────────────────────────────────────┘

                             ┌──────────────────────────────────────────┐
                             │           Supporting Services              │
                             │                                            │
                             │  ┌──────────────┐  ┌───────────────────┐  │
                             │  │ Matchmaking   │  │   ELO Service     │  │
                             │  │ Service       │  │ (triggered on     │  │
                             │  │               │  │  game end)        │  │
                             │  └──────┬────────┘  └────────┬──────────┘  │
                             │         │                    │              │
                             │  ┌──────▼────────────────────▼──────────┐  │
                             │  │              Redis Cluster             │  │
                             │  │  queue:blitz    (matchmaking ZSET)    │  │
                             │  │  queue:bullet   (matchmaking ZSET)    │  │
                             │  │  leaderboard:*  (ELO sorted set)      │  │
                             │  │  session:*      (player sessions)     │  │
                             │  │  game:*         (short-lived cache)   │  │
                             │  └───────────────────────────────────────┘  │
                             └──────────────────────────────────────────┘

12. Scaling to 1M Concurrent Games

Game server scaling

Target:  1M concurrent games
Per server: 5,000 games (CPU-bound on move validation + broadcast)
Servers needed: 200 (+ 50% buffer = 300)
Auto-scaling trigger: CPU > 60% sustained 2 min → add 20 servers
Consistent hash ring: virtual nodes ensure even distribution
  even when ring size changes (rebalance at most 1/N games)

WebSocket connection scaling

nginx configuration (per gateway node):
  worker_processes auto;          # = vCPU count
  worker_connections 65536;       # per worker
  use epoll;                      # Linux event loop — O(1) I/O
  keepalive_timeout 3600s;        # hold WS connections

Per nginx process: ~50K WS connections
Gateway nodes: 2.5M connections / 50K = 50 nginx nodes
  + reserve: 70 nginx nodes total

Kafka throughput

200K moves/sec × 200 bytes/move = 40 MB/s ingress
20 partitions → 10 partitions for game-moves, 10 for completed-games
  → 20 MB/s per partition — well within 1 Gb/s Kafka broker NICs

Consumer groups:
  - move-persister (10 consumers, batch INSERT to Postgres)
  - anti-cheat (10 consumers, async Stockfish analysis)
  - replay-indexer (2 consumers, update search indexes)

Database scaling

Write load:
  1M games/day → game ends: INSERT game + UPDATE players = ~12 writes/sec
  40M moves/day → 463 moves/sec (batch inserted, not real-time)
  Both trivially handled by a single Postgres primary (can do 10K writes/sec)

Read load:
  Game history, profiles, leaderboard: ~100K reads/sec at peak
  Cache hit rate target: 90% → ~10K reads/sec to Postgres
  3 read replicas handle this easily

Partition management:
  Monthly partitions drop after 24 months (data archived to S3 as parquet)
  Keeps hot tables small, queries fast

Hot path vs cold path

HOT PATH (active game):
  Move received → validate (in-memory) → broadcast (in-memory)
  → async write to Kafka
  Zero database reads. Zero Redis reads.
  Latency: 5-15ms server processing + network RTT

COLD PATH (game history, profile, leaderboard):
  API request → check Redis cache → miss → read replica → populate cache
  Cache TTLs: player profile 60s, game list 30s, leaderboard 10s
  Latency: 50-200ms (acceptable for non-live data)

13. Interview Follow-Up Questions

These always come up. Have crisp answers ready.

"How do you handle a game server crash mid-game?"

Kafka is the source of truth. Every move is durably written to Kafka before the server ACKs the client. On crash, the consistent hash re-routes the gameId to another server. That server reads the Kafka partition for the gameId, replays all moves to reconstruct in-memory state, and accepts the players' WebSocket reconnections. Total downtime: 3-5 seconds. The 99.99% uptime target allows ~52 minutes of downtime/year — a 5-second blip per crash is negligible.

"How do you support tournaments with 10,000 players?"

Create a dedicated Tournament Service. For Swiss tournaments: pre-compute all pairings per round, bulk-create games, route to dedicated game servers (isolated consistent hash ring). Arena tournaments (all games simultaneous): same as regular matchmaking but filtered to registered participants. The key isolation: don't let a 10K-player tournament burst saturate the regular matchmaking queues.

"How do you handle clock synchronization with laggy clients?"

The server is the authoritative clock. The server records lastMoveTs = Date.now() on every move and deducts elapsed time from the active player's clock. The client displays an estimated countdown by tracking network RTT and predicting server time. On move broadcast, the server sends authoritative whiteClock and blackClock values and the client snaps to those. A player with 200ms lag doesn't gain extra time — they just see their clock jump down on the next move.

"Why not use HTTP polling instead of WebSocket?"

At 200K moves/sec with HTTP/1.1: each move requires a full HTTP request (headers ~800 bytes + TLS overhead). That's 200K × 800 bytes = 160 MB/s of headers alone — vs WebSocket's 2-14 byte frame overhead. Polling also adds 50-500ms latency (poll interval). You'd need 10× more servers, have worse latency, and a worse user experience. WebSocket is not premature optimization here — it's the correct tool.

"What if ELO updates race (two games end simultaneously)?"

The SELECT ... FOR UPDATE lock on both player rows in the ELO transaction prevents races. Only one transaction can hold the lock at a time. Since ELO updates are 23/sec average, lock contention is negligible. For the rare super-active player finishing games back-to-back, a 10-100ms wait for the lock is acceptable.

"How do you scale the matchmaking queue?"

Redis sorted sets handle 100K+ operations/sec on a single node. At 10M DAU with perhaps 1% in the matchmaking queue at peak = 100K concurrent queue entries. Redis can handle this trivially. If queues become a bottleneck (unlikely), shard by time control: separate Redis instances for bullet, blitz, rapid, classical.

"How do you prevent players from seeing opponent's clock being wrong after reconnect?"

When a player reconnects after a disconnect, the server sends a full game state sync message with authoritative clocks accounting for time elapsed during the disconnection. The player's clock continues to tick on the server during disconnect — they don't get "free time" by disconnecting.


Summary: Key Decisions

| Decision | Choice | Why | |---|---|---| | Real-time transport | WebSocket | Bidirectional, minimal overhead, native browser support | | Game state storage | In-memory on game server | Zero DB latency on hot path | | Durability | Kafka WAL | Replay recovery, async persistence, decouples writers from readers | | Routing | Consistent hash by gameId | Sticky routing without session cookies; rebalances gracefully | | ELO | Single DB transaction | Prevents rating drift from concurrent game completions | | Matchmaking | Redis sorted set (ZRANGEBYSCORE) | O(log N) nearest-ELO lookup, atomic remove-on-match | | Stockfish | Process pool (reuse processes) | Amortizes 500ms startup cost across many games | | Anti-cheat | Async Kafka consumer | Never blocks live game, doesn't telegraph analysis to cheater | | Partitioning | Monthly range partitions | Fast DROP for old data, queries stay on small hot partitions | | Leaderboard | Redis sorted set | O(log N) insert, O(log N + K) top-K query, atomic updates |