You're designing a real-time multiplayer chess platform at lichess.org / chess.com scale. The site already has a single-player browser game (minimax AI). This design covers scaling that into a full multiplayer platform: matchmaking, live games, spectators, ELO rating, game history, and engine integration — all at 1M concurrent games.
This is a classic FAANG system design interview question. The hard parts are stateful WebSocket routing, server-authoritative game clocks, Stockfish process pooling, and consistent game recovery after crashes.
1. Requirements Clarification
Nail down scope before drawing boxes. Every interviewer has a different idea of what "chess platform" means.
Functional requirements
| Feature | Notes | |---|---| | 1v1 live games | Full chess rules, all time controls | | AI opponent | Minimax (browser) or Stockfish (server), multiple difficulty levels | | Matchmaking | Pair by ELO ± window, by time control | | Spectators | Watch any live game, low latency | | Game history & replay | Move-by-move replay, opening classification | | ELO rating | Per time control (bullet/blitz/rapid/classical), K-factor by rating band | | Tournaments | Swiss/round-robin brackets, arena (stretch) |
Non-functional requirements
| Metric | Target | |---|---| | Concurrent games | 1M | | Move latency | <50ms p99 globally | | DAU | 10M | | Uptime for active games | 99.99% (52 min downtime/year) | | Game history durability | 99.9999% — never lose a completed game |
Out of scope (say this explicitly in interviews)
- Video/voice chat
- Puzzles, lessons, analysis board
- Mobile apps (assume web clients)
- Payment / subscriptions
2. Capacity Estimation
Do the math on the whiteboard. It justifies every architectural decision that follows.
Connections & traffic
Active games: 1,000,000 concurrent
WebSocket connections: 2M players + ~500K spectators = ~2.5M WS connections
Move rate:
1M games × avg 1 move per 5s = 200,000 moves/sec
Peak (tournament burst):
3× baseline = 600,000 moves/sec
Storage
Moves per game: ~40 moves average (80 half-moves / plies)
Games per day: 1M games/day (10M DAU × ~0.1 games/session)
Raw move storage:
40 moves × 100 bytes/move × 1M games = 4 GB/day
PGN storage (compressed):
1M games × 5 KB/game = 5 GB/day → ~15 GB/day with metadata
After 1 year: ~5 TB raw game data
With 3× replication: ~15 TB
Cold storage (S3): Compress 10:1 → 1.5 TB/year
ELO updates
1M games/day × 2 players = 2M ELO updates/day
= ~23 updates/sec — trivially handled by any RDBMS
Game servers
Each game server: 32 vCPUs, 128 GB RAM
In-memory game state: ~50 KB per active game (FEN + move history + clocks)
Games per server: 128 GB × 0.6 / 50 KB ≈ 1,536,000 → cap at 5,000 (CPU bound)
Servers needed: 1M / 5K = 200 game servers
+ 50% headroom = 300 game servers
3. Chess Engine Integration
The existing Wizard's Chess uses client-side minimax. Here's the full engine strategy at platform scale.
Three-tier engine architecture
Tier 1: Browser minimax (existing)
- TypeScript, alpha-beta pruning, depth 4-6
- Zero server cost, works offline
- Strength: ~1000 ELO equivalent
- Use for: "play vs computer" casual mode
Tier 2: WASM Stockfish (browser)
- stockfish.wasm bundled with the app (~6 MB)
- Runs in Web Worker, SharedArrayBuffer for comms
- Depth 12-15, strength ~2200 ELO
- Use for: analysis board, "easy/medium" AI games
- How lichess does it: stockfish.js + WASM SIMD where available
Tier 3: Server-side Stockfish (C++ subprocess)
- Depth 18-22, strength 3200+ ELO
- UCI protocol over stdin/stdout
- Use for: "hard AI" games, anti-cheat analysis
Server-side Stockfish pool (Rust/tokio)
The critical insight: never spawn a new Stockfish process per request. Stockfish startup is ~500ms. Instead, maintain a warm pool and reuse processes across games.
// Engine pool — one pool per difficulty level
struct EnginePool {
engines: Vec<Arc<Mutex<StockfishProcess>>>,
queue: Arc<Mutex<VecDeque<MoveRequest>>>,
}
impl EnginePool {
async fn request_move(&self, fen: &str, depth: u8, timeout_ms: u64) -> String {
// Acquire an idle engine from the pool
let engine = self.acquire_engine().await;
engine.send_command(&format!("position fen {}", fen)).await;
engine.send_command(&format!("go depth {}", depth)).await;
// Parse "bestmove e2e4 ponder e7e5" from UCI output
let bestmove = engine.read_until_bestmove(timeout_ms).await;
self.release_engine(engine).await;
bestmove
}
}
Pool sizing:
"Hard AI" games: 1 Stockfish per 50 concurrent AI games (Stockfish thinks ~2s/move)
1M AI games × 10% = 100K AI games
Pool size: 100K / 50 = 2,000 Stockfish processes across all game servers
Per server (300 servers): ~7 Stockfish processes
RAM: 7 × 200 MB = 1.4 GB/server — acceptable
WASM Stockfish setup (browser)
// Spawn Stockfish in a Web Worker
const worker = new Worker('/stockfish.js')
function requestBestMove(fen: string, depth: number): Promise<string> {
return new Promise((resolve) => {
worker.postMessage(`position fen ${fen}`)
worker.postMessage(`go depth ${depth}`)
worker.onmessage = (e: MessageEvent<string>) => {
if (e.data.startsWith('bestmove')) {
// "bestmove e2e4 ponder e7e5"
resolve(e.data.split(' ')[1])
}
}
})
}
4. Real-Time Move Protocol
Transport choice: WebSocket wins
| Protocol | Latency | Overhead | Bidirectional | Verdict | |---|---|---|---|---| | HTTP Polling | High (poll interval) | Very high (full HTTP headers) | No | ❌ | | SSE | Low | Low | Server → client only | ❌ | | WebSocket | Lowest | Minimal (2-14 byte frames) | Yes | ✅ | | WebRTC | Lowest | High (STUN/TURN setup) | Yes | ❌ (P2P unreliable) |
At 200K moves/sec, HTTP polling would require ~10× more servers just for connection overhead.
Move message format
// Client → Server
interface MoveMessage {
type: 'move'
gameId: string
seq: number // monotonically increasing per game, prevents ghost moves
from: string // "e2"
to: string // "e4"
promotion: 'q' | 'r' | 'b' | 'n' | null
clientTs: number // client timestamp for clock sync diagnostics
}
// Server → Client (broadcast to both players + spectators)
interface MoveBroadcast {
type: 'move'
gameId: string
seq: number
from: string
to: string
promotion: 'q' | 'r' | 'b' | 'n' | null
fen: string // authoritative FEN after move
whiteClock: number // ms remaining
blackClock: number // ms remaining
serverTs: number // server timestamp (authoritative)
}
// Server → Client (on illegal move)
interface MoveRejected {
type: 'move_rejected'
gameId: string
seq: number
reason: 'illegal_move' | 'not_your_turn' | 'game_over' | 'clock_expired'
fen: string // reset client to this FEN
}
// Server → Client (game over)
interface GameOver {
type: 'game_over'
gameId: string
result: '1-0' | '0-1' | '1/2-1/2'
reason: 'checkmate' | 'resignation' | 'timeout' | 'draw_agreement' | 'stalemate' | 'insufficient_material'
eloChange: { white: number; black: number }
}
Server-side move validation — never trust the client
import { Chess } from 'chess.js' // or equivalent Rust chess library
function validateAndApplyMove(
game: GameState,
msg: MoveMessage
): MoveBroadcast | MoveRejected {
// 1. Sequence check — reject duplicate or out-of-order moves
if (msg.seq !== game.expectedSeq) {
return { type: 'move_rejected', reason: 'illegal_move', ...}
}
// 2. Turn check
const isWhiteTurn = game.chess.turn() === 'w'
if (isWhiteTurn && msg.playerId !== game.whitePlayerId) {
return { type: 'move_rejected', reason: 'not_your_turn', ...}
}
// 3. Clock check — server is authoritative
const elapsed = Date.now() - game.lastMoveTs
if (elapsed > game.activeClock + CLOCK_GRACE_MS) {
return { type: 'move_rejected', reason: 'clock_expired', ...}
}
// 4. Legality check via chess rules engine
const result = game.chess.move({ from: msg.from, to: msg.to, promotion: msg.promotion })
if (!result) {
return { type: 'move_rejected', reason: 'illegal_move', ...}
}
// 5. Update clocks
game.activeClock -= elapsed
game.activeClock += game.increment
game.lastMoveTs = Date.now()
game.expectedSeq++
return {
type: 'move',
fen: game.chess.fen(),
whiteClock: game.whiteClock,
blackClock: game.blackClock,
...
}
}
Optimistic UI with rollback
// Client-side: render move immediately, revert if rejected
function handleLocalMove(from: string, to: string) {
const savedFen = chessEngine.fen()
// Optimistic: show move immediately
chessEngine.move({ from, to })
renderBoard(chessEngine.fen())
// Send to server
ws.send(JSON.stringify({ type: 'move', from, to, seq: localSeq++ }))
// Set rollback timer — if no server ACK in 3s, something is wrong
const rollbackTimer = setTimeout(() => {
chessEngine.load(savedFen)
renderBoard(savedFen)
}, 3000)
pendingMoves.set(localSeq, { rollbackTimer, savedFen })
}
// Server confirms or rejects
ws.onmessage = (event) => {
const msg = JSON.parse(event.data)
if (msg.type === 'move') {
clearTimeout(pendingMoves.get(msg.seq)?.rollbackTimer)
// Load authoritative FEN — corrects any drift
chessEngine.load(msg.fen)
updateClocks(msg.whiteClock, msg.blackClock)
}
if (msg.type === 'move_rejected') {
const pending = pendingMoves.get(msg.seq)
clearTimeout(pending?.rollbackTimer)
chessEngine.load(msg.fen) // reset to server state
renderBoard(msg.fen)
}
}
5. Stateful Game Server Architecture
The core problem: WebSockets are stateful
HTTP is stateless — any server can handle any request. WebSocket connections are long-lived and hold game state in memory. You can't freely round-robin them. A move from player A and player B for the same game must hit the same server process.
Solution: Sticky routing via consistent hashing
gameId → hash → game server index
gameId "abc123" → SHA256 → mod 300 = server #47
All WebSocket connections for game "abc123" route to server #47
The WebSocket gateway (nginx/Envoy) reads the gameId from the URL path (/ws/game/abc123) and uses a consistent hash upstream to select the game server. This is stateless at the gateway layer — no session affinity cookies needed.
In-memory game state
interface GameState {
gameId: string
whitePlayerId: string
blackPlayerId: string
chess: Chess // chess.js instance — current position
moveHistory: MoveRecord[]
whiteClock: number // ms remaining
blackClock: number // ms remaining
increment: number // ms added per move
lastMoveTs: number // epoch ms of last move (for clock deduction)
expectedSeq: number
spectatorSockets: Set<WebSocket>
whiteSocket: WebSocket | null // null if disconnected
blackSocket: WebSocket | null
status: 'waiting' | 'active' | 'finished'
}
// 300 servers × 5,000 games × 50 KB = 75 GB total in-memory game state
// Fits comfortably in 128 GB RAM per server with headroom
Write-ahead log: every move goes to Kafka
Player move arrives
→ Server validates move
→ Update in-memory GameState
→ Broadcast to players + spectators (fast path, ~1ms)
→ Async write to Kafka topic "game-moves" (non-blocking)
Key: gameId (guarantees ordering per game)
Value: { gameId, seq, from, to, fen_after, clocks, ts }
The game server never reads from the database during active play. The hot path is 100% in-memory. Kafka is the durable record.
Game server crash recovery
1. Game server #47 crashes
2. Load balancer detects health check failure
3. Consistent hash re-routes "abc123" to server #48
4. Both players reconnect (WS reconnect logic in client)
5. Server #48 has no state for game "abc123"
6. Server #48 reads Kafka partition for gameId "abc123"
→ Replays all moves in order → reconstructs GameState
7. Players receive "reconnected" event, game resumes
8. Total downtime: ~3-5 seconds (Kafka replay time for ~40 moves)
async function recoverGame(gameId: string): Promise<GameState> {
// Read all moves for this game from Kafka (low-watermark to high-watermark)
const moves = await kafkaConsumer.getMessagesForKey('game-moves', gameId)
const chess = new Chess()
let whiteClock = initialTime
let blackClock = initialTime
for (const move of moves) {
chess.move({ from: move.from, to: move.to, promotion: move.promotion })
whiteClock = move.whiteClock
blackClock = move.blackClock
}
return { gameId, chess, whiteClock, blackClock, expectedSeq: moves.length, ... }
}
6. Matchmaking System
Requirements
- Match players within ±100 ELO (expand window every 5s to prevent waiting forever)
- Match by time control: bullet (1+0), blitz (3+2), rapid (10+0), classical (30+0)
- Prevent extreme mismatches (cap at ±400 ELO regardless of wait time)
Redis sorted set per queue
Key: "queue:blitz"
Score: ELO rating (numeric — enables ZRANGEBYSCORE)
Value: "playerId:userId123:ts:1718000000:elo:1450"
// Player joins matchmaking
async function joinQueue(userId: string, timeControl: string, elo: number) {
const member = `${userId}:${Date.now()}:${elo}`
await redis.zadd(`queue:${timeControl}`, elo, member)
await redis.expire(`queue:${timeControl}:${userId}`, 120) // auto-expire after 2 min
}
// Matchmaking worker (runs every 500ms)
async function matchmakingTick(timeControl: string) {
// Pull all queued players sorted by ELO
const players = await redis.zrange(`queue:${timeControl}`, 0, -1, 'WITHSCORES')
for (let i = 0; i < players.length - 1; i++) {
const [playerA, eloA] = players[i]
const [playerB, eloB] = players[i + 1]
const eloDiff = Math.abs(eloA - eloB)
const waitSeconds = (Date.now() - parseTs(playerA)) / 1000
// ELO window expands: ±100 base + 20 per 5s waited, capped at ±400
const maxDiff = Math.min(100 + 20 * Math.floor(waitSeconds / 5), 400)
if (eloDiff <= maxDiff) {
// Create game
await redis.zrem(`queue:${timeControl}`, playerA, playerB)
await createGame(playerA, playerB, timeControl)
}
}
}
Game creation flow
1. Matchmaking worker pairs A + B
2. Calls Game Service: POST /internal/games { whiteId, blackId, timeControl }
3. Game Service:
a. Generates gameId (UUID v4)
b. Determines target game server (consistent hash of gameId)
c. Sends game-created event to that server via internal gRPC
d. Returns { gameId, wsUrl: "wss://gs47.chess.internal/game/abc123" }
4. Matchmaking worker pushes notification to both players via Redis pub/sub
→ Players receive { type: "game_ready", gameId, wsUrl }
5. Clients connect to the provided wsUrl
7. ELO Rating System
Formula
The standard Elo formula. No magic, just math.
Expected score:
Ea = 1 / (1 + 10^((Rb - Ra) / 400))
Rating update:
Ra_new = Ra + K * (Sa - Ea)
Where:
Sa = actual score (1 = win, 0.5 = draw, 0 = loss)
Ea = expected score based on rating difference
K = K-factor (sensitivity)
K-factor by rating:
Rating < 2100 → K = 32 (fast convergence for developing players)
2100 ≤ R < 2400 → K = 16
Rating ≥ 2400 → K = 10 (protect top ratings from wild swings)
First 30 games → K = 40 (provisional rating, faster convergence)
function calculateEloChange(
ratingA: number,
ratingB: number,
resultA: 1 | 0.5 | 0,
gamesPlayedA: number
): { newRatingA: number; change: number } {
const K = gamesPlayedA < 30 ? 40 : ratingA < 2100 ? 32 : ratingA < 2400 ? 16 : 10
const expectedA = 1 / (1 + Math.pow(10, (ratingB - ratingA) / 400))
const change = Math.round(K * (resultA - expectedA))
return { newRatingA: ratingA + change, change }
}
Atomicity — run in a single DB transaction
BEGIN;
-- Lock both player rows to prevent concurrent ELO races
SELECT id, elo_blitz, games_played FROM players
WHERE id IN ('playerA', 'playerB') FOR UPDATE;
-- Compute and apply ELO changes (done in application code)
UPDATE players SET elo_blitz = 1487, games_played = games_played + 1
WHERE id = 'playerA';
UPDATE players SET elo_blitz = 1413, games_played = games_played + 1
WHERE id = 'playerB';
-- Mark game as complete with ELO changes
UPDATE games SET result = '1-0', elo_change_white = 17, elo_change_black = -17,
ended_at = NOW() WHERE game_id = 'abc123';
COMMIT;
Leaderboard — Redis sorted set
ZADD leaderboard:blitz 1487 "playerA"
ZADD leaderboard:blitz 1413 "playerB"
# Top 100 blitz players
ZREVRANGE leaderboard:blitz 0 99 WITHSCORES
# Player rank
ZREVRANK leaderboard:blitz "playerA"
8. Database Design
Schema
CREATE TABLE players (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
username VARCHAR(32) UNIQUE NOT NULL,
email VARCHAR(255) UNIQUE NOT NULL,
elo_bullet SMALLINT NOT NULL DEFAULT 1200,
elo_blitz SMALLINT NOT NULL DEFAULT 1200,
elo_rapid SMALLINT NOT NULL DEFAULT 1200,
elo_classical SMALLINT NOT NULL DEFAULT 1200,
games_played INTEGER NOT NULL DEFAULT 0,
country CHAR(2),
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE TABLE games (
game_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
white_player_id UUID NOT NULL REFERENCES players(id),
black_player_id UUID NOT NULL REFERENCES players(id),
time_control VARCHAR(16) NOT NULL, -- "blitz_3_2", "bullet_1_0"
result CHAR(7), -- '1-0', '0-1', '1/2-1/2'
end_reason VARCHAR(32), -- 'checkmate', 'timeout', 'resign', etc.
elo_change_white SMALLINT,
elo_change_black SMALLINT,
pgn TEXT, -- full game in PGN format (~5KB)
opening_eco CHAR(3), -- "B90" (Sicilian Najdorf)
opening_name VARCHAR(64),
started_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
ended_at TIMESTAMPTZ,
server_id VARCHAR(16) -- which game server handled this game
)
PARTITION BY RANGE (started_at);
-- Monthly partitions — drop old partitions instead of slow DELETE
CREATE TABLE games_2025_01 PARTITION OF games
FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
CREATE TABLE games_2025_02 PARTITION OF games
FOR VALUES FROM ('2025-02-01') TO ('2025-03-01');
-- ... auto-create via pg_partman
-- Moves stored separately — only needed for deep replay/analysis
-- Hot games: moves are in Kafka. Cold games: moves flushed here on game end.
CREATE TABLE moves (
game_id UUID NOT NULL,
seq SMALLINT NOT NULL,
from_sq CHAR(2) NOT NULL, -- "e2"
to_sq CHAR(2) NOT NULL, -- "e4"
promotion CHAR(1), -- 'q', 'r', 'b', 'n' or NULL
fen_after VARCHAR(100) NOT NULL,
think_time INTEGER NOT NULL, -- ms player spent thinking
created_at TIMESTAMPTZ NOT NULL,
PRIMARY KEY (game_id, seq)
);
-- Co-locate with games: moves partition matches games partition
-- by including started_at or using declarative partitioning
CREATE INDEX idx_games_white ON games(white_player_id, started_at DESC);
CREATE INDEX idx_games_black ON games(black_player_id, started_at DESC);
CREATE INDEX idx_games_time_control ON games(time_control, started_at DESC);
Read/Write split
Write path (game end, ELO update):
→ Primary PostgreSQL (single-region writes)
Read path (game history, player profile):
→ Read replicas (3× replicated)
→ Redis cache: player profile TTL 60s, game list TTL 30s
→ Cache-aside pattern: miss → replica → populate cache
Never read from DB during active game (hot path is 100% in-memory)
9. Game Replay System
Storing PGN on game end
PGN (Portable Game Notation) encodes a complete game in ~2-5 KB. Store it in the games table directly — no need for a separate file store at this scale.
[Event "Rated Blitz game"]
[Site "chess.beyondcodekarma.in"]
[Date "2025.06.15"]
[White "deveshwar"]
[Black "kasparov_fan"]
[Result "1-0"]
[WhiteElo "1487"]
[BlackElo "1413"]
[TimeControl "180+2"]
[ECO "B90"]
1. e4 c5 2. Nf3 d6 3. d4 cxd4 4. Nxd4 Nf6 5. Nc3 a6 ...
Replay API
// GET /api/games/:gameId/replay
interface ReplayResponse {
pgn: string
positions: string[] // FEN at each move (precomputed on game end)
openingName: string
openingEco: string
}
// Client-side replay: load PGN, step through with chess.js
function loadReplay(pgn: string) {
const chess = new Chess()
chess.loadPgn(pgn)
const history = chess.history({ verbose: true })
// Reconstruct position at move N: replay from start (or checkpoint)
function getFenAtMove(n: number): string {
const replay = new Chess()
for (let i = 0; i < n; i++) {
replay.move(history[i])
}
return replay.fen()
}
}
Checkpointing for fast seeking
Replaying from move 0 to seek to move 80 is O(n). Store FEN snapshots every 20 moves in the moves table (fen_after column). Seeking to move 75 = load checkpoint at move 60, replay 15 moves from there.
Opening classification
// ECO (Encyclopedia of Chess Openings) database: ~2,000 openings
// Static JSON file, ~200 KB, loaded at startup
function classifyOpening(moves: string[]): { eco: string; name: string } | null {
const key = moves.slice(0, 10).join(' ') // first 5 moves each side
return ECO_DATABASE.get(key) ?? null
}
10. Anti-Cheat / Engine Detection
The problem
A human playing with Stockfish assistance would have:
- Unnaturally consistent move quality (85-95% Stockfish agreement on every move)
- Very low think time variance (engines don't hesitate)
- No blunders (humans always blunder, engines never do)
Detection signals
interface GameAnalysis {
stockfishAgreement: number // % moves matching Stockfish top choice
thinkTimeVariance: number // coefficient of variation of move times
blunderCount: number // moves that drop >200 centipawns
acpl: number // average centipawn loss (lower = stronger)
suspicionScore: number // weighted composite 0-100
}
function analyzeGame(moves: MoveRecord[], stockfishLines: StockfishLine[]): GameAnalysis {
const agreements = moves.map((m, i) =>
m.uci === stockfishLines[i].bestMove ? 1 : 0
)
const agreement = agreements.reduce((a, b) => a + b, 0) / moves.length
// Thresholds for flagging (not banning — humans review)
// > 90% agreement over 30+ moves = strong signal
// ACPL < 5 over full game = superhuman (top GMs average ~10-15)
return { stockfishAgreement: agreement, ... }
}
Architecture: fully async, never block the game
Game ends
→ Game server publishes to Kafka topic "completed-games"
→ Anti-cheat worker (Kafka consumer, separate fleet) processes async
→ Runs Stockfish at depth 20 on every move of the game
→ Computes suspicion score
→ If score > 80: flag account, queue for human review
→ If score > 95 AND 3+ flagged games: auto-suspend + notify support
Never run anti-cheat during a live game — it would reveal analysis in real time.
11. Full Architecture Diagram
┌─────────────────────────────────────────┐
│ CLIENTS │
│ Browser (WS) · Mobile (WS) │
└────────────┬────────────────┬────────────┘
│ │
WebSocket REST/HTTP
│ │
┌────────────▼────────────────▼────────────┐
│ CDN / Edge │
│ (Cloudflare — static assets, DDoS) │
└────────────┬────────────────┬────────────┘
│ │
┌────────────▼──────┐ ┌──────▼─────────────┐
│ WebSocket Gateway │ │ API Gateway │
│ nginx / Envoy │ │ (REST endpoints) │
│ 50K conns/process │ │ matchmaking, auth │
└────────────┬───────┘ └──────┬─────────────┘
│ │
Consistent hash │ │
on gameId │ │
┌────────────▼──────────────────▼─────────┐
│ Game Server Ring │
│ ┌──────┐ ┌──────┐ ┌──────┐ │
│ │ GS-1 │ │ GS-2 │ │ GS-N │ × 300 │
│ │5K │ │5K │ │5K │ │
│ │games │ │games │ │games │ │
│ └──┬───┘ └──┬───┘ └──┬───┘ │
│ │ Stockfish Pool │ │
│ │ (sidecar, 7 procs) │ │
└──────┼─────────┼───────────┼─────────────┘
│ │ │
Move WAL │ │ │
┌──────▼─────────▼───────────▼─────────────┐
│ Apache Kafka │
│ Topic: game-moves (20 partitions) │
│ Topic: completed-games │
│ 200K moves/sec · 7-day retention │
└──────┬──────────────────────┬─────────────┘
│ │
┌─────────▼────────┐ ┌────────▼───────────────┐
│ Move Persister │ │ Anti-Cheat Worker │
│ (Kafka consumer) │ │ (Kafka consumer) │
│ Batch INSERT │ │ Stockfish depth 20 │
│ to moves table │ │ Flags suspicious games │
└─────────┬─────────┘ └─────────────────────────┘
│
┌──────▼─────────────────────────────────────┐
│ PostgreSQL (primary + replicas) │
│ games (partitioned by month) │
│ moves · players │
│ Primary (writes) + 3× Read Replicas │
└────────────────────────────────────────────┘
┌──────────────────────────────────────────┐
│ Supporting Services │
│ │
│ ┌──────────────┐ ┌───────────────────┐ │
│ │ Matchmaking │ │ ELO Service │ │
│ │ Service │ │ (triggered on │ │
│ │ │ │ game end) │ │
│ └──────┬────────┘ └────────┬──────────┘ │
│ │ │ │
│ ┌──────▼────────────────────▼──────────┐ │
│ │ Redis Cluster │ │
│ │ queue:blitz (matchmaking ZSET) │ │
│ │ queue:bullet (matchmaking ZSET) │ │
│ │ leaderboard:* (ELO sorted set) │ │
│ │ session:* (player sessions) │ │
│ │ game:* (short-lived cache) │ │
│ └───────────────────────────────────────┘ │
└──────────────────────────────────────────┘
12. Scaling to 1M Concurrent Games
Game server scaling
Target: 1M concurrent games
Per server: 5,000 games (CPU-bound on move validation + broadcast)
Servers needed: 200 (+ 50% buffer = 300)
Auto-scaling trigger: CPU > 60% sustained 2 min → add 20 servers
Consistent hash ring: virtual nodes ensure even distribution
even when ring size changes (rebalance at most 1/N games)
WebSocket connection scaling
nginx configuration (per gateway node):
worker_processes auto; # = vCPU count
worker_connections 65536; # per worker
use epoll; # Linux event loop — O(1) I/O
keepalive_timeout 3600s; # hold WS connections
Per nginx process: ~50K WS connections
Gateway nodes: 2.5M connections / 50K = 50 nginx nodes
+ reserve: 70 nginx nodes total
Kafka throughput
200K moves/sec × 200 bytes/move = 40 MB/s ingress
20 partitions → 10 partitions for game-moves, 10 for completed-games
→ 20 MB/s per partition — well within 1 Gb/s Kafka broker NICs
Consumer groups:
- move-persister (10 consumers, batch INSERT to Postgres)
- anti-cheat (10 consumers, async Stockfish analysis)
- replay-indexer (2 consumers, update search indexes)
Database scaling
Write load:
1M games/day → game ends: INSERT game + UPDATE players = ~12 writes/sec
40M moves/day → 463 moves/sec (batch inserted, not real-time)
Both trivially handled by a single Postgres primary (can do 10K writes/sec)
Read load:
Game history, profiles, leaderboard: ~100K reads/sec at peak
Cache hit rate target: 90% → ~10K reads/sec to Postgres
3 read replicas handle this easily
Partition management:
Monthly partitions drop after 24 months (data archived to S3 as parquet)
Keeps hot tables small, queries fast
Hot path vs cold path
HOT PATH (active game):
Move received → validate (in-memory) → broadcast (in-memory)
→ async write to Kafka
Zero database reads. Zero Redis reads.
Latency: 5-15ms server processing + network RTT
COLD PATH (game history, profile, leaderboard):
API request → check Redis cache → miss → read replica → populate cache
Cache TTLs: player profile 60s, game list 30s, leaderboard 10s
Latency: 50-200ms (acceptable for non-live data)
13. Interview Follow-Up Questions
These always come up. Have crisp answers ready.
"How do you handle a game server crash mid-game?"
Kafka is the source of truth. Every move is durably written to Kafka before the server ACKs the client. On crash, the consistent hash re-routes the gameId to another server. That server reads the Kafka partition for the gameId, replays all moves to reconstruct in-memory state, and accepts the players' WebSocket reconnections. Total downtime: 3-5 seconds. The 99.99% uptime target allows ~52 minutes of downtime/year — a 5-second blip per crash is negligible.
"How do you support tournaments with 10,000 players?"
Create a dedicated Tournament Service. For Swiss tournaments: pre-compute all pairings per round, bulk-create games, route to dedicated game servers (isolated consistent hash ring). Arena tournaments (all games simultaneous): same as regular matchmaking but filtered to registered participants. The key isolation: don't let a 10K-player tournament burst saturate the regular matchmaking queues.
"How do you handle clock synchronization with laggy clients?"
The server is the authoritative clock. The server records lastMoveTs = Date.now() on every move and deducts elapsed time from the active player's clock. The client displays an estimated countdown by tracking network RTT and predicting server time. On move broadcast, the server sends authoritative whiteClock and blackClock values and the client snaps to those. A player with 200ms lag doesn't gain extra time — they just see their clock jump down on the next move.
"Why not use HTTP polling instead of WebSocket?"
At 200K moves/sec with HTTP/1.1: each move requires a full HTTP request (headers ~800 bytes + TLS overhead). That's 200K × 800 bytes = 160 MB/s of headers alone — vs WebSocket's 2-14 byte frame overhead. Polling also adds 50-500ms latency (poll interval). You'd need 10× more servers, have worse latency, and a worse user experience. WebSocket is not premature optimization here — it's the correct tool.
"What if ELO updates race (two games end simultaneously)?"
The SELECT ... FOR UPDATE lock on both player rows in the ELO transaction prevents races. Only one transaction can hold the lock at a time. Since ELO updates are 23/sec average, lock contention is negligible. For the rare super-active player finishing games back-to-back, a 10-100ms wait for the lock is acceptable.
"How do you scale the matchmaking queue?"
Redis sorted sets handle 100K+ operations/sec on a single node. At 10M DAU with perhaps 1% in the matchmaking queue at peak = 100K concurrent queue entries. Redis can handle this trivially. If queues become a bottleneck (unlikely), shard by time control: separate Redis instances for bullet, blitz, rapid, classical.
"How do you prevent players from seeing opponent's clock being wrong after reconnect?"
When a player reconnects after a disconnect, the server sends a full game state sync message with authoritative clocks accounting for time elapsed during the disconnection. The player's clock continues to tick on the server during disconnect — they don't get "free time" by disconnecting.
Summary: Key Decisions
| Decision | Choice | Why | |---|---|---| | Real-time transport | WebSocket | Bidirectional, minimal overhead, native browser support | | Game state storage | In-memory on game server | Zero DB latency on hot path | | Durability | Kafka WAL | Replay recovery, async persistence, decouples writers from readers | | Routing | Consistent hash by gameId | Sticky routing without session cookies; rebalances gracefully | | ELO | Single DB transaction | Prevents rating drift from concurrent game completions | | Matchmaking | Redis sorted set (ZRANGEBYSCORE) | O(log N) nearest-ELO lookup, atomic remove-on-match | | Stockfish | Process pool (reuse processes) | Amortizes 500ms startup cost across many games | | Anti-cheat | Async Kafka consumer | Never blocks live game, doesn't telegraph analysis to cheater | | Partitioning | Monthly range partitions | Fast DROP for old data, queries stay on small hot partitions | | Leaderboard | Redis sorted set | O(log N) insert, O(log N + K) top-K query, atomic updates |