System Design · Topic 10 of 16

Consistent Hashing

150 XP

The Problem: Naive Modulo Sharding Breaks on Every Change

Most engineers reach for key % N when sharding data across N servers. It’s simple, deterministic, and works perfectly until you need to add or remove a server.

Suppose you have 4 cache nodes and a key hashes to 17. 17 % 4 = 1, so the key lives on node 1. Now you add a 5th node. Suddenly 17 % 5 = 2 — the key moved to node 2. Almost every key in the cluster gets remapped, causing a cache invalidation storm that can take down your origin servers.

The math is brutal: when going from N to N+1 nodes, approximately N/(N+1) keys move — that's 80% of all keys when scaling from 4 to 5 nodes.

Naive modulo: 4 nodes → 5 nodes

Before: key=17 → node 17%4 = 1
After:  key=17 → node 17%5 = 2   ← moved!

Before: key=22 → node 22%4 = 2
After:  key=22 → node 22%5 = 2   ← stayed (lucky)

Before: key=11 → node 11%4 = 3
After:  key=11 → node 11%5 = 1   ← moved!

In fact, exactly 80% of keys are remapped on a 4→5 scaling event: of every 20 consecutive keys, only 4 keep the same owner.
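The remapping fraction is easy to verify by brute force. A quick sketch over synthetic integer keys:

```typescript
// Fraction of keys whose owner changes when resizing from `from` to `to`
// nodes under naive modulo sharding
function remappedFraction(totalKeys: number, from: number, to: number): number {
  let moved = 0;
  for (let key = 0; key < totalKeys; key++) {
    if (key % from !== key % to) moved++;
  }
  return moved / totalKeys;
}

console.log(remappedFraction(100_000, 4, 5)); // 0.8 — 80% of keys move
```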

This is catastrophic for:

  • Distributed caches (Memcached, Redis): every remapped key is a cache miss that hits the database
  • Sharded databases: data must physically migrate between nodes
  • Partitioned queues: consumers lose track of which partition holds their data

Consistent hashing solves this by ensuring only K/N keys move when a node is added, where K is the total number of keys and N is the new node count — the theoretical minimum.


The Hash Ring

Consistent hashing maps both nodes and keys onto the same circular hash space — a ring of values from 0 to 2³² - 1 (or 0 to 2⁶⁴ - 1 for 64-bit hashes).

          0 (= 2^32, wraps)
            /       \
     Node C          Node A
    (at 805M)       (at 268M)
            \       /
            Node B
           (at 536M)

Key lookup: hash the key, walk clockwise until you hit a node.

Placement algorithm:

  1. Hash each node identifier (e.g., hash("node-A")) to a position on the ring
  2. To find which node owns a key: hash(key) → walk clockwise on the ring → first node you reach owns the key

Node addition: Only the keys between the new node and its predecessor get remapped. All other keys stay put.

Node removal: Only the keys owned by the removed node get remapped to its clockwise successor.

Adding Node D between Node A and Node B:

Before:
  Keys in range (Node A → Node B): owned by Node B

After:
  Keys in range (Node A → Node D): owned by Node D   ← moved
  Keys in range (Node D → Node B): owned by Node B   ← unchanged

Only ~1/N keys move on average — the minimum theoretically possible.
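This minimality has a checkable invariant: every key that changes owner moves to the new node, never between existing nodes. A minimal sketch — one point per node, no vnodes, and FNV-1a as an illustrative 32-bit hash (any well-mixed hash works):

```typescript
// FNV-1a: an illustrative 32-bit string hash (not a recommendation)
function fnv1a(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

// First node clockwise from hash(key), wrapping past 0
function owner(key: string, nodes: string[]): string {
  const points = nodes.map(n => ({ n, p: fnv1a(n) })).sort((a, b) => a.p - b.p);
  const h = fnv1a(key);
  return (points.find(x => x.p >= h) ?? points[0]).n;
}

// Invariant check: when node-E joins, any key whose owner changes
// must now be owned by node-E
const before = ["node-A", "node-B", "node-C", "node-D"];
const after = [...before, "node-E"];
for (let i = 0; i < 10_000; i++) {
  const key = `key-${i}`;
  const was = owner(key, before);
  const now = owner(key, after);
  if (was !== now && now !== "node-E") throw new Error("impossible move");
}
```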

The Hot Spot Problem and Virtual Nodes

A naive hash ring with one point per physical node causes two severe problems:

1. Uneven distribution: With 3 nodes hashed to random positions, some arcs are much longer than others. The node at the end of a long arc owns far more keys than others.

2. Heterogeneous hardware is ignored: A node with 64 GB RAM should handle more keys than one with 16 GB, but a single ring point gives them equal share.

Virtual nodes (vnodes) solve both problems. Instead of one ring position per physical node, you assign many positions. Each position is a vnode; multiple vnodes map to the same physical node.

Physical ring with 3 nodes — highly uneven:

          0
        /   \
    [C]       [A]
    |              \
    |                [B]
    (C owns 40% of ring, A owns 35%, B owns 25%)


Virtual nodes — each physical node gets 150 positions:

  A1  C3  B2  A7  C1  A3  B5  C2  A2  B1  A4  C4  B3 ...
  ←──────────────────────────── ring ──────────────────────────→

  Each node owns ~33% of the ring regardless of hash luck.
  Add Node D: D gets 150 vnodes, each steals keys from a different neighbor.

Why 150 vnodes per node? The law of large numbers: with enough vnodes, each physical node's total share converges toward 1/N of the ring regardless of where individual hashes happen to land. The Dynamo paper likewise assigned each node many tokens, and Cassandra defaulted to 256 tokens per node for years (lowered to 16 in Cassandra 4.0).
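The convergence is easy to demonstrate. A sketch that places vnodes on a 2³² ring (FNV-1a as an illustrative hash; node names and the vnode count are arbitrary):

```typescript
// FNV-1a: illustrative 32-bit string hash
function fnv1a(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

// Fraction of the 2^32 ring owned by each node, with `vnodes` points per node
function shares(nodes: string[], vnodes: number): Map<string, number> {
  const points: { p: number; n: string }[] = [];
  for (const n of nodes)
    for (let i = 0; i < vnodes; i++) points.push({ p: fnv1a(`${n}#${i}`), n });
  points.sort((a, b) => a.p - b.p);

  const RING = 2 ** 32;
  const owned = new Map<string, number>(nodes.map(n => [n, 0]));
  for (let i = 0; i < points.length; i++) {
    // Each point owns the arc back to its predecessor (wrapping at 0)
    const prev = i === 0 ? points[points.length - 1].p - RING : points[i - 1].p;
    owned.set(points[i].n, owned.get(points[i].n)! + points[i].p - prev);
  }
  return new Map<string, number>(nodes.map(n => [n, owned.get(n)! / RING]));
}

// With 1 vnode per node shares can be wildly uneven;
// with 150 each hovers near 1/3
console.log(shares(["A", "B", "C"], 150));
```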

Weighted vnodes for heterogeneous hardware:

  • 16 GB node → 100 vnodes
  • 32 GB node → 200 vnodes
  • 64 GB node → 400 vnodes

The proportion of vnodes determines the proportion of data owned, giving you fine-grained load control without rebalancing.
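The weighting above can be captured in a one-line helper. A sketch assuming the baseline from the list — 100 vnodes per 16 GB — which is an arbitrary choice, not a standard:

```typescript
// Hypothetical helper: vnode count proportional to node memory,
// normalized to 100 vnodes per 16 GB (the baseline is arbitrary)
function vnodeCount(memoryGb: number, baseVnodes = 100, baseGb = 16): number {
  return Math.round(baseVnodes * (memoryGb / baseGb));
}

vnodeCount(16); // 100
vnodeCount(32); // 200
vnodeCount(64); // 400
```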


TypeScript Implementation

A working consistent hash ring backed by a sorted array with binary-search lookups:

import { createHash } from "crypto";

type HashFunction = (input: string) => number;

function defaultHash(input: string): number {
  const hex = createHash("md5").update(input).digest("hex").slice(0, 8);
  return parseInt(hex, 16); // 32-bit unsigned integer
}

class SortedRing {
  private points: number[] = [];

  insert(point: number): void {
    let lo = 0, hi = this.points.length;
    while (lo < hi) {
      const mid = (lo + hi) >>> 1;
      if (this.points[mid] < point) lo = mid + 1;
      else hi = mid;
    }
    this.points.splice(lo, 0, point);
  }

  remove(point: number): void {
    const idx = this.points.indexOf(point);
    if (idx !== -1) this.points.splice(idx, 1);
  }

  /** Returns the index of the first point >= value, wrapping to 0 */
  successor(value: number): number {
    let lo = 0, hi = this.points.length;
    while (lo < hi) {
      const mid = (lo + hi) >>> 1;
      if (this.points[mid] < value) lo = mid + 1;
      else hi = mid;
    }
    return lo % this.points.length;
  }

  get(index: number): number {
    return this.points[index];
  }

  get size(): number {
    return this.points.length;
  }
}

interface ConsistentHashOptions {
  vnodes?: number;       // virtual nodes per physical node
  hashFn?: HashFunction; // pluggable hash function
}

export class ConsistentHashRing {
  private readonly vnodes: number;
  private readonly hashFn: HashFunction;
  private readonly ring = new SortedRing();
  private readonly pointToNode = new Map<number, string>();
  private readonly nodeToPoints = new Map<string, number[]>();

  constructor(options: ConsistentHashOptions = {}) {
    this.vnodes = options.vnodes ?? 150;
    this.hashFn = options.hashFn ?? defaultHash;
  }

  addNode(nodeId: string): void {
    if (this.nodeToPoints.has(nodeId)) return;

    const points: number[] = [];
    for (let i = 0; i < this.vnodes; i++) {
      const point = this.hashFn(`${nodeId}#${i}`);
      this.ring.insert(point);
      this.pointToNode.set(point, nodeId);
      points.push(point);
    }
    this.nodeToPoints.set(nodeId, points);
  }

  removeNode(nodeId: string): void {
    const points = this.nodeToPoints.get(nodeId);
    if (!points) return;

    for (const point of points) {
      this.ring.remove(point);
      this.pointToNode.delete(point);
    }
    this.nodeToPoints.delete(nodeId);
  }

  getNode(key: string): string | null {
    if (this.ring.size === 0) return null;
    const hash = this.hashFn(key);
    const idx = this.ring.successor(hash);
    const point = this.ring.get(idx);
    return this.pointToNode.get(point) ?? null;
  }

  /** Returns N nodes for replication, walking clockwise */
  getNodes(key: string, replicationFactor: number): string[] {
    if (this.ring.size === 0) return [];

    const hash = this.hashFn(key);
    const startIdx = this.ring.successor(hash);
    const nodes = new Set<string>();
    const result: string[] = [];
    let idx = startIdx;

    while (result.length < replicationFactor && result.length < this.nodeToPoints.size) {
      const point = this.ring.get(idx % this.ring.size);
      const node = this.pointToNode.get(point)!;
      if (!nodes.has(node)) {
        nodes.add(node);
        result.push(node);
      }
      idx++;
    }

    return result;
  }

  /** Diagnostic: distribution stats across physical nodes */
  getDistribution(): Record<string, number> {
    const counts: Record<string, number> = {};
    for (const [, node] of this.pointToNode) {
      counts[node] = (counts[node] ?? 0) + 1;
    }
    return counts;
  }
}

Usage:

const ring = new ConsistentHashRing({ vnodes: 150 });

ring.addNode("cache-1");
ring.addNode("cache-2");
ring.addNode("cache-3");

console.log(ring.getNode("user:12345"));     // one of "cache-1" | "cache-2" | "cache-3"
console.log(ring.getNodes("user:12345", 3)); // all three nodes, ordered by the ring walk

// Scale out — only ~25% of keys remapped
ring.addNode("cache-4");

// Weighted nodes for heterogeneous hardware
const weighted = new ConsistentHashRing({ vnodes: 100 }); // base weight
// Then add a "heavy" node with more vnodes manually or via a multiplier parameter

Replication on the Ring

Most distributed systems using consistent hashing don’t store data on just one node. They use a replication factor R, meaning data is stored on R consecutive nodes clockwise from the key’s hash position.

Replication factor = 3, Key K hashes between Node A and Node B:

Ring: ... → Node A → Node B → Node C → Node D → ...

Primary:  Node B  (first node clockwise from hash(K))
Replica1: Node C  (second node clockwise)
Replica2: Node D  (third node clockwise)

If Node B fails, Node C and D still serve K.
If Node B and C fail, Node D alone serves K (degraded mode).

This is exactly how Apache Cassandra replicates. The replication factor is configured per keyspace, and the ring walk for replicas can skip nodes that belong to the same rack or datacenter (Cassandra's rack-aware NetworkTopologyStrategy).


How DynamoDB Uses Consistent Hashing

Amazon’s Dynamo paper (2007) — the ancestor of DynamoDB — is the canonical reference for consistent hashing in production.

Key Dynamo design decisions:

Preference list: For a given key, Dynamo selects the top N healthy nodes from the ring walk. This isn’t just “first N nodes” — it skips failed nodes and maintains a longer list to guarantee N replicas.

Sloppy quorum + hinted handoff: If a replica node is down, another node temporarily takes the write and stores a “hint” — metadata saying “this data belongs to node X when it comes back”. This enables always-on writes at the cost of potential temporary inconsistency.

Merkle trees for anti-entropy: Each node maintains a Merkle tree of its key ranges. Nodes compare tree roots to detect divergence, then exchange only the differing subtrees to synchronize — far more efficient than comparing all data.
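The root-comparison idea can be sketched in a few lines — this is an illustrative hashing scheme, not Dynamo's actual tree layout or wire format:

```typescript
import { createHash } from "crypto";

const sha = (s: string) => createHash("sha256").update(s).digest("hex");

// Leaves = hashes of key/value pairs; each parent hashes its two children.
// Replicas compare roots first and only drill into subtrees that differ.
function merkleRoot(leaves: string[]): string {
  if (leaves.length === 0) return sha("");
  let level = leaves;
  while (level.length > 1) {
    const next: string[] = [];
    for (let i = 0; i < level.length; i += 2) {
      next.push(sha(level[i] + (level[i + 1] ?? level[i]))); // odd leaf pairs with itself
    }
    level = next;
  }
  return level[0];
}

const replicaA = ["k1=v1", "k2=v2", "k3=v3"].map(sha);
const replicaB = ["k1=v1", "k2=v2", "k3=DIVERGED"].map(sha);
console.log(merkleRoot(replicaA) === merkleRoot(replicaB)); // false → sync needed
```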

Vector clocks for conflict resolution: Multiple concurrent writes can produce conflicting versions. Dynamo uses vector clocks to track causality and surfaces conflicts to the application layer for resolution (last-write-wins, or custom merge logic).
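The causality check at the heart of vector clocks is small enough to sketch (node names here are hypothetical):

```typescript
type VectorClock = Record<string, number>;

// a dominates b iff a >= b on every node's counter and > on at least one
function dominates(a: VectorClock, b: VectorClock): boolean {
  let strictlyGreater = false;
  for (const node of [...Object.keys(a), ...Object.keys(b)]) {
    const av = a[node] ?? 0;
    const bv = b[node] ?? 0;
    if (av < bv) return false;
    if (av > bv) strictlyGreater = true;
  }
  return strictlyGreater;
}

// Concurrent writes: neither clock dominates → surface the conflict to the app
function concurrent(a: VectorClock, b: VectorClock): boolean {
  return !dominates(a, b) && !dominates(b, a);
}

console.log(dominates({ n1: 2, n2: 1 }, { n1: 1, n2: 1 })); // true — causal descendant
console.log(concurrent({ n1: 2 }, { n2: 1 }));              // true — conflicting versions
```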


How Cassandra Uses Consistent Hashing

Cassandra uses consistent hashing for its partitioner — the component that maps partition keys to ring positions.

-- Cassandra partition key determines ring placement
CREATE TABLE orders (
    user_id   UUID,
    order_id  UUID,
    amount    DECIMAL,
    PRIMARY KEY (user_id, order_id)  -- user_id is the partition key
);

-- All orders for user_id hash to the same ring position
-- Good: co-located for efficient queries
-- Bad: hot partition if one user has millions of orders

Murmur3Partitioner (default since Cassandra 1.2): Uses MurmurHash3 to distribute partition keys. Unlike the older RandomPartitioner (MD5-based), it’s not cryptographically secure but is ~3× faster.

Token assignment: With vnodes enabled (num_tokens in cassandra.yaml — 256 was the long-standing default, reduced to 16 in Cassandra 4.0), each node is assigned that many tokens. When a node joins the cluster, it picks tokens and takes ownership of the corresponding ranges from existing nodes.

# See token distribution
nodetool ring

# Check load per node (should be roughly equal)
nodetool status

# Manual token range inspection
nodetool describering <keyspace>

Redis Cluster Sharding

Redis Cluster uses a variant of consistent hashing based on hash slots — a fixed partition of 16,384 slots.

Redis Cluster: 16,384 hash slots divided across nodes

Node A: slots 0     - 5460
Node B: slots 5461  - 10922
Node C: slots 10923 - 16383

Key mapping: HASH_SLOT = CRC16(key) mod 16384
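The slot computation itself is small. A sketch of CRC-16/XMODEM — the CRC variant Redis Cluster uses — written bitwise instead of with the usual lookup table, and restricted to ASCII keys for simplicity (Redis hashes raw bytes):

```typescript
// CRC-16/XMODEM: polynomial 0x1021, initial value 0, MSB-first
function crc16(data: string): number {
  let crc = 0;
  for (let i = 0; i < data.length; i++) {
    crc ^= data.charCodeAt(i) << 8; // ASCII only in this sketch
    for (let bit = 0; bit < 8; bit++) {
      crc = crc & 0x8000 ? ((crc << 1) ^ 0x1021) & 0xffff : (crc << 1) & 0xffff;
    }
  }
  return crc;
}

const hashSlot = (key: string) => crc16(key) % 16384;

console.log(crc16("123456789").toString(16)); // "31c3" — the standard XMODEM check value
```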

This is deterministic and static, not a full consistent hash ring — but the slot abstraction achieves similar goals. When a node is added, slots are manually migrated:

# Redis Cluster node operations
redis-cli --cluster add-node new_host:6379 existing_host:6379
redis-cli --cluster reshard existing_host:6379
redis-cli --cluster rebalance existing_host:6379 --cluster-use-empty-masters

Hash tags: {user_id}.orders and {user_id}.profile — the key inside {} determines the slot, so both keys always land on the same node. Critical for multi-key operations and MULTI/EXEC transactions.
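The tag-extraction rule can be sketched directly — a simplified version of the rule in the Redis Cluster spec (first `{`, first `}` after it, tag must be non-empty, otherwise the whole key is hashed):

```typescript
// Returns the substring that determines the hash slot for a key
function hashTag(key: string): string {
  const open = key.indexOf("{");
  if (open === -1) return key;
  const close = key.indexOf("}", open + 1);
  if (close === -1 || close === open + 1) return key; // no "}" or empty {} tag
  return key.slice(open + 1, close);
}

hashTag("{user42}.orders");  // "user42"
hashTag("{user42}.profile"); // "user42" — lands on the same slot as .orders
hashTag("plain-key");        // "plain-key" — whole key is hashed
```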


Consistent Hashing vs. Rendezvous Hashing

Rendezvous hashing (also called Highest Random Weight, HRW) takes a different approach:

import { createHash } from "crypto";

// 32-bit hash of a string (same MD5-prefix scheme as defaultHash earlier)
function hash(input: string): number {
  return parseInt(createHash("md5").update(input).digest("hex").slice(0, 8), 16);
}

function rendezvousGetNode(key: string, nodes: string[]): string {
  let maxScore = -Infinity;
  let winner = nodes[0];

  for (const node of nodes) {
    // Score = hash of the (key, node) pair; the highest score wins
    const score = hash(`${key}:${node}`);
    if (score > maxScore) {
      maxScore = score;
      winner = node;
    }
  }

  return winner;
}

Comparison:

Property                    Consistent Hashing            Rendezvous Hashing
Lookup complexity           O(log N) with sorted ring     O(N) — iterate all nodes
Remapping on node change    ~K/N keys                     ~K/N keys
Implementation complexity   High (ring + vnodes)          Very low
Weighted nodes              Via vnode count               Via score weighting
Memory usage                O(N × vnodes)                 O(N)
Best for                    Large clusters (100+ nodes)   Small-medium clusters

Rendezvous hashing shows up in load balancers and CDN edge-selection algorithms where the node set stays small. For large clusters, the O(N) cost of scoring every node on every lookup becomes prohibitive.


Jump Consistent Hash

Google’s jump consistent hash (Lamping & Veach, 2014) achieves O(log N) expected-time lookups with zero memory overhead — no ring, no sorted structure:

function jumpHash(key: bigint, numBuckets: number): number {
  let b = -1n;
  let j = 0n;

  while (j < BigInt(numBuckets)) {
    b = j;
    key = key * 2862933555777941757n + 1n;
    // Simulate 64-bit overflow with BigInt masking
    key = key & 0xFFFFFFFFFFFFFFFFn;
    j = BigInt(Math.floor(
      (Number(b) + 1) * (Number(1n << 31n) / Number((key >> 33n) + 1n))
    ));
  }

  return Number(b);
}

Properties:

  • Perfectly uniform distribution (mathematical proof)
  • O(log N) iterations on average (not O(N))
  • Zero memory — no data structure to maintain
  • Limitation: Node removal must be from the end. You can’t remove an arbitrary node — buckets must be numbered 0..N-1 with removal only at N-1.

This makes jump consistent hash ideal for stateless sharding (e.g., batch job assignment) but unsuitable for distributed caches where arbitrary nodes fail.


Data Migration When Nodes Join and Leave

Understanding the data movement mechanics is essential for production operations:

Node joining:

Before: Ring has nodes A, B, C
        Node B owns range (A → B]

After: Node D joins at position between A and B
        Node D now owns range (A → D]   ← D receives this data from B
        Node B still owns range (D → B]  ← B retains this data

Migration: B must stream keys in range (A → D] to D
           Other nodes are unaffected

Cassandra streaming on bootstrap:

# When a new node joins:
# 1. New node announces itself via gossip
# 2. Existing nodes stream data for the new node's token ranges
# 3. New node becomes available for reads/writes after streaming

# Monitor bootstrap progress
nodetool netstats
nodetool tpstats | grep Streaming

# If bootstrap fails, you can resume
nodetool bootstrap resume

The double-write window: During data migration, both old and new owners may receive writes. Systems handle this with:

  1. Versioned writes: New owner accepts all writes; old owner forwards reads to new owner
  2. Two-phase migration: Write to both simultaneously, then cut over
  3. Range tokens: The new node owns its range immediately; old owner forwards for a grace period
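Strategy 2 can be sketched as a thin wrapper over two stores — `Store` here is a stand-in interface for any key-value client, not a real API:

```typescript
interface Store {
  set(k: string, v: string): void;
  get(k: string): string | undefined;
}

// Double-write window: writes go to both owners; reads prefer the new owner
// and fall back to the old one for keys not yet streamed over
class MigratingRange {
  constructor(private oldOwner: Store, private newOwner: Store) {}

  set(k: string, v: string): void {
    this.newOwner.set(k, v);
    this.oldOwner.set(k, v); // dropped once the cutover completes
  }

  get(k: string): string | undefined {
    return this.newOwner.get(k) ?? this.oldOwner.get(k);
  }
}
```

In a test you can back both stores with in-memory Maps; in production they would be the actual shard clients.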

Interview: “How Would You Shard a Key-Value Store?”

This is the canonical system design interview question that consistent hashing was invented to answer. A complete answer:

Step 1: Define requirements

  • Read/write ratio (cache = read-heavy, session store = write-heavy)
  • Consistency requirements (strong, eventual, causal?)
  • Scale targets (number of keys, operations/sec, data size)
  • Availability requirements (can we tolerate node loss?)

Step 2: Choose sharding strategy

  • Consistent hashing for a distributed cache or database
  • Range sharding for time-series data (query efficiency)
  • Directory-based sharding for maximum flexibility (with coordinator overhead)

Step 3: Design the ring

  • 150–256 vnodes per physical node
  • MD5 or MurmurHash3 as hash function
  • Replication factor based on availability requirements (typically 3)

Step 4: Handle failure

  • Gossip protocol for failure detection
  • Sloppy quorum for availability under failure
  • Hinted handoff for temporarily unavailable nodes

Step 5: Handle rebalancing

  • New node steals ranges from neighbors
  • Streaming protocol with backpressure
  • Double-write window during migration

Step 6: Client-side vs. server-side routing

  • Client-side: client has ring state, routes directly (Cassandra driver model)
  • Server-side: any node accepts requests and proxies to correct owner (Dynamo model)
  • Proxy-based: dedicated routing tier (Redis Cluster proxy)

Real-World Failure Scenarios

Scenario 1: Cascading hot spot

Node B fails and its keys route to Node C, its clockwise successor. Node C now handles roughly 2× its normal load. If Node C is already near capacity, it may fail in turn, cascading its keys onto Node D — a classic cascading failure.

Mitigation: Over-provision cluster to 60–70% utilization maximum. Use circuit breakers at the application layer. Consider load-based routing (reject overloaded nodes).

Scenario 2: Network partition splits the ring

Nodes on each side of the partition keep accepting writes (under a sloppy quorum). When the partition heals, conflicting versions must be reconciled.

Mitigation: Vector clocks, CRDTs (Conflict-free Replicated Data Types), or last-write-wins with monotonic timestamps.

Scenario 3: Thundering herd on cache miss

Node C fails. All keys that lived on C miss the cache simultaneously and hit the database. A single node failure can generate millions of cache misses per second.

Mitigation: Request coalescing (only one request per key goes to the database while others wait), jittered retry, circuit breakers on the database.
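Request coalescing fits in a few lines — a sketch where `fetchFromDb` is an app-provided loader, not a real client API:

```typescript
// Concurrent cache misses for the same key share a single in-flight
// database fetch instead of stampeding the origin
const inflight = new Map<string, Promise<string>>();

async function coalescedGet(
  key: string,
  fetchFromDb: (k: string) => Promise<string>,
): Promise<string> {
  const pending = inflight.get(key);
  if (pending) return pending; // piggyback on the in-flight fetch

  const p = fetchFromDb(key).finally(() => inflight.delete(key));
  inflight.set(key, p);
  return p;
}
```

In production this sits in front of the cache-miss path, combined with the jittered retries and circuit breakers mentioned above.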


Interview Checklist

Before walking out of a consistent hashing interview, verify you have covered:

  • Problem statement: Explain why modulo hashing fails on scale-out
  • Ring mechanics: Hash space, clockwise lookup, node placement
  • Virtual nodes: Why they’re necessary, typical counts (150–256)
  • Replication factor: Clockwise walk for replicas, rack awareness
  • Node addition/removal: Only K/N keys move, migration protocol
  • Hotspots: Cascading failures, mitigation strategies
  • Client routing: Client-side vs. server-side, stale routing table
  • Real systems: Cassandra, DynamoDB, Redis Cluster (hash slots)
  • Alternatives: Rendezvous hashing (O(N)), jump consistent hash (O(log N), ordered removal only)
  • Failure scenarios: Partition, cascading overload, thundering herd

Key numbers to know:

  • Cassandra default vnodes: 256 per node
  • Redis Cluster hash slots: 16,384
  • Keys remapped on modulo resize: ~N/(N+1) ≈ 80% for N=4→5
  • Keys remapped on consistent hash resize: ~1/(N+1) ≈ 20% for N=4→5
  • Typical replication factor: 3 (survives 2 node failures)

Kleppmann’s DDIA reference: Chapter 6 (Partitioning) covers consistent hashing, partition rebalancing, and request routing in depth. The treatment of “unfair splits” with random node placement and the vnode solution maps exactly to what you need for interviews.