System Design · Topic 16 of 16

Design: Chat System


A chat system is one of the most technically demanding designs because it combines real-time bidirectional communication, durable storage, multi-server state management, and offline delivery — all with the expectation that messages arrive in order and are never lost. WhatsApp famously served hundreds of millions of users with a team of a few dozen engineers. The architecture is elegant. This walkthrough builds it from scratch.


Step 1 — Clarify requirements

Functional requirements

  • 1-1 chat: Real-time messaging between two users.
  • Group chat: Up to 500 members in a group. Messages delivered to all online members in real-time.
  • Online / offline indicators: Show whether a user is currently online and their last seen timestamp.
  • Message history: Retrieve chat history, paginated (last N messages).
  • Delivery receipts: Three states — sent ✓, delivered ✓✓, read ✓✓ (blue).
  • Media sharing: Images, videos, documents (up to 100 MB).
  • Typing indicators: Show when the other person is typing (ephemeral, no storage needed).

Non-functional requirements

  • Scale: 50M DAU. Each user sends/receives ~40 messages/day.
  • Message volume: 50M × 40 = 2B messages/day → ~23,000 messages/sec.
  • Delivery latency: p99 message delivery < 100ms for online users.
  • Durability: Messages must never be lost, even if the recipient is offline for 30 days.
  • Message ordering: Within a conversation, messages must be delivered in send order.
  • Availability: 99.99% uptime.
  • Security: Messages should be end-to-end encrypted (E2EE) — we’ll discuss the Signal Protocol.

Out of scope

Voice/video calls (VoIP — a separate WebRTC system), payments, bots, channels (broadcast-only, like Telegram Channels).


Step 2 — Capacity estimation

Message volume

23,000 messages/sec average
Peak (evening): ~3× → 69,000 messages/sec

Per message size (stored):
  message_id:      16 bytes (UUID or Snowflake)
  conversation_id: 16 bytes
  sender_id:       8 bytes
  content:         200 bytes (avg text message)
  sent_at:         8 bytes
  ─────────────────────────
  Total:           ~248 bytes/message

23,000 msg/sec × 86,400 sec × 365 days × 248 bytes
≈ 185 TB/year in message storage
→ Must use a storage engine optimized for high write throughput and range scans

Connection count

50M DAU. Assume 40% are online at peak → 20M concurrent WebSocket connections.
Each WS connection: ~65 KB memory overhead (kernel TCP buffer)
20M × 65 KB = 1.3 TB total memory for connections

→ Cannot fit on one server. Need horizontal scaling with a connection registry.

Media storage

Assume 5% of messages contain media, avg 500 KB compressed
23,000 × 0.05 × 500 KB = 575 MB/sec media upload bandwidth
Daily media storage: 575 MB/sec × 86,400 ≈ 50 TB/day
→ Object storage (S3) is mandatory
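The arithmetic above can be sanity-checked in a few lines; every input is an assumption stated in this section (the text rounds some results slightly differently):

```typescript
// Back-of-envelope check of the figures above. Inputs are this section's assumptions.
const DAU = 50_000_000;
const msgsPerUserPerDay = 40;
const bytesPerMessage = 248;
const mediaFraction = 0.05;
const mediaBytesAvg = 500_000;           // ~500 KB per media message

const msgsPerDay = DAU * msgsPerUserPerDay;               // 2 billion/day
const msgsPerSec = msgsPerDay / 86_400;                   // ~23K/sec average
const storageTBPerYear = msgsPerDay * 365 * bytesPerMessage / 1e12;
const mediaMBPerSec = msgsPerSec * mediaFraction * mediaBytesAvg / 1e6;

console.log(Math.round(msgsPerSec));        // 23148
console.log(Math.round(storageTBPerYear));  // 181 (the text rounds to ~185)
console.log(Math.round(mediaMBPerSec));     // 579 (the text uses 23,000 flat, giving 575)
```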

Step 3 — Real-time communication options

Option A: Short polling

Client asks the server every N seconds: “Any new messages?”

Client: GET /messages?since=last_timestamp    (every 2 seconds)
Server: [empty array]   (most of the time)

Problems: Massive wasted requests (~95% return empty). 50M users × 30 polls/min = 1.5B requests/min. Your servers spend most time answering “no, nothing new.” Not viable.

Option B: Long polling

Client sends a request that the server holds open until a message arrives (or timeout, e.g., 30 seconds).

Client: GET /messages?since=last_timestamp  (server holds connection)
Server: [holds connection for up to 30s]
Server: [sends response when a message arrives or timeout]
Client: [immediately re-establishes long poll]

Better than short polling (fewer wasted responses), but:

  • HTTP is half-duplex — you can’t push messages while the client is mid-request.
  • Multiple long-poll connections per user (different browser tabs) create race conditions.
  • HTTP/1.1 header overhead on every reconnect.
  • Not suitable for bidirectional flow (e.g., sending while also receiving).

Use case: Long polling is acceptable for notification-style systems (like GitHub PR notifications). Too clunky for chat.

Option C: Server-Sent Events (SSE)

Server holds a persistent HTTP connection and pushes events to the client unidirectionally.

GET /events
Content-Type: text/event-stream

data: {"type":"message","from":"alice","text":"hey"}
data: {"type":"typing","from":"alice"}

Better: True server push, lower overhead than long polling, HTTP/2 multiplexing. Problem: Unidirectional — client must use regular HTTP requests to send messages. Creates an asymmetric architecture. Complex to manage when the client needs to send and receive simultaneously.

Option D: WebSockets (chosen)

Full-duplex persistent TCP connection over HTTP upgrade.

Client → Server: HTTP Upgrade to WebSocket (handshake)
         ─────────────────────────────────────────────
         Persistent bidirectional channel established
         ─────────────────────────────────────────────
Client → Server: {type: "send_message", to: "bob", text: "hey"}
Server → Client: {type: "new_message", from: "alice", text: "hey"}
Server → Client: {type: "delivery_receipt", messageId: "abc", status: "delivered"}

Advantages:

  • True bidirectional: send and receive on the same connection.
  • Low overhead after handshake: no HTTP headers per message (just a few bytes framing).
  • Low latency: server can push instantly without waiting for a client poll.
  • Natively supported by all browsers and mobile OSes.

Verdict: WebSockets are the right choice for chat. The entire industry uses persistent socket push — WhatsApp (a modified XMPP protocol, with WebSockets for WhatsApp Web), Slack (WebSockets), Discord (WebSockets).
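The frames in the exchange above can be modeled as a tagged union so the client handles every frame type exhaustively. A sketch (the field names are this walkthrough's conventions, not a standard):

```typescript
// Wire frames as a TypeScript tagged union (field names are illustrative).
type Frame =
  | { type: 'send_message'; to: string; text: string }
  | { type: 'new_message'; from: string; text: string }
  | { type: 'delivery_receipt'; messageId: string; status: 'delivered' | 'read' };

// Decode one WebSocket text frame, rejecting unknown types early.
function decodeFrame(raw: string): Frame {
  const f = JSON.parse(raw) as Frame;
  if (f.type !== 'send_message' && f.type !== 'new_message' && f.type !== 'delivery_receipt') {
    throw new Error('unknown frame type');
  }
  return f;
}
```

Because the framing overhead is just a few bytes, the JSON payload dominates each message; production systems often switch to a binary encoding, but JSON keeps the design discussion readable.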


Step 4 — WebSocket server architecture

The statefulness problem

A WebSocket connection is stateful — it’s tied to a specific server process. Unlike HTTP where any server can handle any request, a WebSocket connection lives on exactly one server for its entire lifetime.

User Alice  ──WS──►  WS Server 1
User Bob    ──WS──►  WS Server 2
User Carol  ──WS──►  WS Server 1

When Alice sends a message to Bob:

  • Alice’s message arrives at WS Server 1.
  • Bob is connected to WS Server 2.
  • WS Server 1 cannot directly write to Bob’s connection on WS Server 2.

This is the core architectural challenge of horizontal scaling for WebSocket systems.

Horizontal scaling with pub/sub

The solution: decouple the routing from the WebSocket servers using a pub/sub layer.

Alice (WS Server 1) sends message to Bob
        ↓
WS Server 1 publishes to Redis pub/sub channel "user:bob:inbox"
        ↓
WS Server 2 is subscribed to "user:bob:inbox"
(it knows Bob is connected because of the presence service)
        ↓
WS Server 2 pushes the message over Bob's WebSocket connection

Each WebSocket server subscribes to the inbox channels of all users currently connected to it. When a user connects, the server subscribes to user:{userId}:inbox. When they disconnect, it unsubscribes.
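A sketch of that subscribe/unsubscribe bookkeeping, with a reference count so a user connected from several devices to the same server keeps exactly one subscription (`PubSub` is a stand-in for a Redis subscriber client):

```typescript
// Subscribe-on-connect bookkeeping. PubSub stands in for a Redis
// subscriber client (assumption).
interface PubSub {
  subscribe(channel: string): Promise<void>;
  unsubscribe(channel: string): Promise<void>;
}

class InboxRouter {
  private connections = new Map<string, number>(); // userId -> connection count

  constructor(private sub: PubSub) {}

  async onConnect(userId: string): Promise<void> {
    const n = (this.connections.get(userId) ?? 0) + 1;
    this.connections.set(userId, n);
    if (n === 1) await this.sub.subscribe(`user:${userId}:inbox`); // first device
  }

  async onDisconnect(userId: string): Promise<void> {
    const n = (this.connections.get(userId) ?? 1) - 1;
    if (n <= 0) {
      this.connections.delete(userId);
      await this.sub.unsubscribe(`user:${userId}:inbox`); // last device gone
    } else {
      this.connections.set(userId, n);
    }
  }
}
```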


Step 5 — Service registry / presence service

To route a message to the correct WS server, you need a registry that maps userId → wsServerId.

Redis-based presence registry

Key:   presence:{userId}
Value: {
         serverId: "ws-server-7",
         connectedAt: 1705312000,
         lastSeen: 1705312500
       }
TTL:   30 seconds (refreshed by heartbeat)

Heartbeat mechanism: Each WS server sends a heartbeat every 10 seconds for each connected user:

SETEX presence:{userId} 30 {serverId: "ws-server-7", lastSeen: now()}

If a user disconnects unexpectedly (network drop), the TTL expires after 30 seconds. This is when the user’s status flips to “offline” and their “last seen” timestamp is set.
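The heartbeat loop might look like this sketch; `RedisLike` is a stand-in for a Redis client, and the 10 s / 30 s constants come from the text above:

```typescript
// Presence heartbeat sketch. RedisLike stands in for a Redis client (assumption).
interface RedisLike {
  setex(key: string, ttlSec: number, value: string): Promise<void>;
}

const PRESENCE_TTL_S = 30;   // 3 missed heartbeats and the user reads as offline

// Refresh the presence key for every user connected to this server.
async function refreshPresence(
  redis: RedisLike, serverId: string, userIds: string[]
): Promise<void> {
  for (const userId of userIds) {
    await redis.setex(
      `presence:${userId}`,
      PRESENCE_TTL_S,
      JSON.stringify({ serverId, lastSeen: Date.now() })
    );
  }
}

// Run every 10 seconds over the server's current connection set.
function startHeartbeat(
  redis: RedisLike, serverId: string, connectedUserIds: () => string[]
): ReturnType<typeof setInterval> {
  return setInterval(() => refreshPresence(redis, serverId, connectedUserIds()), 10_000);
}
```

At 20M connections a real implementation would pipeline these SETEX calls rather than issue them one at a time.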

Presence query

When User A opens a chat with User B:

async function isUserOnline(userId: string): Promise<boolean> {
  const presenceData = await redis.get(`presence:${userId}`);
  return presenceData !== null;
}

async function getLastSeen(userId: string): Promise<Date | null> {
  // If online, return null (they're online now)
  // If offline, return from user profile DB
  const presence = await redis.get(`presence:${userId}`);
  if (presence) return null;  // Currently online

  const user = await userDB.get(userId);
  return user.lastSeenAt;
}

Step 6 — Message flow end-to-end

Sending a message (1-1 chat)

Step 1: Alice's client sends over WebSocket:
  {
    "type": "send_message",
    "clientMessageId": "client_abc123",  // idempotency key
    "recipientId": "bob_456",
    "content": "Hey Bob, are you free for lunch?",
    "sentAt": 1705312000123
  }

Step 2: WS Server 1 (Alice's server) processes:
  a. Validate (auth, rate limit, content policy)
  b. Persist message to Cassandra (durable storage)
  c. Send ACK back to Alice: {type: "sent_ack", clientMessageId: "client_abc123",
                               serverMessageId: "msg_snowflake_789"}
  → Alice's client shows single ✓ (sent)

Step 3: WS Server 1 routes to Bob:
  a. Look up Bob's presence: redis.get("presence:bob_456")
  b. If online: publish to Redis pub/sub "user:bob_456:inbox"
  c. WS Server 2 (Bob's server) receives from pub/sub, pushes to Bob's WS
  → Bob receives message, client sends delivery ACK back to server
  → WS Server 2 publishes delivery event to "user:alice_123:inbox"
  → Alice's client receives delivery ACK → shows ✓✓ (delivered)

Step 4: If Bob is offline:
  a. WS Server 1 cannot find Bob in presence registry
  b. Stores message in offline message queue
  c. Sends push notification via FCM/APNs
  → When Bob comes back online, his WS server fetches pending messages

In code, a sketch of this send path (`messageStore`, `offlineQueue`, and the other helpers are assumed services):

async function handleSendMessage(
  ws: WebSocket,
  payload: SendMessagePayload,
  senderId: string
): Promise<void> {
  // Persist first — never lose a message
  const message = await messageStore.save({
    conversationId: getConversationId(senderId, payload.recipientId),
    senderId,
    content: payload.content,
    sentAt: new Date(payload.sentAt),
    status: 'sent'
  });

  // ACK to sender
  ws.send(JSON.stringify({
    type: 'sent_ack',
    clientMessageId: payload.clientMessageId,
    serverMessageId: message.id
  }));

  const recipientPresence = await redis.get(`presence:${payload.recipientId}`);
  if (recipientPresence) {
    // Recipient is online — route via pub/sub
    await redis.publish(`user:${payload.recipientId}:inbox`, JSON.stringify({
      type: 'new_message',
      message
    }));
  } else {
    // Recipient is offline — queue for delivery + send push notification
    await offlineQueue.enqueue(payload.recipientId, message);
    await pushNotificationService.send(payload.recipientId, {
      title: 'New message',
      body: payload.content.slice(0, 100)
    });
  }
}

Step 7 — Message storage: why not MySQL

The write amplification problem with MySQL

At 23,000 messages/sec, a MySQL messages table with a composite B-tree index on (conversation_id, sent_at) faces:

  • Write amplification: Every insert updates the B-tree index. At high write rates, this causes severe I/O amplification.
  • Hotspot rows: Active conversations cause row-level lock contention.
  • Range scan cost: Fetching the last 50 messages in a conversation requires a B-tree traversal that gets slower as the table grows.
  • Sharding complexity: MySQL sharding is manual and painful. You’d need to shard by conversation_id and manage the complexity yourself.

Why HBase / Cassandra are ideal for chat

Wide-column stores (HBase, Cassandra, ScyllaDB) are designed for exactly this workload:

  • Log-structured merge tree (LSM tree): Writes go to an in-memory buffer (memtable) and are flushed to immutable SSTable files on disk. No random writes. Write throughput is ~10× better than B-tree stores.
  • Natural partitioning: Data is partitioned by a partition key. All rows with the same partition key are co-located. For chat: conversation_id is the partition key — all messages in a conversation are on the same node.
  • Efficient range scans: Within a partition, rows are sorted by clustering key (message_id). Fetching the last 50 messages is a reverse range scan on a sorted structure — O(1) seek + O(N) scan.
  • Horizontal scaling: Cassandra has no single point of failure. Add nodes and data rebalances automatically.
  • High write throughput: A 3-node Cassandra cluster handles 100K+ writes/sec easily.

Step 8 — Message schema and partition key design

-- Cassandra DDL (CQL)
CREATE TABLE messages (
  conversation_id  UUID,
  message_id       BIGINT,     -- Snowflake ID (time-sortable)
  sender_id        BIGINT,
  content          TEXT,
  media_key        TEXT,       -- S3 key if media message
  message_type     TINYINT,    -- 0=text, 1=image, 2=video, 3=file
  status           TINYINT,    -- 0=sent, 1=delivered, 2=read
  sent_at          TIMESTAMP,
  PRIMARY KEY (conversation_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC)
  AND COMPACTION = {'class': 'TimeWindowCompactionStrategy',
                    'compaction_window_unit': 'DAYS',
                    'compaction_window_size': 1};

Why message_id as a Snowflake ID (not UUID)?

  • Snowflake IDs are time-sortable (41-bit timestamp prefix).
  • CLUSTERING ORDER BY (message_id DESC) means the most recent messages are physically first in the SSTable.
  • Fetching the last 50 messages is a sequential read from the beginning of the partition — maximally efficient.
  • A random UUID would interleave new messages arbitrarily within the partition’s sort order, so the newest 50 messages would no longer be contiguous — the “last N” read becomes a scattered lookup instead of a sequential scan.
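A minimal sketch of that ID layout (the epoch constant is illustrative, borrowed from Twitter's original Snowflake):

```typescript
// Snowflake layout: 41-bit ms timestamp | 10-bit worker id | 12-bit sequence.
const EPOCH_MS = 1_288_834_974_657n;  // illustrative custom epoch

function snowflake(tsMs: bigint, workerId: bigint, seq: bigint): bigint {
  return ((tsMs - EPOCH_MS) << 22n) | (workerId << 12n) | seq;
}

function timestampOf(id: bigint): bigint {
  return (id >> 22n) + EPOCH_MS;  // time-sortable: the high bits are the timestamp
}
```

Because the timestamp occupies the high bits, numeric order on IDs is (per worker) send order, which is exactly what the `CLUSTERING ORDER BY` clause exploits.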

Why TimeWindowCompactionStrategy?

Chat messages are almost always read within a recent time window (last 7 days). TWCS groups SSTables by time window and compacts within windows. Old messages (30+ days ago) stop being compacted, saving CPU. This matches the access pattern: old messages are rarely accessed.

Partition size consideration: A very active conversation (e.g., 10,000-member Slack channel) could generate millions of rows per partition. Cassandra partitions should be <100MB. Solution: bucket by time:

PRIMARY KEY ((conversation_id, bucket), message_id)
-- bucket = YYYYMMDD (one partition per conversation per day)
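With bucketing, reading "the last 50 messages" walks buckets newest-first until the limit is filled. A sketch, where `fetchBucket` stands in for a per-partition CQL query:

```typescript
// Page backwards through daily buckets until `limit` rows are collected.
// fetchBucket stands in for: SELECT ... WHERE conversation_id=? AND bucket=? LIMIT ?
function bucketOf(d: Date): string {
  return d.toISOString().slice(0, 10).replace(/-/g, '');  // YYYYMMDD
}

async function lastMessages<T>(
  conversationId: string,
  limit: number,
  fetchBucket: (conv: string, bucket: string, n: number) => Promise<T[]>,
  maxDaysBack = 30
): Promise<T[]> {
  const out: T[] = [];
  const day = new Date();
  for (let i = 0; i < maxDaysBack && out.length < limit; i++) {
    out.push(...await fetchBucket(conversationId, bucketOf(day), limit - out.length));
    day.setUTCDate(day.getUTCDate() - 1);  // fall back one bucket
  }
  return out;
}
```

For typical conversations the loop terminates after the first bucket, so the common case stays a single-partition read.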

Step 9 — Offline message delivery

When a user comes back online after being away:

async function handleUserReconnect(userId: string, ws: WebSocket): Promise<void> {
  // 1. Register presence
  await redis.setex(
    `presence:${userId}`, 30,
    JSON.stringify({ serverId: MY_SERVER_ID, lastSeen: Date.now() })
  );

  // 2. Subscribe to their inbox channel
  await redisSub.subscribe(`user:${userId}:inbox`);

  // 3. Fetch pending offline messages
  const pendingMessages = await offlineQueue.dequeue(userId);

  // 4. Deliver each pending message in order
  for (const msg of pendingMessages) {
    ws.send(JSON.stringify({ type: 'new_message', message: msg }));
    // Send delivery receipt back to the sender
    await publishDeliveryReceipt(msg.senderId, msg.id, 'delivered');
  }
}

Offline queue design

Use a dedicated Redis List or a lightweight DB table:

CREATE TABLE offline_messages (
  recipient_id    BIGINT UNSIGNED NOT NULL,
  message_id      BIGINT          NOT NULL,
  conversation_id BINARY(16)      NOT NULL,  -- MySQL has no native UUID type
  queued_at       DATETIME(3)     NOT NULL DEFAULT CURRENT_TIMESTAMP(3),
  PRIMARY KEY (recipient_id, message_id)
);

On reconnect: SELECT * FROM offline_messages WHERE recipient_id = ? ORDER BY message_id ASC LIMIT 1000.

Then delete the rows (they’re now delivered). The actual message content is fetched from Cassandra using message_id — the offline queue is just a list of IDs, not a copy of the messages.
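Putting those steps together (the table and store interfaces are stand-ins for the SQL table above and the Cassandra lookup):

```typescript
// Drain-on-reconnect sketch. OfflineTable maps to the SQL table above;
// MessageStore stands in for Cassandra lookups by ID (both are assumptions).
interface OfflineRow { conversationId: string; messageId: bigint; }
interface OfflineTable {
  pending(recipientId: string, limit: number): Promise<OfflineRow[]>;
  remove(recipientId: string, messageIds: bigint[]): Promise<void>;
}
interface MessageStore {
  getByIds(rows: OfflineRow[]): Promise<object[]>;
}

async function drainOffline(
  recipientId: string,
  table: OfflineTable,
  store: MessageStore,
  deliver: (msg: object) => void
): Promise<number> {
  const rows = await table.pending(recipientId, 1000);  // already sorted by message_id
  if (rows.length === 0) return 0;

  const messages = await store.getByIds(rows);          // hydrate content from Cassandra
  for (const m of messages) deliver(m);                 // deliver in order

  await table.remove(recipientId, rows.map(r => r.messageId));  // delete only after delivery
  return messages.length;
}
```

Deleting only after delivery means a crash mid-drain re-delivers rather than drops; the client deduplicates by message ID.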


Step 10 — Delivery receipts

WhatsApp’s three-state receipt system is a great design to discuss:

State       Icon        Meaning
─────────   ─────────   ─────────────────────────────────────
Sent        ✓  (grey)   Message persisted on server
Delivered   ✓✓ (grey)   Message pushed to recipient’s device
Read        ✓✓ (blue)   Recipient opened the conversation

Implementation

// Message status transitions (the status column in Cassandra):

// 1. On server persistence: the INSERT writes status = 0 (sent);
//    no separate UPDATE is needed.

// 2. On delivery to the recipient's device:
UPDATE messages SET status = 1 WHERE conversation_id = ? AND message_id = ?;
// ...then publish a delivery_receipt to the sender's inbox channel.

// 3. On read (client sends a READ event when the conversation is opened):
//    CQL does not allow range predicates like message_id <= ? in UPDATE,
//    so "mark all prior messages read" cannot be one statement. The common
//    design is a per-user read watermark:
//
//      last_read (conversation_id, user_id, last_read_message_id)
//
//    The sender's client renders ✓✓ (blue) for any of its messages with
//    message_id <= the recipient's watermark.
// ...then publish a read_receipt to the sender's inbox channel.

// Client-side: send a READ receipt when the conversation is opened
function onConversationOpen(conversationId: string, lastSeenMessageId: bigint): void {
  ws.send(JSON.stringify({
    type: 'read_receipt',
    conversationId,
    upToMessageId: lastSeenMessageId  // mark everything up to here as read
  }));
}

Privacy consideration: WhatsApp allows users to disable read receipts. When disabled: the server does not send read receipts to senders, but you still track the read state internally (for your own “last seen” feature). The difference is purely in what gets shared.


Step 11 — Group chat fanout

Group chats (up to 500 members) require fan-out to all online members and offline queuing for offline members.

Group membership schema

-- MySQL (not Cassandra — group membership is relational)
CREATE TABLE group_members (
  group_id    BIGINT UNSIGNED NOT NULL,
  user_id     BIGINT UNSIGNED NOT NULL,
  joined_at   DATETIME(3)     NOT NULL DEFAULT CURRENT_TIMESTAMP(3),
  role        ENUM('member','admin','owner') NOT NULL DEFAULT 'member',
  PRIMARY KEY (group_id, user_id),
  KEY idx_user_groups (user_id)
);

CREATE TABLE groups (
  id          BIGINT UNSIGNED NOT NULL,
  name        VARCHAR(255)    NOT NULL,
  icon_url    VARCHAR(512),
  created_by  BIGINT UNSIGNED NOT NULL,
  created_at  DATETIME(3)     NOT NULL DEFAULT CURRENT_TIMESTAMP(3),
  member_count INT            NOT NULL DEFAULT 1,
  PRIMARY KEY (id)
);

Group message delivery flow

async function deliverGroupMessage(
  message: Message,
  groupId: string
): Promise<void> {
  // Fetch all group members (up to 500)
  const members = await groupMemberDB.query(
    'SELECT user_id FROM group_members WHERE group_id = ?', [groupId]
  );

  // Fan-out to all members concurrently
  await Promise.all(members.map(async ({ user_id: userId }) => {
    if (userId === message.senderId) return;  // Don't echo to sender

    const presence = await redis.get(`presence:${userId}`);
    if (presence) {
      // Online: route via pub/sub
      await redis.publish(`user:${userId}:inbox`, JSON.stringify({
        type: 'new_group_message',
        message
      }));
    } else {
      // Offline: queue + push notification
      await offlineQueue.enqueue(userId, message);
      await pushNotificationService.send(userId, {
        title: 'New message',
        body: message.content.slice(0, 100)
      });
    }
  }));
}

500-member fanout latency: the 500 presence lookups and publishes, pipelined, collapse into a handful of Redis round trips — a few milliseconds end-to-end. Acceptable.

For larger groups (Slack channels with 10,000+ members): use a dedicated async fanout worker. The WS server publishes {type: "group_message", groupId, messageId} to Kafka. A separate fanout service consumes and handles the distribution.

Message storage for group chats

Same Cassandra schema — conversation_id is the group’s conversation ID. All 500 members query the same partition to load history. No duplication.

conversation_id = group_{groupId}  (same partition for all members)

Step 12 — Media sharing

Upload flow:
┌─────────┐   1. Request pre-signed URL   ┌──────────────┐
│ Client  │──────────────────────────────►│ Media Service│
│         │◄──────────────────────────────│              │
│         │   2. Pre-signed S3 PUT URL     └──────────────┘
│         │                                      │
│         │   3. Upload directly to S3           │
│         │──────────────────────────────────────► S3
│         │                                      │
│         │   4. Send message with S3 key        │
│         │──────────────────────────────►  WS Server
└─────────┘                                      │
                                     Lambda (S3 trigger)

                                    Media Processor
                                    - Generate thumbnail
                                    - Compress/resize image
                                    - Transcode video (HLS)
                                    - Virus scan

                                    CDN (CloudFront)
                                    media.bck.ly/{key}

Security: Pre-signed URLs expire in 15 minutes. After expiry, the S3 object is still accessible via CDN (which has its own cache). The pre-signed URL is only for upload — not download. Download goes through CDN.

Media message in Cassandra: Store only the S3 key (not the binary data). The client constructs the full CDN URL: https://media.bck.ly/{mediaKey}.
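Client-side, the upload flow is two requests. A sketch, where the `/media/presign` endpoint and its response shape are assumptions for illustration (the fetch function is injected so the flow is testable):

```typescript
// Media upload sketch. The presign endpoint and response fields are hypothetical.
interface MediaFile { type: string; size: number; }
type FetchLike = (
  url: string,
  init?: { method?: string; headers?: Record<string, string>; body?: unknown }
) => Promise<{ json(): Promise<{ uploadUrl: string; mediaKey: string }> }>;

async function uploadMedia(
  file: MediaFile, authToken: string, fetchFn: FetchLike
): Promise<string> {
  // 1. Ask the media service for a pre-signed PUT URL
  const res = await fetchFn('/media/presign', {
    method: 'POST',
    headers: { Authorization: `Bearer ${authToken}` },
    body: JSON.stringify({ contentType: file.type, size: file.size })
  });
  const { uploadUrl, mediaKey } = await res.json();

  // 2. Upload directly to S3; the bytes never pass through the chat servers
  await fetchFn(uploadUrl, { method: 'PUT', body: file });

  // 3. The S3 key goes into the chat message; download happens via CDN
  return mediaKey;
}
```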


Step 13 — End-to-end encryption (Signal Protocol)

E2EE means the server never sees plaintext message content — only the sender and recipient can decrypt.

Key concepts (not implementation)

Key exchange (Diffie-Hellman):

Alice and Bob each have a public/private key pair. They exchange public keys (the server can see these — they’re public). Each computes a shared secret using their private key and the other’s public key. Crucially: the server, which only sees both public keys, cannot compute the shared secret.

Alice has: (alicePrivate, alicePublic)
Bob has:   (bobPrivate, bobPublic)

Alice computes: sharedSecret = DH(alicePrivate, bobPublic)
Bob computes:   sharedSecret = DH(bobPrivate, alicePublic)
Both get the same sharedSecret, server never sees it.
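That exchange can be run directly with X25519 via Node's built-in crypto module. This is only the raw DH step, not the full X3DH/Double Ratchet handshake the Signal Protocol layers on top:

```typescript
// Raw X25519 Diffie-Hellman with node:crypto, the primitive the Signal
// Protocol builds on.
import { generateKeyPairSync, diffieHellman } from 'node:crypto';

const alice = generateKeyPairSync('x25519');
const bob = generateKeyPairSync('x25519');

// Each side combines its OWN private key with the OTHER's public key.
const aliceShared = diffieHellman({ privateKey: alice.privateKey, publicKey: bob.publicKey });
const bobShared = diffieHellman({ privateKey: bob.privateKey, publicKey: alice.publicKey });

console.log(aliceShared.equals(bobShared));  // true: same 32-byte secret on both sides
```

A server that only ever sees `alice.publicKey` and `bob.publicKey` has no way to derive this buffer.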

Forward secrecy (Double Ratchet Algorithm):

The Signal Protocol uses a “double ratchet” — the encryption key changes for every single message. Even if an attacker captures all your encrypted messages today and later compromises your private key, they cannot decrypt past messages because those message keys are already deleted.

Pre-keys:

What if Bob is offline and Alice wants to send a message? Bob pre-uploads a batch of one-time pre-keys to the server before going offline. Alice uses one of these to compute the shared secret and encrypt the message. When Bob comes back, he uses his private pre-key to decrypt.

What the server stores (with E2EE):

  • Encrypted ciphertext (meaningless without the key)
  • Public keys and pre-keys (public by definition)
  • Metadata: sender, recipient, timestamp, message length

The server handles routing and storage but is cryptographically excluded from reading content.


Step 14 — Complete architecture diagram

                    ┌────────────────────────────────────────────┐
                    │           CLIENTS (iOS/Android/Web)         │
                    └──────────────┬──────────────┬──────────────┘
                                   │  WebSocket   │  HTTP (media upload)
                    ┌──────────────▼──────────────▼──────────────┐
                    │          Load Balancer (L4 — TCP)            │
                    │     (sticky sessions not required:           │
                    │      WS servers are coordinated via Redis)   │
                    └──────┬───────────────────────┬─────────────┘
                           │                       │
           ┌───────────────▼────┐    ┌─────────────▼───────────┐
           │  WebSocket Server 1 │    │  WebSocket Server 2      │
           │  (Alice connected)  │    │  (Bob connected)         │
           │  - Pub/sub handler  │    │  - Pub/sub handler       │
           └──────┬─────────────┘    └─────────────┬───────────┘
                  │                                 │
     ┌────────────▼─────────────────────────────────▼────────────┐
     │                   Redis Cluster                             │
     │  Pub/Sub channels:  user:{userId}:inbox                    │
     │  Presence:          presence:{userId} (TTL 30s)            │
     │  Offline queue:     (list, key=userId)                     │
     └────────────┬────────────────────────────────┬─────────────┘
                  │                                │
     ┌────────────▼──────────┐     ┌───────────────▼────────────┐
     │   Message Service      │     │  Group Service              │
     │   (persist to Cassandra│     │  (group membership MySQL)   │
     │    + route messages)   │     │  (fanout to members)        │
     └────────────┬──────────┘     └────────────────────────────┘

     ┌────────────▼──────────────────────────────────────────────┐
     │              Apache Cassandra Cluster                       │
     │   Keyspace: chat                                           │
     │   Table: messages (partition: conversation_id)             │
     │   3-node cluster, RF=3, quorum reads/writes                │
     └───────────────────────────────────────────────────────────┘

     ┌───────────────────────────────────────────────────────────┐
     │                OFFLINE DELIVERY PIPELINE                   │
     │  WS Server → Redis offline queue → Push Service           │
     │                                    → FCM (Android)        │
     │                                    → APNs (iOS)           │
     │  On reconnect: dequeue → deliver → delete from queue      │
     └───────────────────────────────────────────────────────────┘

     ┌───────────────────────────────────────────────────────────┐
     │                   MEDIA PIPELINE                           │
     │  Client → Pre-signed URL → S3 upload                      │
     │  S3 event → Lambda → Media Processor (resize/transcode)   │
     │  CDN (CloudFront) serves media.bck.ly/{key}               │
     └───────────────────────────────────────────────────────────┘

Step 15 — Common follow-up questions

Q: How do you implement typing indicators without storing anything?

Typing indicators are ephemeral — they don’t need persistence or Cassandra.

// Client sends when user starts/stops typing
ws.send(JSON.stringify({
  type: 'typing_indicator',
  conversationId: 'conv_abc',
  isTyping: true
}));

// Server receives and immediately forwards via pub/sub — no storage
async function handleTypingIndicator(payload: TypingPayload, senderId: string): Promise<void> {
  const recipientPresence = await redis.get(`presence:${payload.recipientId}`);
  if (!recipientPresence) return;  // Recipient offline — don't bother

  await redis.publish(`user:${payload.recipientId}:inbox`, JSON.stringify({
    type: 'typing_indicator',
    from: senderId,
    conversationId: payload.conversationId,
    isTyping: payload.isTyping
  }));
  // That's it. No DB write. No Cassandra.
}

Client auto-stops the typing indicator after 5 seconds of inactivity (client-side timer). If the sender disconnects, the indicator just… stops. No cleanup needed.
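The client-side timer is tiny. A sketch (5-second idle window as above; `send` would wrap the `ws.send` of the typing_indicator frame):

```typescript
// Emit isTyping=true on the first keystroke, refresh a timer on each
// subsequent one, emit isTyping=false when the timer fires. Nothing stored.
function makeTypingNotifier(send: (isTyping: boolean) => void, idleMs = 5000) {
  let timer: ReturnType<typeof setTimeout> | null = null;
  return function onKeystroke(): void {
    if (timer === null) send(true);   // transition: idle -> typing
    else clearTimeout(timer);         // still typing: push the deadline back
    timer = setTimeout(() => { timer = null; send(false); }, idleMs);
  };
}
```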

Q: How do you implement “last seen” timestamp?

When a user disconnects:

async function handleDisconnect(userId: string): Promise<void> {
  const lastSeen = new Date();
  // 1. Remove from presence registry
  await redis.del(`presence:${userId}`);
  // 2. Persist last seen to user profile DB
  await userDB.execute(
    'UPDATE users SET last_seen_at = ? WHERE id = ?',
    [lastSeen, userId]
  );
}

When fetching last seen for display:

  • If presence:{userId} exists → show “Online”.
  • Otherwise → fetch last_seen_at from user table → show “Last seen at 2:34 PM”.

Privacy: Users can opt out of sharing last seen. When opted out: the server still records last_seen_at (for the user’s own “clear history” features) but returns null to other users’ queries.

Q: How do you implement message search?

Cassandra is not designed for full-text search (it has no inverted index). For message search:

  1. Async indexing pipeline: Every message written to Cassandra also publishes to Kafka → Elasticsearch consumer indexes the message content.
  2. Search service: Queries Elasticsearch with conversation_id filter + full-text query.
  3. Result hydration: Fetch actual message objects from Cassandra using the returned message_id list.

Elasticsearch handles the inverted index, relevance ranking, and fuzzy matching. Cassandra handles the durable storage and retrieval by ID. Never mix the two concerns.

Q: How do you handle message ordering when two users send at the same millisecond?

Snowflake IDs include a per-worker sequence number (12 bits = 4096 IDs/ms). Two messages sent at the same millisecond from the same server get different sequence numbers → different Snowflake IDs → deterministic ordering.

Messages from different servers in the same millisecond still get distinct Snowflake IDs (the worker-ID bits differ), but the relative order of those IDs is arbitrary, and clock skew between servers can even invert it. Solution: at read time, break ties by (sent_at, sender_id) — a consistent secondary sort that’s stable even with clock skew.

For causal ordering (ensure “message B was sent after seeing message A”), include a replyToId or causalClock field. Clients display messages in Snowflake ID order (close enough to causal order for most chat scenarios).

Q: What happens to the 20M open WebSocket connections if you need to deploy a new version?

Rolling deployment with graceful drain:

  1. Stop sending new connections to the server being replaced (remove from load balancer).
  2. Send a {"type": "server_shutdown", "reconnectIn": 5} message to all connected clients.
  3. Clients receive this and schedule a reconnect in 5 seconds (with jitter).
  4. Wait for all connections to drain (max 30 seconds).
  5. Terminate the server process.
  6. Clients reconnect to a different server (load balancer picks a new one).
  7. The new server registers the client’s presence and delivers any offline messages that arrived during the gap.

The gap in delivery during the 5-second reconnect is covered by the offline message queue — any messages that arrived during reconnect are queued and delivered immediately on reconnection.
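The client's reconnect delay combines the server's hint, exponential backoff on repeated failures, and jitter so millions of clients don't reconnect in the same instant. A sketch:

```typescript
// Delay before the next reconnect attempt. hintSec comes from the
// server_shutdown frame; attempt counts consecutive failures.
function reconnectDelayMs(
  hintSec: number, attempt: number, rand: () => number = Math.random
): number {
  const base = hintSec * 1000 * 2 ** Math.min(attempt, 5);  // cap backoff at 32x
  return base + rand() * base;  // jitter: uniform in [base, 2*base)
}
```

With hintSec=5, a first attempt lands somewhere between 5 and 10 seconds; the spread across clients is what prevents a thundering herd on the surviving servers.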