What is a load balancer?
A load balancer sits in front of a pool of servers and distributes incoming requests across them. From the outside, clients see one address. Behind it, the work is spread across many machines.
[ Client ]
↓
[ Load Balancer ]
/ | \
[S1] [S2] [S3]
Without a load balancer, you could only scale vertically (one bigger server). With one, you scale horizontally (many servers, load distributed).
Load balancing algorithms
Round robin
Distribute requests in order: S1, S2, S3, S1, S2, S3, …
Simple and fair when all servers are identical and requests are similarly weighted. Falls apart when some requests are much heavier than others — one server gets the expensive requests by chance and falls behind.
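As a sketch, round robin is just a cycling iterator over the pool (server names here are placeholders):

```python
import itertools

class RoundRobin:
    """Cycle through servers in a fixed order."""
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)

lb = RoundRobin(["S1", "S2", "S3"])
# Picks wrap around the pool: S1, S2, S3, S1, S2, S3
picks = [lb.pick() for _ in range(6)]
```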
Weighted round robin
Same as round robin but servers with more capacity get more requests. S1 gets 3 requests for every 1 that S3 gets.
Use when servers have different hardware specs.
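One naive way to implement the weighting is to expand each server into the schedule by its weight (a sketch; production balancers such as nginx use a smoothed variant that interleaves picks instead of bunching them):

```python
class WeightedRoundRobin:
    """Repeat each server in the schedule according to its weight."""
    def __init__(self, weights):
        # weights: dict of server -> integer weight
        self._schedule = [s for s, w in weights.items() for _ in range(w)]
        self._i = 0

    def pick(self):
        server = self._schedule[self._i % len(self._schedule)]
        self._i += 1
        return server

lb = WeightedRoundRobin({"S1": 3, "S3": 1})
picks = [lb.pick() for _ in range(4)]  # S1 three times for every one S3
```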
Least connections
Route each new request to the server with the fewest active connections.
Better than round robin for variable-length requests (some requests hold connections open longer). The load balancer needs to track active connection counts, which adds overhead.
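The connection tracking the strategy requires can be sketched as a counter per server:

```python
class LeastConnections:
    """Route to the server with the fewest active connections."""
    def __init__(self, servers):
        self.active = {s: 0 for s in servers}

    def pick(self):
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        """Call when a request finishes and its connection closes."""
        self.active[server] -= 1

lb = LeastConnections(["S1", "S2", "S3"])
first = lb.pick()    # all tied at zero, min() breaks the tie by pool order
second = lb.pick()   # first pick is still busy, so a different server wins
lb.release(first)    # first server finishes its request
third = lb.pick()    # back to the now-idle first server
```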
IP hash (sticky sessions)
Hash the client’s IP address to always route them to the same server.
Useful for stateful applications where session data lives on the server. Downside: if a server goes down, those users lose their session. Also, IP addresses behind a NAT all hash the same way — one server gets all of them.
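A minimal sketch of the hashing, using MD5 purely for a stable hash (not for security):

```python
import hashlib

def pick_server(client_ip, servers):
    """Hash the client IP to a stable index into the pool."""
    digest = hashlib.md5(client_ip.encode()).digest()
    return servers[int.from_bytes(digest[:4], "big") % len(servers)]

servers = ["S1", "S2", "S3"]
# The same client always lands on the same server
a = pick_server("203.0.113.7", servers)
b = pick_server("203.0.113.7", servers)
```

Note that removing a server from the pool changes `len(servers)` and reshuffles the mapping for most clients; consistent hashing is the usual fix when that matters.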
Least response time
Route to the server with the lowest average response time and fewest connections. The most accurate measure of actual server load but requires the load balancer to actively monitor latency.
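One way to sketch the latency tracking is an exponential moving average per server, with connection count as a tiebreaker (the smoothing factor is an arbitrary example):

```python
class LeastResponseTime:
    """Score servers by average latency, fewest active connections as tiebreak."""
    def __init__(self, servers):
        self.avg_ms = {s: 0.0 for s in servers}
        self.active = {s: 0 for s in servers}

    def pick(self):
        server = min(self.avg_ms, key=lambda s: (self.avg_ms[s], self.active[s]))
        self.active[server] += 1
        return server

    def record(self, server, latency_ms, alpha=0.2):
        # Exponential moving average of observed response times
        self.avg_ms[server] = (1 - alpha) * self.avg_ms[server] + alpha * latency_ms

lb = LeastResponseTime(["S1", "S2"])
lb.record("S1", 200.0)  # S1 has been slow lately
lb.record("S2", 50.0)
choice = lb.pick()      # routes to the faster server
```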
Layer 4 vs Layer 7
Layer 4 (transport layer): Routes based on IP address and TCP/UDP port. The load balancer doesn’t look inside the packet — it just forwards traffic. Very fast, low overhead. Can’t make routing decisions based on content.
Layer 7 (application layer): Routes based on the actual content of the request — URL path, HTTP headers, cookies. Slower (must parse the request), but much more powerful:
- Route /api/* to API servers, /static/* to CDN
- Route authenticated users to one pool, unauthenticated to another
- A/B testing: route 10% of traffic to new servers
- SSL termination (decrypt once at the load balancer, plain HTTP to servers)
Most modern systems use Layer 7 load balancing (Nginx, HAProxy, AWS ALB).
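As a toy sketch of path- and cookie-based routing (the pool names are invented for illustration):

```python
def route(path, headers):
    """Layer 7 routing: inspect the request, not just the address."""
    if path.startswith("/api/"):
        return "api-pool"
    if path.startswith("/static/"):
        return "static-pool"
    if "session" in headers.get("Cookie", ""):
        return "authenticated-pool"
    return "default-pool"

r1 = route("/api/users", {})
r2 = route("/static/logo.png", {})
r3 = route("/home", {"Cookie": "session=abc123"})
```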
Health checks
A load balancer periodically sends requests to each server to check if it’s alive. If a server fails health checks, the load balancer removes it from the pool. When it recovers, it’s added back.
Every 30s:
Load Balancer → GET /health → Server
Server → 200 OK → Load Balancer ✓
Server → timeout / 500 → Load Balancer ✗ (remove from pool)
The health check endpoint should verify what actually matters: can the server connect to its database? Does it have free disk and memory? A server that returns 200 on /health but has a full disk is still broken.
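A health checker along these lines could look like the following sketch (the /health path and timeout are assumptions):

```python
import urllib.request

def is_healthy(url, timeout=2.0):
    """Return True only if the health endpoint responds 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        # Timeouts, refused connections, and 5xx responses all count as failure
        return False

def refresh_pool(servers):
    """Keep only the servers currently passing their health check."""
    return [s for s in servers if is_healthy(f"http://{s}/health")]
```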
The load balancer as a single point of failure
The load balancer itself can fail. If it does, everything behind it becomes unreachable.
Solution: redundant load balancers with failover.
DNS → VIP (virtual IP)
↓
[ LB Primary ] ←→ [ LB Standby ]
(active) (takes over if primary fails)
↓
[ Server Pool ]
The virtual IP floats between the primary and standby. If the primary dies, the standby announces the VIP. Clients see no interruption, because DNS still points at the same VIP. This is standard in production: keepalived implements it on your own infrastructure, and managed load balancers like AWS ELB handle it automatically.
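A minimal sketch of the primary's VRRP configuration in keepalived (the interface name and VIP address are example placeholders):

```
vrrp_instance VI_1 {
    state MASTER          # the standby's config says BACKUP
    interface eth0        # assumed NIC name
    virtual_router_id 51
    priority 100          # standby uses a lower priority, e.g. 90
    advert_int 1          # heartbeat every second
    virtual_ipaddress {
        10.0.0.100        # the floating VIP (example address)
    }
}
```

If the standby stops hearing the primary's advertisements, it promotes itself and announces the VIP.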
Autoscaling
Real production systems don’t have a fixed server pool. They scale the pool up and down based on load:
- Load balancer + monitoring detects CPU > 70% across pool
- Autoscaler provisions 2 new servers
- New servers pass health checks
- Load balancer adds them to the pool
- Load drops → scale back down after a cooldown period
AWS Auto Scaling Groups, GCP Managed Instance Groups, and Kubernetes HPA all work this way.
The load balancer is the traffic router. The autoscaler is the capacity manager. They work together.
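The proportional rule these autoscalers apply can be sketched as follows (the target utilization and pool bounds are example values; the formula has the same shape as Kubernetes' HPA rule):

```python
import math

def desired_capacity(current, avg_cpu, target=0.70, min_n=2, max_n=20):
    """Scale the pool so observed utilization lands near the target:
    desired = ceil(current * observed / target), clamped to pool bounds."""
    desired = math.ceil(current * avg_cpu / target)
    return max(min_n, min(max_n, desired))

n_out = desired_capacity(4, avg_cpu=0.90)  # overloaded -> scale out
n_in = desired_capacity(8, avg_cpu=0.30)   # underused -> scale in
```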
Interview pattern
When an interviewer says “the system needs to handle 10× traffic”, the answer almost always involves:
- Stateless application servers — so you can scale them horizontally
- Load balancer — to distribute traffic across those servers
- Externalized state (Redis/DB) — so any server can handle any request
- Health checks + autoscaling — for reliability and cost efficiency
The load balancer is the enabler. Everything else (caching, CDN, DB replication) is optimization on top.