What is a load balancer?
A load balancer sits in front of a pool of servers and distributes incoming requests across them. From the outside, clients see one address. Behind it, the work is spread across many machines.
[ Client ]
↓
[ Load Balancer ]
/ | \
[S1] [S2] [S3]
Without a load balancer, you could only scale vertically (one bigger server). With one, you scale horizontally (many servers, load distributed).
Load balancing algorithms
Round robin
Distribute requests in order: S1, S2, S3, S1, S2, S3, …
Simple and fair when all servers are identical and requests are similarly weighted. Falls apart when some requests are much heavier than others — one server gets the expensive requests by chance and falls behind.
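As a sketch, round robin is just a cycling iterator over the pool (server names here are placeholders):

```python
import itertools

class RoundRobin:
    """Cycle through servers in a fixed order."""
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)

lb = RoundRobin(["S1", "S2", "S3"])
# Picks wrap around the pool: S1, S2, S3, S1, S2, S3
picks = [lb.pick() for _ in range(6)]
```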
Weighted round robin
Same as round robin but servers with more capacity get more requests. S1 gets 3 requests for every 1 that S3 gets.
Use when servers have different hardware specs.
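One naive way to implement the weighting is to expand each server into the schedule by its weight (a sketch; production balancers such as nginx use a smoothed variant that interleaves picks instead of bunching them):

```python
class WeightedRoundRobin:
    """Repeat each server in the schedule according to its weight."""
    def __init__(self, weights):
        # weights: dict of server -> integer weight
        self._schedule = [s for s, w in weights.items() for _ in range(w)]
        self._i = 0

    def pick(self):
        server = self._schedule[self._i % len(self._schedule)]
        self._i += 1
        return server

lb = WeightedRoundRobin({"S1": 3, "S3": 1})
picks = [lb.pick() for _ in range(4)]  # S1 three times for every one S3
```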
Least connections
Route each new request to the server with the fewest active connections.
Better than round robin for variable-length requests (some requests hold connections open longer). The load balancer needs to track active connection counts, which adds overhead.
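The connection tracking the strategy requires can be sketched as a counter per server:

```python
class LeastConnections:
    """Route to the server with the fewest active connections."""
    def __init__(self, servers):
        self.active = {s: 0 for s in servers}

    def pick(self):
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        """Call when a request finishes and its connection closes."""
        self.active[server] -= 1

lb = LeastConnections(["S1", "S2", "S3"])
first = lb.pick()    # all tied at zero, min() breaks the tie by pool order
second = lb.pick()   # first pick is still busy, so a different server wins
lb.release(first)    # first server finishes its request
third = lb.pick()    # back to the now-idle first server
```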
IP hash (sticky sessions)
Hash the client’s IP address to always route them to the same server.
Useful for stateful applications where session data lives on the server. Downside: if a server goes down, those users lose their session. Also, IP addresses behind a NAT all hash the same way — one server gets all of them.
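A minimal sketch of the hashing, using MD5 purely for a stable hash (not for security):

```python
import hashlib

def pick_server(client_ip, servers):
    """Hash the client IP to a stable index into the pool."""
    digest = hashlib.md5(client_ip.encode()).digest()
    return servers[int.from_bytes(digest[:4], "big") % len(servers)]

servers = ["S1", "S2", "S3"]
# The same client always lands on the same server
a = pick_server("203.0.113.7", servers)
b = pick_server("203.0.113.7", servers)
```

Note that removing a server from the pool changes `len(servers)` and reshuffles the mapping for most clients; consistent hashing is the usual fix when that matters.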
Least response time
Route to the server with the lowest average response time and fewest connections. The most accurate measure of actual server load but requires the load balancer to actively monitor latency.
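One way to sketch the latency tracking is an exponential moving average per server, with connection count as a tiebreaker (the smoothing factor is an arbitrary example):

```python
class LeastResponseTime:
    """Score servers by average latency, fewest active connections as tiebreak."""
    def __init__(self, servers):
        self.avg_ms = {s: 0.0 for s in servers}
        self.active = {s: 0 for s in servers}

    def pick(self):
        server = min(self.avg_ms, key=lambda s: (self.avg_ms[s], self.active[s]))
        self.active[server] += 1
        return server

    def record(self, server, latency_ms, alpha=0.2):
        # Exponential moving average of observed response times
        self.avg_ms[server] = (1 - alpha) * self.avg_ms[server] + alpha * latency_ms

lb = LeastResponseTime(["S1", "S2"])
lb.record("S1", 200.0)  # S1 has been slow lately
lb.record("S2", 50.0)
choice = lb.pick()      # routes to the faster server
```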
Layer 4 vs Layer 7
Layer 4 (transport layer): Routes based on IP address and TCP/UDP port. The load balancer doesn’t look inside the packet — it just forwards traffic. Very fast, low overhead. Can’t make routing decisions based on content.
Layer 7 (application layer): Routes based on the actual content of the request — URL path, HTTP headers, cookies. Slower (must parse the request), but much more powerful:
- Route /api/* to API servers, /static/* to CDN
- Route authenticated users to one pool, unauthenticated to another
- A/B testing: route 10% of traffic to new servers
- SSL termination (decrypt once at the load balancer, plain HTTP to servers)
Most modern systems use Layer 7 load balancing (Nginx, HAProxy, AWS ALB).
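As a toy sketch of path- and cookie-based routing (the pool names are invented for illustration):

```python
def route(path, headers):
    """Layer 7 routing: inspect the request, not just the address."""
    if path.startswith("/api/"):
        return "api-pool"
    if path.startswith("/static/"):
        return "static-pool"
    if "session" in headers.get("Cookie", ""):
        return "authenticated-pool"
    return "default-pool"

r1 = route("/api/users", {})
r2 = route("/static/logo.png", {})
r3 = route("/home", {"Cookie": "session=abc123"})
```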
Health checks
A load balancer periodically sends requests to each server to check if it’s alive. If a server fails health checks, the load balancer removes it from the pool. When it recovers, it’s added back.
Every 30s:
Load Balancer → GET /health → Server
Server → 200 OK → Load Balancer ✓
Server → timeout / 500 → Load Balancer ✗ (remove from pool)
The health check endpoint should verify what actually matters: can the server connect to its database? Does it have free disk and memory? A server that returns 200 on /health but has a full disk is still broken.
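A health checker along these lines could look like the following sketch (the /health path and timeout are assumptions):

```python
import urllib.request

def is_healthy(url, timeout=2.0):
    """Return True only if the health endpoint responds 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        # Timeouts, refused connections, and 5xx responses all count as failure
        return False

def refresh_pool(servers):
    """Keep only the servers currently passing their health check."""
    return [s for s in servers if is_healthy(f"http://{s}/health")]
```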
The load balancer as a single point of failure
The load balancer itself can fail. If it does, everything behind it becomes unreachable.
Solution: redundant load balancers with failover.
DNS → VIP (virtual IP)
↓
[ LB Primary ] ←→ [ LB Standby ]
(active) (takes over if primary fails)
↓
[ Server Pool ]
The virtual IP floats between the primary and standby. If the primary dies, the standby announces the VIP. Clients see no interruption, because DNS still points at the same VIP. This is standard in production: keepalived implements it on your own infrastructure, and managed load balancers like AWS ELB handle it automatically.
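A minimal sketch of the primary's VRRP configuration in keepalived (the interface name and VIP address are example placeholders):

```
vrrp_instance VI_1 {
    state MASTER          # the standby's config says BACKUP
    interface eth0        # assumed NIC name
    virtual_router_id 51
    priority 100          # standby uses a lower priority, e.g. 90
    advert_int 1          # heartbeat every second
    virtual_ipaddress {
        10.0.0.100        # the floating VIP (example address)
    }
}
```

If the standby stops hearing the primary's advertisements, it promotes itself and announces the VIP.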
Autoscaling
Real production systems don’t have a fixed server pool. They scale the pool up and down based on load:
- Load balancer + monitoring detects CPU > 70% across pool
- Autoscaler provisions 2 new servers
- New servers pass health checks
- Load balancer adds them to the pool
- Load drops → scale back down after a cooldown period
AWS Auto Scaling Groups, GCP Managed Instance Groups, and Kubernetes HPA all work this way.
The load balancer is the traffic router. The autoscaler is the capacity manager. They work together.
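The proportional rule these autoscalers apply can be sketched as follows (the target utilization and pool bounds are example values; the formula has the same shape as Kubernetes' HPA rule):

```python
import math

def desired_capacity(current, avg_cpu, target=0.70, min_n=2, max_n=20):
    """Scale the pool so observed utilization lands near the target:
    desired = ceil(current * observed / target), clamped to pool bounds."""
    desired = math.ceil(current * avg_cpu / target)
    return max(min_n, min(max_n, desired))

n_out = desired_capacity(4, avg_cpu=0.90)  # overloaded -> scale out
n_in = desired_capacity(8, avg_cpu=0.30)   # underused -> scale in
```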
Interview pattern
When an interviewer says “the system needs to handle 10× traffic”, the answer almost always involves:
- Stateless application servers — so you can scale them horizontally
- Load balancer — to distribute traffic across those servers
- Externalized state (Redis/DB) — so any server can handle any request
- Health checks + autoscaling — for reliability and cost efficiency
The load balancer is the enabler. Everything else (caching, CDN, DB replication) is optimization on top.