What is scalability?
A system is scalable if it can handle increased load without a redesign. Load might mean more users, more requests per second, more data stored, or larger payloads.
Scalability is not the same as performance. A fast system that falls over at 10× traffic is not scalable. A slow system that handles 10× traffic gracefully is scalable (just also slow).
The goal is to design systems where adding resources translates predictably into added capacity.
Vertical scaling (scale up)
Add more power to the existing machine: more CPU cores, more RAM, faster disks.
Before:            After:
[ Server 8GB ]     [ Server 64GB ]
Pros:
- Simple — no application changes required
- No distributed systems complexity
- Consistent (single node = no network partitions, no sync issues)
Cons:
- Hard limit — you can only buy so much CPU/RAM
- Single point of failure
- Expensive beyond a certain point (non-linear cost curve)
- Upgrades typically require downtime
When to use: Early-stage products, databases (where horizontal scaling is hard), stateful services where distribution is painful.
Horizontal scaling (scale out)
Add more machines, distribute the load across them.
Before:            After:
[ Server ]              [ Load Balancer ]
                       ↓        ↓        ↓
                  [ Server ] [ Server ] [ Server ]
Pros:
- Theoretically unlimited scale
- No single point of failure (if designed right)
- Can use commodity hardware
Cons:
- Distributed systems complexity: consistency, partitioning, coordination
- State management becomes hard — which server holds the user’s session?
- Network latency between nodes adds up
When to use: Stateless services (API servers, web servers), read-heavy workloads, anything where requests are independent.
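The distribution step above is what a load balancer does. A minimal sketch of the simplest strategy, round-robin (the server names are hypothetical placeholders):

```python
from itertools import cycle

# Hypothetical backend pool; cycle() loops over it forever.
servers = cycle(["server-1", "server-2", "server-3"])

def route(request_id: str) -> str:
    """Round-robin: each incoming request goes to the next server in turn."""
    return next(servers)

# Three consecutive requests land on three different servers.
assigned = [route(f"req-{i}") for i in range(3)]
print(assigned)  # ['server-1', 'server-2', 'server-3']
```

Real load balancers also do health checks and weighted or least-connections routing, but the core idea is this rotation.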
Stateless vs stateful
This is the key to horizontal scaling. A stateless server treats every request independently — it doesn’t remember anything between requests. Any server can handle any request. This makes horizontal scaling trivially achievable.
A stateful server remembers things between requests (sessions, open connections). If user A’s session lives on server 1 and the load balancer sends their next request to server 2, the session is missing and the request fails (the user appears logged out).
Solutions for stateful services:
- Sticky sessions: load balancer always routes the same user to the same server. Works, but defeats much of the scaling benefit.
- Externalize state: move session data to Redis or a database. All servers share the state store. This is the standard approach.
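The externalized-state approach can be sketched in a few lines. A plain dict stands in for Redis here; with redis-py the pattern (shared get/set keyed by user) would be the same:

```python
# Shared session store — in production this would be Redis or a database,
# reachable by every API server. A dict stands in for it here.
session_store = {}

def handle_request(server: str, user: str) -> str:
    """Any server can serve any user, because state lives in the shared store."""
    session = session_store.setdefault(user, {"server_history": []})
    session["server_history"].append(server)
    return f"user={user} served by {server}, seen {len(session['server_history'])} time(s)"

handle_request("server-1", "alice")
result = handle_request("server-2", "alice")  # different server, same session
print(result)  # user=alice served by server-2, seen 2 time(s)
```

Because no server keeps the session locally, the load balancer is free to route each request anywhere.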
The stateless/stateful split
A well-designed system separates concerns:
[ Client ]
↓
[ Load Balancer ]
↓
[ API Servers — STATELESS, many of them ]
↓
[ Database / Cache — STATEFUL, scaled separately ]
The application tier is stateless and scales horizontally. The persistence tier is stateful and scaled using database-specific techniques (replication, sharding).
Latency vs throughput
Two key metrics:
- Latency: time to handle one request (ms). “How fast is each operation?”
- Throughput: requests handled per unit time (req/s). “How much total work can the system do?”
They’re related but not the same. A system can have low latency and low throughput (fast but not parallelized), or high throughput and high latency (slow per request but handles many concurrently).
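Little's law ties the two metrics together: throughput ≈ concurrency / latency. A quick check of the two cases above (the concrete latencies and concurrency levels are illustrative assumptions):

```python
def throughput(concurrency: int, latency_s: float) -> float:
    """Little's law: requests completed per second = in-flight requests / time per request."""
    return concurrency / latency_s

# Fast but serial: 50 ms per request, one at a time.
print(throughput(1, 0.050))     # 20.0 req/s — low latency, low throughput

# Slow but parallel: 200 ms per request, 1000 in flight.
print(throughput(1000, 0.200))  # 5000.0 req/s — high latency, high throughput
```

This is why adding servers (more concurrency) raises throughput without touching per-request latency.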
| Goal | Optimization |
|---|---|
| Lower latency | Caching, faster algorithms, geographic distribution |
| Higher throughput | Horizontal scaling, async processing, parallelism |
Estimating scale
Before designing, anchor on numbers. Common rough figures for interviews:
| Metric | Rough value |
|---|---|
| Read QPS a single server handles | ~10,000 req/s |
| Write QPS a single database handles | ~1,000 writes/s |
| 1 day of seconds | ~86,400 |
| 1 year of seconds | ~31.5 million |
| Average tweet size | ~300 bytes |
| 1 GB | 10^9 bytes |
Example estimate: Twitter with 100M daily active users, each user seeing 100 tweets/day.
- Total tweet views: 10^8 × 100 = 10^10 per day
- Average QPS: 10^10 / 86,400 ≈ 116,000 reads/sec
- Peak (assume 3× average): ~350,000 reads/sec
- Conclusion: you definitely need horizontal scaling, and caching is critical
Getting the order of magnitude right matters more than the exact number.
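The estimate can be checked with a few lines of arithmetic (the 3× peak factor is the same assumption used above):

```python
dau = 100_000_000          # 100M daily active users = 10^8
views_per_user = 100       # tweets seen per user per day
seconds_per_day = 86_400

total_views = dau * views_per_user        # 10^10 views/day
avg_qps = total_views / seconds_per_day   # average read rate
peak_qps = 3 * avg_qps                    # assumed peak factor

print(round(avg_qps))   # ≈ 115,741 reads/sec
print(round(peak_qps))  # ≈ 347,222 reads/sec
```

Both numbers round to the same order of magnitude as the hand estimate, which is all that matters at this stage.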