System Design · Topic 4 of 16

SQL vs NoSQL

100 XP

Relational databases (SQL)

Data is stored in tables with rows and columns. Relationships between tables are expressed via foreign keys. Queries use SQL.

-- Users and their orders — normalized across two tables
SELECT u.name, COUNT(o.id) as order_count
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
GROUP BY u.id

ACID guarantees:

  • Atomicity: a transaction either fully succeeds or fully rolls back — no partial writes
  • Consistency: the database always moves from one valid state to another
  • Isolation: concurrent transactions don’t see each other’s intermediate state
  • Durability: committed data survives crashes

When to use SQL:

  • Structured data with well-defined schema
  • Complex queries and reporting
  • Transactional integrity is required (payments, inventory, bookings)
  • Relationships between entities matter
  • You need JOINs

Popular options: PostgreSQL (default choice), MySQL, SQLite (embedded).


NoSQL databases

“Not Only SQL” — a broad category of databases that don’t use the relational model. They trade some ACID guarantees for scalability, flexibility, or performance in specific access patterns.

Document stores

Store data as JSON-like documents. Each document is self-contained.

{
  "_id": "user_123",
  "name": "Deveshwar",
  "orders": [
    { "id": "ord_1", "total": 499, "items": ["item_a", "item_b"] }
  ]
}

Good for: Nested/variable data, content management, user profiles, any data that reads as a unit. Examples: MongoDB, Firestore.

Key-value stores

Simplest possible model: key → value. Extremely fast, especially in-memory.

Good for: Sessions, caching, feature flags, configuration. Examples: Redis, DynamoDB (also supports more).

Wide-column stores

Rows with flexible columns, optimized for write-heavy, append-mostly workloads.

Good for: Time-series data, event logs, analytics at massive scale. Examples: Cassandra, HBase.

Graph databases

Nodes and edges. Optimized for traversing relationships.

Good for: Social networks, recommendation engines, fraud detection. Examples: Neo4j.


Indexing

Without an index, a query scans every row in the table — O(n). With an index, it jumps directly to matching rows — O(log n) with a B-tree index.

-- Without index: full table scan
SELECT * FROM users WHERE email = 'dev@example.com';  -- scans all rows

-- Create index
CREATE INDEX idx_users_email ON users(email);

-- Now uses the index: O(log n)
SELECT * FROM users WHERE email = 'dev@example.com';

B-tree indexes (the default in PostgreSQL, MySQL):

  • Good for equality (=) and range queries (>, <, BETWEEN)
  • Ordered — also accelerates ORDER BY
  • Most useful for high-cardinality columns (many distinct values)

Hash indexes:

  • Equality only — no ranges
  • O(1) lookup instead of O(log n)
  • Rare in practice (B-tree is usually fast enough and more versatile)

Composite indexes: Index on multiple columns.

CREATE INDEX idx_orders_user_date ON orders(user_id, created_at);
-- Efficient for: WHERE user_id = 123 AND created_at > '2026-01-01'
-- Follows leftmost prefix rule — user_id alone also uses this index

The downside of indexes: Every write must update the index. Heavily-indexed tables have slower writes. Don’t index everything — index the columns in your WHERE and JOIN conditions.


Replication

Running a single database server is a single point of failure and a read bottleneck. Replication copies data to additional servers.

Leader-follower (master-replica)

Writes → [ Leader ] → propagate to followers
Reads  → [ Follower 1 ] [ Follower 2 ] [ Follower 3 ]

All writes go to the leader. Reads can go to any follower. Followers may lag behind the leader by milliseconds to seconds (replication lag).

Pros: Simple. Read scaling (more followers = more read capacity). Follower can be promoted if leader fails.

Cons: Writes don’t scale (all go to one leader). Replication lag means followers may return stale data. Failover is manual or needs extra tooling.

Multi-leader (active-active)

Multiple leaders accept writes. Changes propagate between leaders.

Pros: Write scaling. Survives single leader failure.

Cons: Write conflicts — same record updated on two leaders simultaneously. Conflict resolution is application-specific and hard to get right. Avoid unless you have a clear use case (multi-datacenter active-active).

Leaderless (quorum)

Write to N nodes; a write succeeds when W nodes acknowledge it. Read from R nodes; return the value if a quorum agrees.

With N=3, W=2, R=2: you can tolerate one node failure for both reads and writes, and R+W > N ensures you always read at least one node with the latest write.

Examples: Cassandra, DynamoDB. Used for high availability + partition tolerance.


Sharding (horizontal partitioning)

When a single database can’t hold all the data, split it across multiple database servers (shards).

Range-based sharding: Users A-M on shard 1, N-Z on shard 2. Simple, but hotspots if the data isn’t uniform.

Hash-based sharding: shard = hash(user_id) % N. Even distribution, but range queries must hit all shards.

Directory-based sharding: A lookup service maps keys to shards. Flexible (add shards without rehashing), but the directory is a bottleneck.

The downside: Cross-shard JOINs are expensive or impossible. You design your data model around the shard key. Schema migrations become much harder.


SQL vs NoSQL: the decision

Choose SQL whenChoose NoSQL when
Data has clear relationshipsAccess patterns are known and simple
Transactions matterSchema is flexible or varies per record
Complex queries and reportingNeed to scale writes horizontally (NoSQL shards naturally)
Team knows SQLSpecific access pattern maps to a NoSQL model
You’re unsure (default)Massive scale with uniform access pattern

Default to PostgreSQL. Switch to NoSQL when you have a specific, concrete reason — not because “it scales better.” Modern PostgreSQL with proper indexing and read replicas handles most workloads comfortably.