System Design · Topic 4 of 18

SQL vs NoSQL

100 XP

Relational databases (SQL)

Data is stored in tables with rows and columns. Relationships between tables are expressed via foreign keys. Queries use SQL.

-- Users and their orders — normalized across two tables
SELECT u.name, COUNT(o.id) as order_count
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
GROUP BY u.id

ACID guarantees:

  • Atomicity: a transaction either fully succeeds or fully rolls back — no partial writes
  • Consistency: the database always moves from one valid state to another
  • Isolation: concurrent transactions don't see each other's intermediate state
  • Durability: committed data survives crashes

When to use SQL:

  • Structured data with well-defined schema
  • Complex queries and reporting
  • Transactional integrity is required (payments, inventory, bookings)
  • Relationships between entities matter
  • You need JOINs

Popular options: PostgreSQL (default choice), MySQL, SQLite (embedded).


NoSQL databases

"Not Only SQL" — a broad category of databases that don't use the relational model. They trade some ACID guarantees for scalability, flexibility, or performance in specific access patterns.

Document stores

Store data as JSON-like documents. Each document is self-contained.

{
  "_id": "user_123",
  "name": "Deveshwar",
  "orders": [
    { "id": "ord_1", "total": 499, "items": ["item_a", "item_b"] }
  ]
}

Good for: Nested/variable data, content management, user profiles, any data that reads as a unit. Examples: MongoDB, Firestore.

Key-value stores

Simplest possible model: key → value. Extremely fast, especially in-memory.

Good for: Sessions, caching, feature flags, configuration. Examples: Redis, DynamoDB (also supports more).

Wide-column stores

Rows with flexible columns, optimized for write-heavy, append-mostly workloads.

Good for: Time-series data, event logs, analytics at massive scale. Examples: Cassandra, HBase.

Graph databases

Nodes and edges. Optimized for traversing relationships.

Good for: Social networks, recommendation engines, fraud detection. Examples: Neo4j.


Indexing

Without an index, a query scans every row in the table — O(n). With an index, it jumps directly to matching rows — O(log n) with a B-tree index.

-- Without index: full table scan
SELECT * FROM users WHERE email = 'dev@example.com';  -- scans all rows

-- Create index
CREATE INDEX idx_users_email ON users(email);

-- Now uses the index: O(log n)
SELECT * FROM users WHERE email = 'dev@example.com';

B-tree indexes (the default in PostgreSQL, MySQL):

  • Good for equality (=) and range queries (>, <, BETWEEN)
  • Ordered — also accelerates ORDER BY
  • Most useful for high-cardinality columns (many distinct values)

Hash indexes:

  • Equality only — no ranges
  • O(1) lookup instead of O(log n)
  • Rare in practice (B-tree is usually fast enough and more versatile)

Composite indexes: Index on multiple columns.

CREATE INDEX idx_orders_user_date ON orders(user_id, created_at);
-- Efficient for: WHERE user_id = 123 AND created_at > '2026-01-01'
-- Follows leftmost prefix rule — user_id alone also uses this index

The downside of indexes: Every write must update the index. Heavily-indexed tables have slower writes. Don't index everything — index the columns in your WHERE and JOIN conditions.


Replication

Running a single database server is a single point of failure and a read bottleneck. Replication copies data to additional servers.

Leader-follower (master-replica)

Writes → [ Leader ] → propagate to followers
Reads  → [ Follower 1 ] [ Follower 2 ] [ Follower 3 ]

All writes go to the leader. Reads can go to any follower. Followers may lag behind the leader by milliseconds to seconds (replication lag).

Pros: Simple. Read scaling (more followers = more read capacity). Follower can be promoted if leader fails.

Cons: Writes don't scale (all go to one leader). Replication lag means followers may return stale data. Failover is manual or needs extra tooling.

Multi-leader (active-active)

Multiple leaders accept writes. Changes propagate between leaders.

Pros: Write scaling. Survives single leader failure.

Cons: Write conflicts — same record updated on two leaders simultaneously. Conflict resolution is application-specific and hard to get right. Avoid unless you have a clear use case (multi-datacenter active-active).

Leaderless (quorum)

Write to N nodes; a write succeeds when W nodes acknowledge it. Read from R nodes; return the value if a quorum agrees.

With N=3, W=2, R=2: you can tolerate one node failure for both reads and writes, and R+W > N ensures you always read at least one node with the latest write.

Examples: Cassandra, DynamoDB. Used for high availability + partition tolerance.


Sharding (horizontal partitioning)

When a single database can't hold all the data, split it across multiple database servers (shards).

Range-based sharding: Users A-M on shard 1, N-Z on shard 2. Simple, but hotspots if the data isn't uniform.

Hash-based sharding: shard = hash(user_id) % N. Even distribution, but range queries must hit all shards.

Directory-based sharding: A lookup service maps keys to shards. Flexible (add shards without rehashing), but the directory is a bottleneck.

The downside: Cross-shard JOINs are expensive or impossible. You design your data model around the shard key. Schema migrations become much harder.


SQL vs NoSQL: the decision

| Choose SQL when | Choose NoSQL when | |-----------------|-------------------| | Data has clear relationships | Access patterns are known and simple | | Transactions matter | Schema is flexible or varies per record | | Complex queries and reporting | Need to scale writes horizontally (NoSQL shards naturally) | | Team knows SQL | Specific access pattern maps to a NoSQL model | | You're unsure (default) | Massive scale with uniform access pattern |

Default to PostgreSQL. Switch to NoSQL when you have a specific, concrete reason — not because "it scales better." Modern PostgreSQL with proper indexing and read replicas handles most workloads comfortably.