API Design

Why API Design Is a Force Multiplier

An API is a contract. Once published and consumed by clients you don’t control, breaking that contract costs real money — forced migrations, outage windows, angry partners. A well-designed API, by contrast, is self-documenting, evolvable, and reduces the surface area of integration bugs.

This is why API design decisions get escalated to staff engineers. The code you write to implement an endpoint takes hours. The API shape you define will constrain your system for years.


REST: Constraints That Matter

REST (Representational State Transfer) is not just “JSON over HTTP.” Roy Fielding’s dissertation defines six architectural constraints. Only two are commonly violated in practice:

  1. Stateless — each request contains all information needed to process it. No server-side session. This is what makes REST horizontally scalable.
  2. Uniform Interface — standardised interaction via resources, HTTP methods, and hypermedia (HATEOAS). This is what makes REST discoverable.

The other four (client-server, cacheable, layered system, and the optional code-on-demand) are satisfied by most web APIs without special effort.

Richardson Maturity Model

A practical way to assess how “RESTful” an API really is:

Level 0: HTTP as a tunnel
  POST /api
  {"action": "getUserOrders", "userId": 42}

Level 1: Resources (URLs represent nouns)
  POST /users/42/orders
  POST /users/42/orders/cancel   ← still using verbs

Level 2: HTTP Verbs + Status Codes used correctly
  GET    /users/42/orders         → 200
  POST   /users/42/orders         → 201
  DELETE /users/42/orders/99      → 204
  GET    /users/42/orders/999999  → 404

Level 3: HATEOAS (Hypermedia Controls)
  GET /users/42/orders/99
  {
    "orderId": 99,
    "status": "shipped",
    "_links": {
      "self":   { "href": "/users/42/orders/99" },
      "cancel": { "href": "/users/42/orders/99/cancel", "method": "DELETE" },
      "track":  { "href": "/shipments/TRK-abc123" }
    }
  }

Most production APIs operate at Level 2. Level 3 is rare but powerful for APIs consumed by generic clients (like HAL browsers). The links eliminate the need for clients to construct URLs — they follow links like a web browser.
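A minimal sketch of what "following links" means for a client, using the Level 3 response above. The helper name resolveAction and the default of GET when a link carries no method are assumptions for illustration, not part of any hypermedia standard.

```typescript
// Shape of the hypermedia document from the Level 3 example.
interface Link { href: string; method?: string }
interface OrderResource {
  orderId: number;
  status: string;
  _links: Record<string, Link | undefined>;
}

// Hypothetical helper: look an action up by link relation instead of
// building the URL by hand. Returns null when the server did not offer it.
function resolveAction(
  resource: OrderResource,
  rel: string
): { method: string; url: string } | null {
  const link = resource._links[rel];
  if (!link) return null;
  return { method: link.method ?? "GET", url: link.href };
}

const order: OrderResource = {
  orderId: 99,
  status: "shipped",
  _links: {
    self: { href: "/users/42/orders/99" },
    cancel: { href: "/users/42/orders/99/cancel", method: "DELETE" },
    track: { href: "/shipments/TRK-abc123" },
  },
};

const cancel = resolveAction(order, "cancel");
const refund = resolveAction(order, "refund"); // relation not offered
```

The client only knows relation names ("cancel", "track"). If the server later moves the cancel endpoint, or stops offering it for shipped orders, the client code does not change.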


Resource Naming Best Practices

# Use plural nouns for collections
GET  /users          ← collection
GET  /users/42       ← single resource
POST /users          ← create in collection

# Hierarchy for owned resources
GET  /users/42/orders        ← user's orders
GET  /users/42/orders/99     ← specific order

# Avoid deep nesting (> 2 levels gets unwieldy)
# Bad:
GET /companies/1/departments/2/teams/3/members/4

# Better: flatten with query params for context
GET /team-members/4?teamId=3

# Actions that don't map to CRUD: use sub-resource verbs sparingly
POST /orders/99/cancel    ← acceptable for state transitions
POST /payments/capture    ← acceptable for operations

# Query parameters for filtering, sorting, projection
GET /orders?status=shipped&sort=-createdAt&fields=id,status,total

Naming conventions: kebab-case for URLs (/user-profiles), camelCase for JSON fields (userId, createdAt). Whatever you choose, apply it everywhere: inconsistent naming is one of the most common complaints from API consumers.
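As a rough sketch of how a server might interpret the filtering, sorting, and projection parameters shown above: the function name parseListQuery and the result shape are assumptions, and a leading "-" is read as descending order, as in the example URL.

```typescript
interface ListQuery {
  filters: Record<string, string>;
  sort: { field: string; direction: "asc" | "desc" }[];
  fields: string[];
}

// Splits a query string into filters, sort keys, and a field projection.
function parseListQuery(queryString: string): ListQuery {
  const params = new URLSearchParams(queryString);
  const result: ListQuery = { filters: {}, sort: [], fields: [] };

  params.forEach((value, key) => {
    if (key === "sort") {
      for (const part of value.split(",").filter(Boolean)) {
        const descending = part.startsWith("-");
        result.sort.push({
          field: descending ? part.slice(1) : part,
          direction: descending ? "desc" : "asc",
        });
      }
    } else if (key === "fields") {
      result.fields = value.split(",").filter(Boolean);
    } else {
      // Every other parameter is treated as an equality filter.
      result.filters[key] = value;
    }
  });
  return result;
}

const parsed = parseListQuery("status=shipped&sort=-createdAt&fields=id,status,total");
```

A real implementation would also validate field names against an allow-list before using them in a database query.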


HTTP Methods: Semantics and Idempotency

Understanding idempotency is critical — it defines retry safety and how clients recover from network failures.

Method    Semantics                        Idempotent     Safe   Cacheable
GET       Read resource                    Yes            Yes    Yes
HEAD      Read headers only                Yes            Yes    Yes
OPTIONS   Read capabilities                Yes            Yes    No
PUT       Replace resource entirely        Yes            No     No
PATCH     Partial update                   No (usually)   No     No
DELETE    Remove resource                  Yes            No     No
POST      Create / non-idempotent action   No             No     No (in practice)

Idempotent means calling it N times has the same effect as calling it once. DELETE /orders/99 called twice: second call returns 404 but the system state is identical — order 99 is deleted.

PATCH idempotency caveat: PATCH /counter {"increment": 1} is not idempotent. PATCH /counter {"value": 5} is. Design PATCH bodies to be declarative (set-to-value), not imperative (apply-operation), to achieve idempotency.
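The difference can be seen by replaying each kind of patch, as a retrying client would. This is a small in-memory sketch; the Counter type and function names are made up for illustration.

```typescript
interface Counter { value: number }

// Declarative patch: "set the value to X". Replaying it changes nothing.
function applySet(_counter: Counter, patch: { value: number }): Counter {
  return { value: patch.value };
}

// Imperative patch: "add X to the value". Every replay changes state again.
function applyIncrement(counter: Counter, patch: { increment: number }): Counter {
  return { value: counter.value + patch.increment };
}

const start: Counter = { value: 4 };

const setOnce = applySet(start, { value: 5 });
const setTwice = applySet(setOnce, { value: 5 });            // retry of the same request

const incOnce = applyIncrement(start, { increment: 1 });
const incTwice = applyIncrement(incOnce, { increment: 1 });  // retry of the same request
```

After one call both counters hold 5. After the retry the declarative counter is still 5, while the imperative one has drifted to 6.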


HTTP Status Codes: A Precise Taxonomy

2xx — Success
  200 OK            → GET, PUT, PATCH response with body
  201 Created       → POST that created a resource; include Location header
  202 Accepted      → async operation started; polling or webhook to follow
  204 No Content    → DELETE, PUT/PATCH with no response body needed

3xx — Redirection
  301 Moved Permanently → URL changed forever; update bookmarks
  302 Found             → temporary redirect
  304 Not Modified      → GET with If-None-Match/If-Modified-Since; use cached copy

4xx — Client Error
  400 Bad Request       → malformed syntax, validation failure
  401 Unauthorized      → not authenticated (misleading name: means "unauthenticated")
  403 Forbidden         → authenticated but not authorised
  404 Not Found         → resource doesn't exist
  409 Conflict          → state conflict (e.g., duplicate creation, optimistic lock fail)
  410 Gone              → resource existed and was permanently deleted
  422 Unprocessable     → syntactically valid but semantically invalid
  429 Too Many Requests → rate limited

5xx — Server Error
  500 Internal Server Error → catch-all; don't expose stack traces
  502 Bad Gateway           → upstream service returned invalid response
  503 Service Unavailable   → server is overloaded or down; include Retry-After
  504 Gateway Timeout       → upstream service timed out

Common mistakes:

  • Returning 200 with {"success": false, "error": "not found"} — clients can’t programmatically handle errors without parsing body
  • Using 404 for “no results” — an empty collection [] with 200 is correct
  • Using 401 when you mean 403 — the difference matters for auth debugging
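One way to avoid these mistakes is to decide the status code in a single place, from a typed outcome. The Outcome type and toHttp function below are illustrative names, not a framework API.

```typescript
type Outcome =
  | { kind: "found"; body: unknown }
  | { kind: "emptyList" }
  | { kind: "missing" }
  | { kind: "unauthenticated" }
  | { kind: "forbidden" };

// Maps a domain outcome to an HTTP status and body.
function toHttp(outcome: Outcome): { status: number; body: unknown } {
  switch (outcome.kind) {
    case "found":
      return { status: 200, body: outcome.body };
    case "emptyList":
      // A search with no matches is still a successful request.
      return { status: 200, body: [] };
    case "missing":
      return { status: 404, body: { title: "Not Found" } };
    case "unauthenticated":
      // No valid credentials were presented.
      return { status: 401, body: { title: "Unauthorized" } };
    case "forbidden":
      // Credentials are valid but do not grant access.
      return { status: 403, body: { title: "Forbidden" } };
  }
}

const empty = toHttp({ kind: "emptyList" });
const denied = toHttp({ kind: "forbidden" });
```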

Versioning Strategies

APIs need to evolve. How you version determines how painful evolution is.

Strategy 1: URL Path Versioning

GET /v1/users/42
GET /v2/users/42

Pros: Obvious, easy to route at gateway/load balancer level, easy to document separately, easy to deprecate (just redirect old prefix).

Cons: Not “pure REST” (the URL represents a resource, not a version of a resource). Two URLs for the same logical resource. Clients hardcode versions.

Used by: Stripe (/v1/), Twilio (a date segment in the path), and many other public APIs.

Strategy 2: Header Versioning

GET /users/42
API-Version: 2024-01-15

Pros: Clean URLs, easier to support fine-grained versioning (date-based like Stripe).

Cons: Can’t test in browser URL bar, harder to route at the gateway layer, version not visible in logs unless explicitly extracted.

Used by: Stripe (date-based versions like 2023-10-16).

Strategy 3: Content Negotiation (Accept Header)

GET /users/42
Accept: application/vnd.example.v2+json

Pros: Technically “correct” per HTTP spec. Can serve different representations of same resource.

Cons: Verbose, unfamiliar to most developers, poor tooling support.

Used by: GitHub API (partially).

Recommendation

Use URL path versioning for public APIs (/v1/, /v2/). Use date-based header versioning for APIs where clients pin to a specific date (Stripe’s model — excellent for backwards compatibility without explosive version proliferation).

Never version at the field level in the same endpoint — it creates combinatorial complexity.
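A gateway or router can support both recommended styles with one small resolution step. The sketch below assumes header names have been lower-cased and that API-Version is the header in use, as in the Strategy 2 example; resolveVersion is a hypothetical helper.

```typescript
interface VersionedRequest {
  path: string;
  headers: Record<string, string | undefined>; // keys assumed lower-cased
}

// Prefers a /vN/ path prefix, then the API-Version header, then a default.
function resolveVersion(
  req: VersionedRequest,
  defaultVersion: string
): { version: string; resourcePath: string } {
  const match = /^\/v(\d+)(\/.*)$/.exec(req.path);
  if (match) {
    return { version: `v${match[1]}`, resourcePath: match[2] };
  }
  const fromHeader = req.headers["api-version"];
  return { version: fromHeader ?? defaultVersion, resourcePath: req.path };
}

const byPath = resolveVersion({ path: "/v2/users/42", headers: {} }, "v1");
const byHeader = resolveVersion(
  { path: "/users/42", headers: { "api-version": "2024-01-15" } },
  "v1"
);
const fallback = resolveVersion({ path: "/users/42", headers: {} }, "v1");
```

Stripping the prefix before routing means handlers see the same resource path regardless of how the client expressed the version.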


Pagination Patterns

Offset Pagination

// Request
GET /orders?offset=100&limit=25

// Response
{
  "data": [...],
  "pagination": {
    "total": 1543,
    "offset": 100,
    "limit": 25,
    "hasMore": true
  }
}

Pros: Random access (jump to page 40), easy to implement with SQL LIMIT/OFFSET.

Cons:

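  • Inconsistent results during writes: if a record is inserted ahead of your current position while you’re paginating, every later row shifts down by one, so the next page repeats the last item of the page you just read (the “page shift” problem). A delete has the opposite effect: an item is skipped.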
  • Performance degrades: OFFSET 10000 LIMIT 25 requires the database to scan and discard 10,000 rows, so the cost grows linearly with the offset (O(n)).

-- This is slow at large offsets
SELECT * FROM orders ORDER BY created_at DESC LIMIT 25 OFFSET 10000;
-- Reads and discards 10,000 rows before returning 25 (plus a full sort if created_at is not indexed)

Use when: Admin dashboards, analytics UIs where users want “go to page 15” and the dataset is small to medium (<100k rows).
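The page shift problem is easy to reproduce in memory. In this sketch an array stands in for a newest-first table and offsetPage mimics LIMIT/OFFSET; both are illustrative.

```typescript
// Mimics: SELECT ... LIMIT :limit OFFSET :offset
function offsetPage<T>(rows: T[], offset: number, limit: number): T[] {
  return rows.slice(offset, offset + limit);
}

// Newest-first list of order ids.
let orderIds = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1];

const page1 = offsetPage(orderIds, 0, 3);

// A new order arrives at the head of the list between the two requests.
orderIds = [11, ...orderIds];

const page2 = offsetPage(orderIds, 3, 3);

// Items the client has now seen twice.
const duplicates = page2.filter((id) => page1.includes(id));
```

Here page1 is [10, 9, 8] and page2 is [8, 7, 6]: order 8 appears on both pages because the insert pushed every row down by one position.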

Cursor Pagination (Keyset Pagination)

// Request
GET /orders?cursor=eyJpZCI6MTAwMH0&limit=25

// Response
{
  "data": [...],
  "pagination": {
    "nextCursor": "eyJpZCI6OTc1fQ",  // base64({"id":975})
    "hasMore": true
  }
}

-- Efficient: uses index, no full scan
SELECT * FROM orders
WHERE id < 1000          -- cursor decoded
ORDER BY id DESC
LIMIT 25;
-- Uses B-tree index on id → O(log n) + O(limit)

Pros:

  • O(log n) database cost regardless of page depth
  • Stable results: inserts/deletes don’t shift pages
  • Works for infinite scroll / feed UIs

Cons:

  • No random access — can’t jump to “page 40”
  • Cursor is opaque to clients
  • Sorting by multiple fields requires compound cursor (e.g., {"createdAt": "2024-01-15T10:00:00Z", "id": 42})

Use when: Feeds, timelines, large datasets, infinite scroll, any API where you can’t predict access patterns.
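Below is a small in-memory sketch of keyset pagination with the compound cursor mentioned above. It assumes a runtime with global btoa/atob (browsers, recent Node.js), ASCII-only cursor contents, and ISO-8601 UTC timestamps in one fixed format so that string comparison matches time order. The names encodeCursor, decodeCursor, and keysetPage are illustrative.

```typescript
interface Order { id: number; createdAt: string }
interface Cursor { createdAt: string; id: number }

// base64url without padding, matching the cursor strings shown above.
function encodeCursor(value: object): string {
  return btoa(JSON.stringify(value))
    .replace(/\+/g, "-").replace(/\//g, "_").replace(/=+$/, "");
}
function decodeCursor(cursor: string): Cursor {
  return JSON.parse(atob(cursor.replace(/-/g, "+").replace(/_/g, "/")));
}

function keysetPage(rows: Order[], limit: number, cursor: string | null = null) {
  // ORDER BY created_at DESC, id DESC
  const sorted = [...rows].sort((a, b) =>
    a.createdAt === b.createdAt ? b.id - a.id : a.createdAt < b.createdAt ? 1 : -1
  );
  const after = cursor ? decodeCursor(cursor) : null;
  // WHERE (created_at, id) < (:createdAt, :id)
  const remaining = after
    ? sorted.filter((r) =>
        r.createdAt < after.createdAt ||
        (r.createdAt === after.createdAt && r.id < after.id))
    : sorted;
  const data = remaining.slice(0, limit);
  const hasMore = remaining.length > limit;
  const last = data[data.length - 1];
  const nextCursor =
    hasMore && last ? encodeCursor({ createdAt: last.createdAt, id: last.id }) : null;
  return { data, hasMore, nextCursor };
}

const rows: Order[] = [
  { id: 1, createdAt: "2024-01-15T10:00:00Z" },
  { id: 2, createdAt: "2024-01-15T10:00:00Z" },
  { id: 3, createdAt: "2024-01-15T11:00:00Z" },
  { id: 4, createdAt: "2024-01-15T12:00:00Z" },
  { id: 5, createdAt: "2024-01-15T12:00:00Z" },
];
const page1 = keysetPage(rows, 2);
const page2 = keysetPage(rows, 2, page1.nextCursor);
const page3 = keysetPage(rows, 2, page2.nextCursor);
```

The id acts as a tie-breaker: without it, rows sharing a createdAt value could be skipped or repeated at a page boundary.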

Time-Based Pagination

// Request: get events between timestamps
GET /events?since=2024-01-01T00:00:00Z&until=2024-01-02T00:00:00Z&limit=100

// Useful for: audit logs, analytics, webhook replay

Use when: Data is naturally time-ordered and consumers want to poll for new data (webhooks, audit logs).
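A sketch of the server-side filter for such a request follows. Treating the window as half-open (since inclusive, until exclusive) is an assumption made here so that consecutive polls neither overlap nor skip events; the source does not specify the boundary behaviour. AuditEvent and eventsBetween are illustrative names.

```typescript
interface AuditEvent { id: string; at: string } // ISO-8601 timestamp

// Returns events with since <= at < until, oldest first, capped at limit.
function eventsBetween(
  events: AuditEvent[],
  since: string,
  until: string,
  limit: number
): AuditEvent[] {
  const from = Date.parse(since);
  const to = Date.parse(until);
  return events
    .filter((e) => {
      const t = Date.parse(e.at);
      return t >= from && t < to;
    })
    .sort((a, b) => Date.parse(a.at) - Date.parse(b.at))
    .slice(0, limit);
}

const events: AuditEvent[] = [
  { id: "c", at: "2024-01-02T00:00:00Z" }, // exactly at "until": excluded
  { id: "b", at: "2024-01-01T12:00:00Z" },
  { id: "a", at: "2024-01-01T00:00:00Z" }, // exactly at "since": included
];

const window1 = eventsBetween(events, "2024-01-01T00:00:00Z", "2024-01-02T00:00:00Z", 100);
```

A poller would use the previous request's until value as the next request's since value.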


Error Response Schema: RFC 7807 Problem Details

Ad-hoc error schemas are a plague. RFC 7807 (since superseded by RFC 9457, which keeps the same core fields) defines a standard:

HTTP/1.1 422 Unprocessable Entity
Content-Type: application/problem+json

{
  "type": "https://api.example.com/errors/validation-error",
  "title": "Validation Error",
  "status": 422,
  "detail": "The request body contains invalid fields.",
  "instance": "/orders/create#2024-01-15T10:30:00Z",
  "errors": [
    {
      "field": "items[0].quantity",
      "code": "MUST_BE_POSITIVE",
      "message": "Quantity must be greater than 0"
    },
    {
      "field": "shippingAddress.postalCode",
      "code": "INVALID_FORMAT",
      "message": "Postal code must match pattern ^[0-9]{6}$"
    }
  ],
  "traceId": "abc123def456"
}

Key fields:

  • type: URI uniquely identifying the error type (machine-readable, links to docs)
  • title: human-readable summary of the error type
  • status: HTTP status code (redundant with HTTP status but useful for middleware)
  • detail: human-readable explanation specific to this occurrence
  • instance: URI identifying this specific occurrence (useful for support)
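  • errors: extension member (not defined by the RFC, which permits custom fields) listing per-field validation failures
  • traceId: extension member carrying the distributed trace ID for log correlation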
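A typed builder keeps every error response in the same shape. In this sketch the ProblemDetails interface mirrors the fields listed above, with errors and traceId as extension members; validationProblem and the example type URI are illustrative.

```typescript
interface FieldError { field: string; code: string; message: string }

interface ProblemDetails {
  type: string;
  title: string;
  status: number;
  detail?: string;
  instance?: string;
  errors?: FieldError[]; // extension member
  traceId?: string;      // extension member
}

// Builds a 422 response for validation failures.
function validationProblem(
  errors: FieldError[],
  traceId: string
): { status: number; headers: Record<string, string>; body: string } {
  const problem: ProblemDetails = {
    type: "https://api.example.com/errors/validation-error",
    title: "Validation Error",
    status: 422,
    detail: "The request body contains invalid fields.",
    errors,
    traceId,
  };
  return {
    status: problem.status,
    headers: { "Content-Type": "application/problem+json" },
    body: JSON.stringify(problem),
  };
}

const response = validationProblem(
  [{ field: "items[0].quantity", code: "MUST_BE_POSITIVE", message: "Quantity must be greater than 0" }],
  "abc123def456"
);
```

Deriving the HTTP status from the problem object keeps the status line and the body's status field from drifting apart.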

Idempotency Keys

POST is not idempotent. Network failures between sending a POST and receiving a response leave the client unable to know if the action was executed. Did the payment go through? Did the order get created?

Solution: Client-supplied idempotency keys.

// Client generates a unique key per logical operation
const idempotencyKey = crypto.randomUUID();

// Attaches it to the request
await fetch('/payments', {
  method: 'POST',
  headers: {
    'Idempotency-Key': idempotencyKey,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({ amount: 5000, currency: 'USD' })
});

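// If network times out, client retries with SAME key
// Server deduplicates: if key seen, return cached response

// Server-side implementation (assumes an ioredis client; processPayment is defined elsewhere)
import Redis from 'ioredis';

const redis = new Redis();
declare function processPayment(input: unknown): Promise<unknown>;

async function createPayment(req: Request): Promise<Response> {
  const idempotencyKey = req.headers.get('Idempotency-Key');
  const cacheKey = idempotencyKey ? `idempotency:${idempotencyKey}` : null;

  if (cacheKey) {
    // Claim the key atomically (SET ... NX) so concurrent retries cannot both execute
    const claimed = await redis.set(cacheKey, 'in-progress', 'EX', 86400, 'NX');
    if (claimed === null) {
      const cached = await redis.get(cacheKey);
      if (cached === null || cached === 'in-progress') {
        // The first attempt is still running: ask the client to retry later
        return new Response(null, { status: 409 });
      }
      // Replay the stored body with the original status code
      return new Response(cached, {
        status: 201,
        headers: { 'Idempotent-Replayed': 'true' }
      });
    }
  }

  try {
    // Execute the actual operation (req.body is a stream, so parse it first)
    const payment = await processPayment(await req.json());
    const responseBody = JSON.stringify(payment);
    if (cacheKey) {
      // Cache for 24 hours, long enough for any reasonable retry window
      await redis.set(cacheKey, responseBody, 'EX', 86400);
    }
    return new Response(responseBody, { status: 201 });
  } catch (err) {
    // Release the claim so a retry can run the operation again
    if (cacheKey) await redis.del(cacheKey);
    throw err;
  }
}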

Storage consideration: Idempotency keys need to be stored for the retry window (typically 24h–7 days). At Stripe’s scale (millions of API calls/day), this is a significant Redis footprint. Use a short TTL and document it clearly.

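Scope: Keys should be scoped per API key or tenant, not globally, so that two clients that happen to pick the same key cannot see each other's responses. A storage key of the form idempotency:{apiKeyHash}:{clientKey} achieves this.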
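The scoping rule can be checked with a small in-memory stand-in for Redis. IdempotencyStore and its run method are hypothetical; a tenant id is used in place of an API key hash, and there is no TTL or concurrency handling in this sketch.

```typescript
class IdempotencyStore {
  private readonly responses = new Map<string, string>();

  // Runs the operation once per (tenant, client key) pair and replays
  // the stored response on every later call with the same pair.
  run(
    tenantId: string,
    clientKey: string,
    operation: () => string
  ): { body: string; replayed: boolean } {
    const scopedKey = `idempotency:${tenantId}:${clientKey}`;
    const cached = this.responses.get(scopedKey);
    if (cached !== undefined) {
      return { body: cached, replayed: true };
    }
    const body = operation();
    this.responses.set(scopedKey, body);
    return { body, replayed: false };
  }
}

let chargesMade = 0;
const charge = (): string => {
  chargesMade += 1;
  return JSON.stringify({ paymentId: `pay_${chargesMade}` });
};

const store = new IdempotencyStore();
const firstAttempt = store.run("tenantA", "key-1", charge);
const retry = store.run("tenantA", "key-1", charge);        // same tenant, same key
const otherTenant = store.run("tenantB", "key-1", charge);  // same key, different tenant
```

The retry replays the first response without charging again, while the second tenant's identical key is treated as a separate operation, so two charges are made in total.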


GraphQL vs REST: Real Tradeoffs

GraphQL is not strictly better than REST. The tradeoffs are real.

# GraphQL: client specifies exactly what it needs
query {
  user(id: "42") {
    name
    email
    orders(last: 5) {
      id
      total
      status
    }
  }
}

REST equivalent requires:
  GET /users/42           → name, email (+ 20 other fields you don't need)
  GET /users/42/orders    → all orders paginated

Concern            REST                            GraphQL
Over-fetching      Common (fixed response shape)   Eliminated (client selects fields)
Under-fetching     Common (N+1 requests)           Eliminated (single query)
Caching            Simple (HTTP cache by URL)      Hard (POST /graphql, same URL)
Schema evolution   Via versioning strategy         Additive schema evolution
File uploads       Simple multipart                Complex (no standard)
Rate limiting      Per-endpoint                    Hard (query cost unknown)
Error handling     HTTP status codes               Always 200, errors in body
Tooling            Mature                          Growing (GraphiQL, Apollo)
Learning curve     Low                             Medium-High
N+1 problem        No (resolved at endpoint)       Yes (requires DataLoader)

Use GraphQL when:

  • Building a BFF (Backend for Frontend) serving mobile + web with different data needs
  • Internal APIs consumed by teams you control
  • High data-access flexibility needed (exploratory dashboards)

Use REST when:

  • Public API (caching, simplicity, broad tooling support)
  • Simple CRUD with predictable access patterns
  • File upload/download heavy
  • Team unfamiliar with GraphQL and DataLoader patterns
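The N+1 problem from the table can be made concrete by counting queries. This is a synchronous, in-memory sketch of the batching idea behind DataLoader, not the DataLoader library itself; findUser and findUsers stand in for database calls.

```typescript
interface User { id: number; name: string }
interface Order { id: number; userId: number }

const users: User[] = [{ id: 1, name: "Ada" }, { id: 2, name: "Lin" }];
const orders: Order[] = [
  { id: 10, userId: 1 },
  { id: 11, userId: 2 },
  { id: 12, userId: 1 },
];

let queryCount = 0;

// Each call stands in for one database round trip.
function findUser(id: number): User | undefined {
  queryCount += 1;
  return users.find((u) => u.id === id);
}
function findUsers(ids: number[]): Map<number, User> {
  queryCount += 1;
  const found = users.filter((u) => ids.includes(u.id));
  return new Map(found.map((u) => [u.id, u] as [number, User]));
}

// Naive per-field resolver: one user lookup for every order.
function resolveNaive(): { names: string[]; queries: number } {
  queryCount = 1; // the query that loaded the orders
  const names = orders.map((o) => findUser(o.userId)?.name ?? "unknown");
  return { names, queries: queryCount };
}

// Batched resolver: collect the ids first, then issue one lookup.
function resolveBatched(): { names: string[]; queries: number } {
  queryCount = 1; // the query that loaded the orders
  const ids = Array.from(new Set(orders.map((o) => o.userId)));
  const byId = findUsers(ids);
  const names = orders.map((o) => byId.get(o.userId)?.name ?? "unknown");
  return { names, queries: queryCount };
}

const naive = resolveNaive();
const batched = resolveBatched();
```

Both resolvers return the same names, but the naive version issues 4 queries for 3 orders while the batched version issues 2 regardless of how many orders there are.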

gRPC and Protocol Buffers

gRPC uses Protocol Buffers (protobuf) for efficient binary serialization and HTTP/2 for transport. It originated at Google and is a common choice for internal service-to-service communication at companies such as Netflix and Uber.

// order_service.proto
syntax = "proto3";

package orders.v1;

service OrderService {
  rpc GetOrder (GetOrderRequest) returns (Order);
  rpc CreateOrder (CreateOrderRequest) returns (Order);
  rpc ListOrders (ListOrdersRequest) returns (stream Order);  // server streaming
  rpc BulkCreateOrders (stream CreateOrderRequest) returns (BulkResult);  // client streaming
}

message Order {
  string order_id = 1;
  string user_id = 2;
  repeated LineItem items = 3;
  OrderStatus status = 4;
  int64 created_at_ms = 5;
}

message LineItem {
  string product_id = 1;
  int32 quantity = 2;
  int64 price_cents = 3;
}

enum OrderStatus {
  ORDER_STATUS_UNSPECIFIED = 0;
  ORDER_STATUS_PENDING = 1;
  ORDER_STATUS_CONFIRMED = 2;
  ORDER_STATUS_SHIPPED = 3;
}

Serialization comparison:

JSON payload:   {"orderId":"ord_abc123","userId":"usr_42","status":"CONFIRMED"}
JSON bytes:     63 bytes

Protobuf same data:
Binary bytes:   22 bytes (about 65% smaller)
Parse time:     commonly cited as 5-10x faster than JSON; varies by language and library
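The size difference can be checked by encoding the message by hand. The sketch below implements only the two protobuf wire types this message needs (varint and length-delimited), assumes ASCII strings, and uses the field numbers from the Order definition above (order_id = 1, user_id = 2, status = 4, with ORDER_STATUS_CONFIRMED = 2). It illustrates the wire format and is not a replacement for a protobuf library.

```typescript
// Base-128 varint: 7 bits per byte, high bit set on all but the last byte.
function varint(n: number): number[] {
  const out: number[] = [];
  while (n > 0x7f) {
    out.push((n & 0x7f) | 0x80);
    n >>>= 7;
  }
  out.push(n);
  return out;
}

// ASCII-only string to bytes (an assumption that holds for this example).
function asciiBytes(s: string): number[] {
  return Array.from(s, (ch) => ch.charCodeAt(0));
}

// Wire type 2 (length-delimited): tag, length, then the raw bytes.
function stringField(fieldNumber: number, value: string): number[] {
  const bytes = asciiBytes(value);
  return [...varint((fieldNumber << 3) | 2), ...varint(bytes.length), ...bytes];
}

// Wire type 0 (varint): tag, then the enum's numeric value.
function enumField(fieldNumber: number, value: number): number[] {
  return [...varint((fieldNumber << 3) | 0), ...varint(value)];
}

const encoded = [
  ...stringField(1, "ord_abc123"), // order_id
  ...stringField(2, "usr_42"),     // user_id
  ...enumField(4, 2),              // status = ORDER_STATUS_CONFIRMED
];

const json = '{"orderId":"ord_abc123","userId":"usr_42","status":"CONFIRMED"}';
const sizes = { protobuf: encoded.length, json: json.length };
```

Most of the saving comes from replacing field names with one-byte tags and the enum string with a single byte; the string values themselves cost the same in both formats.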

gRPC vs REST:

Concern           REST + JSON           gRPC + Protobuf
Payload size      Larger                ~70% smaller
Parse speed       Slower                5-10x faster
Schema            Optional (OpenAPI)    Required (strongly typed)
Browser support   Native                Requires grpc-web proxy
Streaming         SSE / WebSocket       Native bi-directional
Code generation   Optional              First-class (all languages)
Human readable    Yes                   No (binary)
Load balancing    HTTP/1.1 compatible   Requires L7 load balancer

Use gRPC for: Internal microservice communication, real-time streaming, mobile apps needing high throughput on metered connections.


API Gateway Pattern

An API Gateway is the single entry point for all client requests:

Mobile App  ──┐
Web App     ──┼──► API Gateway ──► Auth Service
Partner API ──┘         │
                        ├──► User Service
                        ├──► Order Service
                        └──► Payment Service

The gateway handles cross-cutting concerns:

  • Authentication: verify JWT, OAuth tokens — services trust gateway
  • Rate limiting: per-client, per-endpoint
  • Request routing: path → service mapping
  • Load balancing: across service instances
  • SSL termination
  • Request/response transformation: e.g., REST → gRPC translation
  • Observability: access logs, traces, metrics
  • Caching: edge caching for GET responses

Popular choices: Kong (open-source, plugin ecosystem), AWS API Gateway (managed, deep AWS integration), Nginx + Lua, Envoy (used internally by many companies as sidecar + gateway).
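The request routing concern above reduces to a path-to-service lookup. Here is a minimal sketch using longest-prefix matching; the route table and service names are invented for illustration and do not reflect any particular gateway's configuration format.

```typescript
interface Route { prefix: string; service: string }

const routes: Route[] = [
  { prefix: "/users", service: "user-service" },
  { prefix: "/orders", service: "order-service" },
  { prefix: "/payments", service: "payment-service" },
  { prefix: "/users/auth", service: "auth-service" },
];

// Picks the route with the longest prefix that matches on a path-segment
// boundary, so "/ordersX" does not match "/orders".
function routeRequest(path: string, table: Route[]): string | null {
  let best: Route | null = null;
  for (const route of table) {
    const matches = path === route.prefix || path.startsWith(route.prefix + "/");
    if (matches && (best === null || route.prefix.length > best.prefix.length)) {
      best = route;
    }
  }
  return best ? best.service : null;
}

const orderTarget = routeRequest("/orders/99", routes);
const authTarget = routeRequest("/users/auth/login", routes);
const unknownTarget = routeRequest("/reports", routes);
```

A null result is where the gateway would return 404 itself rather than forwarding the request.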


Backward Compatibility

Never break existing clients. The rules:

// SAFE: additive changes
// Adding new optional fields is backward compatible
{
  "orderId": "123",
  "status": "shipped",
  "trackingUrl": "https://..."    // NEW: clients that don't know about this field ignore it
}

// SAFE: new optional request parameter
GET /orders?includeArchived=true  // clients that don't send it get old behaviour

// BREAKING: removing fields
// Clients reading "status" will break if you remove it

// BREAKING: changing field type
// "amount": 5000  →  "amount": "50.00"  // integer to string breaks clients

// BREAKING: changing enum values
// "status": "in_transit" → "status": "shipped"  // renames break clients

// BREAKING: changing URL structure
// /v1/users/42/orders → /v1/orders?userId=42

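Tolerant Reader pattern: Clients should ignore unknown fields in responses. Most dynamic JSON parsers do this by default, but strict deserializers (Jackson, for example, fails on unknown properties unless configured otherwise) need it switched on. This client behaviour is what makes additive server-side changes safe.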

Postel’s Law: “Be conservative in what you send, be liberal in what you accept.” Accept extra fields gracefully; never emit undocumented fields in stable APIs.
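A tolerant reader on the client side might look like the sketch below. It reads only the fields it knows, ignores everything else, and maps unrecognised enum values to a fallback instead of failing. The fallback value "unknown" and the function name readOrder are assumptions for illustration.

```typescript
type KnownStatus = "pending" | "confirmed" | "shipped" | "cancelled";

interface OrderView {
  orderId: string;
  status: KnownStatus | "unknown";
}

const KNOWN_STATUSES: readonly string[] = ["pending", "confirmed", "shipped", "cancelled"];

// Extracts the fields this client understands and ignores the rest.
function readOrder(json: string): OrderView {
  const raw = JSON.parse(json) as Record<string, unknown>;
  const orderId = raw["orderId"];
  const rawStatus = raw["status"];

  if (typeof orderId !== "string") {
    // A missing required field is a real contract violation.
    throw new Error("orderId is missing or not a string");
  }
  const status =
    typeof rawStatus === "string" && KNOWN_STATUSES.includes(rawStatus)
      ? (rawStatus as KnownStatus)
      : "unknown";
  return { orderId, status };
}

// New optional field: ignored by this client.
const withExtra = readOrder(
  '{"orderId":"123","status":"shipped","trackingUrl":"https://example.com/t"}'
);
// New enum value this client has never seen: degraded, not crashed.
const withNewStatus = readOrder('{"orderId":"124","status":"in_transit"}');
```

Tolerating an unknown enum value softens the impact of a new value being added; it does not make renaming an existing value safe.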


Deprecation Strategy

  1. Announce deprecation in changelog, developer portal, and response headers:

Deprecation: Sat, 01 Jun 2024 00:00:00 GMT
Sunset: Wed, 01 Jan 2025 00:00:00 GMT
Link: <https://api.example.com/docs/migration/v2>; rel="successor-version"

  2. Monitor usage: track calls to deprecated endpoints by API key. Reach out directly to heavy users of soon-to-be-removed endpoints.

  3. Minimum deprecation window: 6 months for public APIs, 12 months for enterprise/partner APIs.

  4. Sunset ≠ delete immediately: return 410 Gone with a helpful migration message before actually decommissioning infrastructure.
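The header and 410 behaviour described in these steps can be sketched as a small function. DeprecationPolicy and deprecationResponse are hypothetical names, and the Deprecation header is written as an HTTP date to mirror the example headers in this section.

```typescript
interface DeprecationPolicy {
  deprecatedAt: Date;
  sunsetAt: Date;
  migrationGuide: string;
}

// Returns the headers to attach to every response from a deprecated
// endpoint, plus the status to use: 410 once the sunset date has passed.
function deprecationResponse(
  policy: DeprecationPolicy,
  now: Date
): { status: number; headers: Record<string, string> } {
  const headers: Record<string, string> = {
    Deprecation: policy.deprecatedAt.toUTCString(),
    Sunset: policy.sunsetAt.toUTCString(),
    Link: `<${policy.migrationGuide}>; rel="successor-version"`,
  };
  const pastSunset = now.getTime() >= policy.sunsetAt.getTime();
  return { status: pastSunset ? 410 : 200, headers };
}

const policy: DeprecationPolicy = {
  deprecatedAt: new Date(Date.UTC(2024, 5, 1)), // 1 June 2024
  sunsetAt: new Date(Date.UTC(2025, 0, 1)),     // 1 January 2025
  migrationGuide: "https://api.example.com/docs/migration/v2",
};

const beforeSunset = deprecationResponse(policy, new Date(Date.UTC(2024, 8, 1)));
const afterSunset = deprecationResponse(policy, new Date(Date.UTC(2025, 1, 1)));
```

Keeping the headers on the 410 response means a client that only notices the deprecation after the sunset still gets a pointer to the migration guide.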


OpenAPI Specification

OpenAPI (formerly Swagger) is the de-facto standard for REST API documentation:

# openapi.yaml
openapi: 3.1.0
info:
  title: Order Service API
  version: 1.0.0

paths:
  /orders:
    get:
      summary: List orders
      parameters:
        - name: cursor
          in: query
          schema: { type: string }
        - name: limit
          in: query
          schema: { type: integer, default: 25, maximum: 100 }
      responses:
        '200':
          description: Order list
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/OrderList'
        '401':
          $ref: '#/components/responses/Unauthorized'

components:
  schemas:
    Order:
      type: object
      required: [orderId, status, total]
      properties:
        orderId:
          type: string
          example: "ord_abc123"
        status:
          type: string
          enum: [pending, confirmed, shipped, cancelled]
        total:
          type: integer
          description: Amount in smallest currency unit (cents)
          example: 9999
    # Referenced by the 200 response above
    OrderList:
      type: object
      required: [data, pagination]
      properties:
        data:
          type: array
          items:
            $ref: '#/components/schemas/Order'
        pagination:
          type: object
          properties:
            nextCursor: { type: string }
            hasMore: { type: boolean }
  responses:
    # Referenced by the 401 response above
    Unauthorized:
      description: Missing or invalid credentials

Benefits: auto-generate SDKs (openapi-generator), mock servers (Prism), documentation (Redoc, Swagger UI), contract testing (Schemathesis).


Rate Limiting Headers

HTTP/1.1 200 OK
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 999
X-RateLimit-Reset: 1705312800   # Unix timestamp when limit resets
Retry-After: 3600               # Seconds until retry (on 429)

GitHub’s convention (widely adopted):

X-RateLimit-Limit: 5000
X-RateLimit-Remaining: 4987
X-RateLimit-Used: 13
X-RateLimit-Reset: 1705312800
X-RateLimit-Resource: core
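On the client side, these headers translate into a wait time before the next attempt. The sketch below assumes header names have been lower-cased and that X-RateLimit-Reset is a Unix timestamp in seconds, as in the examples above; retryDelayMs is an illustrative helper.

```typescript
// Returns how long the client should wait before sending another request.
function retryDelayMs(
  status: number,
  headers: Record<string, string | undefined>, // keys assumed lower-cased
  nowMs: number
): number {
  const retryAfter = headers["retry-after"];
  if (status === 429 && retryAfter !== undefined) {
    // Retry-After in seconds takes priority when the server sends it.
    return Number(retryAfter) * 1000;
  }
  const remaining = headers["x-ratelimit-remaining"];
  const reset = headers["x-ratelimit-reset"];
  if (remaining === "0" && reset !== undefined) {
    // Budget exhausted: wait until the window resets.
    return Math.max(0, Number(reset) * 1000 - nowMs);
  }
  return 0;
}

const afterThrottle = retryDelayMs(429, { "retry-after": "3600" }, 0);
const budgetSpent = retryDelayMs(
  200,
  { "x-ratelimit-remaining": "0", "x-ratelimit-reset": "1705312800" },
  1705312700 * 1000
);
const budgetLeft = retryDelayMs(
  200,
  { "x-ratelimit-remaining": "999", "x-ratelimit-reset": "1705312800" },
  1705312700 * 1000
);
```

Checking the remaining budget on successful responses lets a client slow down before it ever receives a 429.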

Interview Checklist

REST Fundamentals

  • What are the 6 REST constraints? Which two are most commonly violated?
  • Explain the Richardson Maturity Model with examples
  • What is the difference between 401 Unauthorized and 403 Forbidden?
  • When should you return 202 Accepted vs 201 Created?

Idempotency and Safety

  • Which HTTP methods are idempotent? Which are safe?
  • Why is PATCH not always idempotent? How do you make it idempotent?
  • How do idempotency keys work? How would you implement them in Redis?
  • A client posts a payment but gets a network timeout. What should it do?

Versioning and Evolution

  • Compare URL versioning vs header versioning — tradeoffs?
  • What is a breaking change? Give 5 examples
  • What is the Tolerant Reader pattern?
  • How would you deprecate and sunset a widely-used API endpoint?

Pagination

  • Why is offset pagination slow at large offsets?
  • How does cursor/keyset pagination work? What SQL does it use?
  • When would you choose offset over cursor pagination?

Protocol Comparison

  • REST vs GraphQL: when does each win? What is the N+1 problem in GraphQL?
  • Why use gRPC for internal services? What are the limitations?
  • What does a protobuf wire format gain over JSON?

Architecture

  • What does an API gateway do? What cross-cutting concerns does it handle?
  • Design the API for a ride-sharing app (drivers, riders, trips, payments)
  • How would you version an API that serves 10,000 partner integrations?