Observability: Metrics, Traces, Logs — Grind: DSA, System Design, Interview Prep

The Three Pillars: Why You Need All Three

A system is observable if you can understand its internal state from its external outputs alone — without modifying the code or attaching a debugger. Observability is not just about having dashboards. It’s about having enough data to diagnose a failure you’ve never seen before.

The three pillars are complementary, not redundant:

Pillar      What it answers                   Granularity    Cost
---------   --------------------------------  -------------  ----
Metrics     "Is the system healthy right now?" Aggregate      Low
Traces      "Where is this request slow?"      Per-request    Medium
Logs        "What exactly happened here?"      Per-event      High

A latency spike: metrics tell you something is wrong, traces tell you which service is slow, logs tell you what the error was. Without all three, you’re debugging blind.

The Observability Pipeline:

Application
    │
    ├──► Metrics (counters/gauges/histos) ──► Prometheus ──► Grafana
    │
    ├──► Traces (spans) ──────────────────► Jaeger / Tempo ──► Grafana
    │
    └──► Logs (structured JSON) ──────────► Loki / Elasticsearch ──► Grafana / Kibana

Metrics: Time-Series Data

A metric is a numeric measurement taken at a point in time, associated with a name and a set of labels (key-value pairs). Metrics are cheap to store and query across time ranges — they are the backbone of alerting.

The four metric types (Prometheus model):

Counter: Monotonically increasing integer. Never decreases (only reset to 0 on process restart). Use for: requests served, errors, bytes sent.

# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200",path="/api/users"} 14923
http_requests_total{method="POST",status="500",path="/api/orders"} 7

Rate of increase is more useful than raw value:

rate(http_requests_total[5m])  # requests/second over last 5 minutes

Gauge: Arbitrary value that can go up or down. Use for: current connections, memory usage, queue depth, CPU temperature.

node_memory_MemAvailable_bytes
container_cpu_usage_seconds_total
pg_stat_activity_count{datname="production"}

Histogram: Samples observations into configurable buckets. Exposes _bucket, _count, _sum. Use for: request latency, response size — anything you want percentiles for.

http_request_duration_seconds_bucket{le="0.005"} 1234
http_request_duration_seconds_bucket{le="0.01"}  2891
http_request_duration_seconds_bucket{le="0.025"} 7234
http_request_duration_seconds_bucket{le="0.1"}   14100
http_request_duration_seconds_bucket{le="+Inf"}  14923
http_request_duration_seconds_count              14923
http_request_duration_seconds_sum               1847.2

Summary: Like a histogram but calculates quantiles on the client side. Cheaper to query but cannot be aggregated across instances — use histograms in distributed systems.

Prometheus Data Model and PromQL

Prometheus is a pull-based metrics system. It scrapes /metrics endpoints on a configurable interval (typically 15s). This is the inverse of push-based systems (StatsD, InfluxDB): the monitoring system controls collection, not the application.

PromQL — the query language:

# Error rate: ratio of 5xx responses to total
sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m]))

# p99 latency from a histogram
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)

# CPU saturation: per-core usage over 1 minute
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100)

# Apdex score: fraction of requests under 300ms (satisfied) +
# half of requests between 300ms and 1200ms (tolerating)
(
  sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))
  +
  sum(rate(http_request_duration_seconds_bucket{le="1.2"}[5m]))
) / 2
/
sum(rate(http_request_duration_seconds_count[5m]))

Label cardinality is the performance killer. Each unique combination of label values is a separate time series. Avoid high-cardinality labels like user_id or request_id — they create millions of time series and OOM Prometheus.

RED and USE Methods

Two frameworks for knowing which metrics to create for any given system:

RED Method (Tom Wilkie) — for services (anything that handles requests):

Rate: requests per second
Errors: errors per second (or error ratio)
Duration: latency distribution (p50, p95, p99)

Every service should expose these three metrics. They directly answer “is this service healthy?”

USE Method (Brendan Gregg) — for resources (CPU, disk, network):

Utilization: % of time the resource is busy
Saturation: queue depth / wait time (work it can’t process yet)
Errors: error events

Resource        Utilization          Saturation              Errors
----------      ----------------     -------------------     ------
CPU             cpu_usage %          run queue length        machine check
Memory          used / total         paging / swapping       OOM kills
Disk            iops used / max      wait queue              io errors
Network NIC     bandwidth used/max   tx/rx drops             errors/drops
Database Pool   connections used     wait queue depth        connection errors

SLI, SLO, SLA and Error Budgets

SLI (Service Level Indicator): A quantitative measure of service behavior. Must be measurable. Example: ratio of HTTP requests completing in <200ms.

SLO (Service Level Objective): A target value for an SLI. “99.9% of requests complete in <200ms over a rolling 28-day window.” This is your internal commitment.

SLA (Service Level Agreement): A business contract with consequences for breach (refunds, credits). Typically less strict than your internal SLO — you need margin.

Error Budget: error_budget = 1 - SLO. If your SLO is 99.9%, your error budget is 0.1% of requests. In 28 days = 28 * 24 * 60 = 40,320 minutes. 0.1% of that = 40.3 minutes of downtime you’re “allowed.”

Error budgets make reliability concrete: if your budget is exhausted, stop feature work and focus on reliability. If your budget is healthy, you have capacity to take risks.

Burn rate alerts (Google SRE Book approach):

Burn rate = actual error rate / error budget rate

If error budget depletes in:
  1 hour   → burn rate = 28*24 = 672x (page immediately)
  6 hours  → burn rate = 112x  (page in 5 minutes)
  3 days   → burn rate = ~9x   (ticket, next business day)
  28 days  → burn rate = 1x    (on track to exactly exhaust budget)

Multi-window alerts: alert when burn rate is high over both a short window (1h — sensitivity to spikes) and a long window (6h — filters out noise). This reduces false positives compared to simple threshold alerts.

Percentiles: Why Average Latency Lies

Request latencies (ms): 10, 12, 11, 9, 10, 11, 10, 12, 850, 10

Average: (10+12+11+9+10+11+10+12+850+10) / 10 = 94.5ms
p50:  10ms   (median — half of users see this or better)
p95:  850ms  (1 in 20 users)
p99:  850ms  (1 in 100 users)

The average is 94.5ms but 90% of users see 10-12ms and 10% see 850ms. The average tells you nothing useful. A service can have “good average latency” while 1 in 100 users experiences timeouts.

Long-tail latency is particularly damaging in microservices. If a page requires 10 parallel service calls and each has 1% p99 latency, the probability that at least one call is slow = 1 - (0.99)^10 = 9.6%. Nearly 10% of page loads are slow.

Histograms can be aggregated; summaries cannot. When you have 10 instances of a service, you can sum() their histogram buckets and then compute histogram_quantile(). You cannot average pre-computed quantiles — it’s mathematically invalid.

Distributed Tracing

In a monolith, a stack trace tells you exactly where time was spent. In microservices, a single user request may traverse 20 services — each with its own logs, no shared stack.

Distributed tracing reconstructs the causal chain of operations across service boundaries.

User Request ──► API Gateway ──► Auth Service
                             ──► User Service ──► Postgres
                             ──► Product Service ──► Redis
                                               ──► Inventory Service ──► Kafka

A trace is a tree of spans. Each span represents a unit of work with:

Trace ID (same for all spans in the request)
Span ID (unique per span)
Parent Span ID (links to caller)
Start time + duration
Service name, operation name
Tags (key-value metadata) and logs (timestamped events)

TraceID: abc123
│
├─ [API Gateway] handle_request  0ms ──────────────────────────── 245ms
│   ├─ [Auth] verify_token        2ms ───── 18ms
│   ├─ [User] get_user           20ms ──────────── 45ms
│   │   └─ [Postgres] SELECT      21ms ──── 43ms  ← slow query!
│   └─ [Product] get_cart         22ms ─────────────────────── 240ms
│       ├─ [Redis] GET             23ms ─ 24ms
│       └─ [Inventory] check      25ms ──────────────────── 238ms  ← bottleneck

Without tracing, you’d see elevated API Gateway p99 latency but have no way to attribute it to Inventory Service.

OpenTelemetry

OpenTelemetry (OTel) is the CNCF standard for instrumentation. It replaces vendor-specific SDKs (Jaeger client, Zipkin client, AWS X-Ray SDK) with a single unified API.

Architecture:

Application SDK
    │
    ▼
OTel Collector (agent or gateway)
    │
    ├──► Jaeger / Grafana Tempo  (traces)
    ├──► Prometheus              (metrics)
    └──► Loki / Elasticsearch   (logs)

W3C Trace Context propagation: Spans are linked across services via HTTP headers:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
- Format: version-traceId-parentSpanId-flags
tracestate: vendor-specific key-value pairs

// Node.js OpenTelemetry setup
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { Resource } from '@opentelemetry/resources';
import { SEMRESATTRS_SERVICE_NAME } from '@opentelemetry/semantic-conventions';

const sdk = new NodeSDK({
  resource: new Resource({
    [SEMRESATTRS_SERVICE_NAME]: 'order-service',
  }),
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: 'http://otel-collector:4318/v1/metrics',
    }),
    exportIntervalMillis: 15000,
  }),
});
sdk.start();

// Manual span creation
import { trace, SpanStatusCode } from '@opentelemetry/api';
const tracer = trace.getTracer('order-service');

async function processOrder(orderId: string): Promise<void> {
  const span = tracer.startSpan('processOrder', {
    attributes: { 'order.id': orderId },
  });
  try {
    await span.setAttribute('order.items', 3);
    await chargePayment(orderId);
    span.setStatus({ code: SpanStatusCode.OK });
  } catch (err) {
    span.setStatus({ code: SpanStatusCode.ERROR, message: String(err) });
    span.recordException(err as Error);
    throw err;
  } finally {
    span.end();
  }
}

Sampling Strategies

A high-traffic service may process 100,000 requests/second. Tracing every one costs ~2% CPU and massive storage. You must sample.

Head-based sampling: The decision to sample is made at trace entry (the first service). All downstream services inherit the decision. Simple to implement; configured by sampleratio (e.g., 1%).

Problem: 1% sampling means 99% of traces are discarded. A rare error affecting 0.01% of traffic may produce zero sampled traces.

Tail-based sampling: Collect all spans; make the sampling decision after the trace completes, based on outcomes. Keep 100% of traces with errors, 100% of traces with high latency, 1% of successful fast traces.

Tail-based sampling in OTel Collector:
- Buffer all spans for a trace for up to 10 seconds
- When trace completes (or timeout): apply policies
  - policy: latency > 1000ms → keep
  - policy: status_code == ERROR → keep
  - policy: random_sample 1% → keep
  - else: drop

Tail-based sampling requires buffering all spans in memory, which is expensive. Google Dapper and Jaeger’s tail sampler do this.

Structured Logging

Unstructured logs are for humans. Structured logs are for machines.

# Unstructured (hard to query):
2024-01-15 10:23:45 ERROR failed to process order 12345 for user 67890: payment declined

# Structured JSON (queryable):
{
  "timestamp": "2024-01-15T10:23:45.123Z",
  "level": "error",
  "service": "order-service",
  "trace_id": "4bf92f3577b34da6a",
  "span_id": "00f067aa0ba902b7",
  "user_id": "67890",
  "order_id": "12345",
  "error": "payment_declined",
  "payment_gateway": "stripe",
  "amount_cents": 4999,
  "message": "Failed to process order"
}

The trace_id / span_id fields are the critical link: from a trace in Jaeger, you can jump directly to all logs for that request in Kibana or Loki.

Log levels and when to use them:

Level	Use case	Example
`TRACE`	Extremely verbose, dev only	Function entry/exit
`DEBUG`	Useful for debugging, off in prod	SQL query text
`INFO`	Normal operation milestones	Order created, user logged in
`WARN`	Unexpected but recoverable	Cache miss, retry attempt
`ERROR`	Operation failed, human attention needed	Payment failed
`FATAL`	Process cannot continue	DB connection lost at startup

Dynamic log level adjustment: Production systems should support changing log levels without restart (via config reload or admin API). Raise to DEBUG only when investigating; lower to INFO/WARN normally.

Log Aggregation: ELK and Loki

Elasticsearch + Kibana (ELK):

Logstash or Filebeat ships logs to Elasticsearch
Elasticsearch indexes every field — full-text search and structured queries
Expensive at scale (high storage, high CPU for indexing)
Powerful: query user_id:67890 AND level:error across all services

Grafana Loki:

Stores logs as compressed chunks; only indexes labels (service, level, env)
Query language (LogQL) similar to PromQL
Much cheaper than Elasticsearch for high-volume logs
Trade-off: label-based filtering only; full-text grep is slower

# Loki: find all errors from order-service in last 5 minutes
{service="order-service", level="error"} |= "payment"

# Extract latency from log line and create metric
{service="api-gateway"}
  | json
  | latency_ms > 1000
  | line_format "{{.user_id}} {{.path}} {{.latency_ms}}"

# Rate of errors per service (LogQL metric query)
sum by (service) (
  rate({level="error"}[5m])
)

eBPF: Zero-Code Instrumentation

Extended Berkeley Packet Filter (eBPF) allows running sandboxed programs in the Linux kernel in response to events — without modifying application code or restarting services.

Tools like Cilium (networking), Pixie (observability), and Parca (continuous profiling) use eBPF to:

Intercept syscalls to capture latency of disk I/O, network calls
Map network connections between services automatically (no service mesh sidecar needed)
Profile CPU usage at the function level with <1% overhead
Trace HTTP/gRPC requests at the kernel level (no SDK changes)

eBPF probes attach to kernel functions (kprobe), userspace functions (uprobe), or tracepoints. The BPF program runs in a verified sandbox — it cannot crash the kernel.

# bpftrace: count syscalls by process
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'

# Trace slow disk reads (>10ms)
bpftrace -e 'kretprobe:blk_account_io_done /args->rq->io_start_time_ns > 0/ {
  $lat = (nsecs - args->rq->io_start_time_ns) / 1000000;
  if ($lat > 10) { printf("slow IO: %dms\n", $lat); }
}'

Alerting Architecture

Prometheus → Alertmanager → PagerDuty/Slack

# Prometheus alert rule
groups:
  - name: order-service
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{service="order-service",status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{service="order-service"}[5m]))
          > 0.01
        for: 5m   # must be true for 5 consecutive minutes (avoids flapping)
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Order service error rate > 1%"
          runbook_url: "https://wiki/runbooks/order-service-errors"
          description: "Error rate is {{ $value | humanizePercentage }}"

Alert fatigue is the #1 failure mode of alerting systems. When alerts are too noisy, on-call engineers begin ignoring them — including real incidents. Solutions:

Alerts should be actionable — every alert needs a runbook
Use for: duration to avoid flapping alerts
Inhibition rules: suppress low-severity alerts when high-severity fires (don’t alert “high latency” when “service down”)
Routing: critical → pager, warning → Slack, info → log only

Severity levels:

Level	Meaning	Response
P1/Critical	Customer-impacting, SLO breaching	Page on-call immediately
P2/High	Degraded but functional, budget burning	Page within 30 min
P3/Medium	Non-urgent, investigate next day	Ticket
P4/Low	Informational	Dashboard only

Real-World Observability Architectures

Uber: At 100M+ trips/day, Uber runs their own time-series database (M3) capable of ingesting billions of metrics per minute. They maintain separate Kafka topics per metric category, with consumers writing to M3 and alerting pipelines. Distributed tracing uses Jaeger (which Uber open-sourced). Their observability tier handles petabytes of data per day.

Cloudflare: Processes ~60 million HTTP requests/second. Uses ClickHouse (columnar OLAP database) for log analytics — queries that would take hours in Elasticsearch complete in seconds. Metrics are aggregated in-network using IPFIX/NetFlow before reaching Prometheus, reducing cardinality.

Interview: Debugging a Latency Spike in Production

The canonical system design interview question. A structured answer:

Step 1: Quantify the scope. Is it all users or a subset? All endpoints or specific ones? One region or global? Check RED metrics dashboards first.

Step 2: Check USE metrics for resources. Is any resource saturated? CPU pegged? Memory pressure causing GC pauses? Network bandwidth? Disk I/O wait?

Step 3: Examine traces. Find representative slow traces. Which service in the call graph is slow? Is it the service itself or a downstream dependency?

Step 4: Check recent deployments. Did anything deploy in the last 30 minutes? Correlate the latency spike timestamp with deploy timestamps.

Step 5: Inspect logs at the bottleneck service. Look for errors, warnings, unusual patterns. Check for lock contention, connection pool exhaustion, cache miss spikes.

Step 6: Check external dependencies. Is the database slow? Are downstream APIs returning 5xx? Is a third-party service degraded?

Step 7: Mitigate. Can you roll back? Can you shed load (circuit breaker)? Can you scale horizontally?

Common culprits in order of frequency:

Database slow query or lock contention
Connection pool exhaustion
GC pause (Java/Go)
Memory pressure causing swap
External API degradation
Bad deployment (config change, OOM)
Traffic spike beyond capacity