The Three Pillars: Why You Need All Three
A system is observable if you can understand its internal state from its external outputs alone — without modifying the code or attaching a debugger. Observability is not just about having dashboards. It’s about having enough data to diagnose a failure you’ve never seen before.
The three pillars are complementary, not redundant:
Pillar What it answers Granularity Cost
--------- -------------------------------- ------------- ----
Metrics "Is the system healthy right now?" Aggregate Low
Traces "Where is this request slow?" Per-request Medium
Logs "What exactly happened here?" Per-event High
A latency spike: metrics tell you something is wrong, traces tell you which service is slow, logs tell you what the error was. Without all three, you’re debugging blind.
The Observability Pipeline:
Application
│
├──► Metrics (counters/gauges/histos) ──► Prometheus ──► Grafana
│
├──► Traces (spans) ──────────────────► Jaeger / Tempo ──► Grafana
│
└──► Logs (structured JSON) ──────────► Loki / Elasticsearch ──► Grafana / Kibana
Metrics: Time-Series Data
A metric is a numeric measurement taken at a point in time, associated with a name and a set of labels (key-value pairs). Metrics are cheap to store and query across time ranges — they are the backbone of alerting.
The four metric types (Prometheus model):
Counter: Monotonically increasing integer. Never decreases (only reset to 0 on process restart). Use for: requests served, errors, bytes sent.
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200",path="/api/users"} 14923
http_requests_total{method="POST",status="500",path="/api/orders"} 7
Rate of increase is more useful than raw value:
rate(http_requests_total[5m]) # requests/second over last 5 minutes
Gauge: Arbitrary value that can go up or down. Use for: current connections, memory usage, queue depth, CPU temperature.
node_memory_MemAvailable_bytes
container_cpu_usage_seconds_total
pg_stat_activity_count{datname="production"}
Histogram: Samples observations into configurable buckets. Exposes _bucket, _count, _sum. Use for: request latency, response size — anything you want percentiles for.
http_request_duration_seconds_bucket{le="0.005"} 1234
http_request_duration_seconds_bucket{le="0.01"} 2891
http_request_duration_seconds_bucket{le="0.025"} 7234
http_request_duration_seconds_bucket{le="0.1"} 14100
http_request_duration_seconds_bucket{le="+Inf"} 14923
http_request_duration_seconds_count 14923
http_request_duration_seconds_sum 1847.2
Summary: Like a histogram but calculates quantiles on the client side. Cheaper to query but cannot be aggregated across instances — use histograms in distributed systems.
Prometheus Data Model and PromQL
Prometheus is a pull-based metrics system. It scrapes /metrics endpoints on a configurable interval (typically 15s). This is the inverse of push-based systems (StatsD, InfluxDB): the monitoring system controls collection, not the application.
PromQL — the query language:
# Error rate: ratio of 5xx responses to total
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
# p99 latency from a histogram
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)
# CPU saturation: per-core usage over 1 minute
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100)
# Apdex score: fraction of requests under 300ms (satisfied) +
# half of requests between 300ms and 1200ms (tolerating)
(
sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))
+
sum(rate(http_request_duration_seconds_bucket{le="1.2"}[5m]))
) / 2
/
sum(rate(http_request_duration_seconds_count[5m]))
Label cardinality is the performance killer. Each unique combination of label values is a separate time series. Avoid high-cardinality labels like user_id or request_id — they create millions of time series and OOM Prometheus.
RED and USE Methods
Two frameworks for knowing which metrics to create for any given system:
RED Method (Tom Wilkie) — for services (anything that handles requests):
- Rate: requests per second
- Errors: errors per second (or error ratio)
- Duration: latency distribution (p50, p95, p99)
Every service should expose these three metrics. They directly answer “is this service healthy?”
USE Method (Brendan Gregg) — for resources (CPU, disk, network):
- Utilization: % of time the resource is busy
- Saturation: queue depth / wait time (work it can’t process yet)
- Errors: error events
Resource Utilization Saturation Errors
---------- ---------------- ------------------- ------
CPU cpu_usage % run queue length machine check
Memory used / total paging / swapping OOM kills
Disk iops used / max wait queue io errors
Network NIC bandwidth used/max tx/rx drops errors/drops
Database Pool connections used wait queue depth connection errors
SLI, SLO, SLA and Error Budgets
SLI (Service Level Indicator): A quantitative measure of service behavior. Must be measurable. Example: ratio of HTTP requests completing in <200ms.
SLO (Service Level Objective): A target value for an SLI. “99.9% of requests complete in <200ms over a rolling 28-day window.” This is your internal commitment.
SLA (Service Level Agreement): A business contract with consequences for breach (refunds, credits). Typically less strict than your internal SLO — you need margin.
Error Budget: error_budget = 1 - SLO. If your SLO is 99.9%, your error budget is 0.1% of requests. In 28 days = 28 * 24 * 60 = 40,320 minutes. 0.1% of that = 40.3 minutes of downtime you’re “allowed.”
Error budgets make reliability concrete: if your budget is exhausted, stop feature work and focus on reliability. If your budget is healthy, you have capacity to take risks.
Burn rate alerts (Google SRE Book approach):
Burn rate = actual error rate / error budget rate
If error budget depletes in:
1 hour → burn rate = 28*24 = 672x (page immediately)
6 hours → burn rate = 112x (page in 5 minutes)
3 days → burn rate = ~9x (ticket, next business day)
28 days → burn rate = 1x (on track to exactly exhaust budget)
Multi-window alerts: alert when burn rate is high over both a short window (1h — sensitivity to spikes) and a long window (6h — filters out noise). This reduces false positives compared to simple threshold alerts.
Percentiles: Why Average Latency Lies
Request latencies (ms): 10, 12, 11, 9, 10, 11, 10, 12, 850, 10
Average: (10+12+11+9+10+11+10+12+850+10) / 10 = 94.5ms
p50: 10ms (median — half of users see this or better)
p95: 850ms (1 in 20 users)
p99: 850ms (1 in 100 users)
The average is 94.5ms but 90% of users see 10-12ms and 10% see 850ms. The average tells you nothing useful. A service can have “good average latency” while 1 in 100 users experiences timeouts.
Long-tail latency is particularly damaging in microservices. If a page requires 10 parallel service calls and each has 1% p99 latency, the probability that at least one call is slow = 1 - (0.99)^10 = 9.6%. Nearly 10% of page loads are slow.
Histograms can be aggregated; summaries cannot. When you have 10 instances of a service, you can sum() their histogram buckets and then compute histogram_quantile(). You cannot average pre-computed quantiles — it’s mathematically invalid.
Distributed Tracing
In a monolith, a stack trace tells you exactly where time was spent. In microservices, a single user request may traverse 20 services — each with its own logs, no shared stack.
Distributed tracing reconstructs the causal chain of operations across service boundaries.
User Request ──► API Gateway ──► Auth Service
──► User Service ──► Postgres
──► Product Service ──► Redis
──► Inventory Service ──► Kafka
A trace is a tree of spans. Each span represents a unit of work with:
- Trace ID (same for all spans in the request)
- Span ID (unique per span)
- Parent Span ID (links to caller)
- Start time + duration
- Service name, operation name
- Tags (key-value metadata) and logs (timestamped events)
TraceID: abc123
│
├─ [API Gateway] handle_request 0ms ──────────────────────────── 245ms
│ ├─ [Auth] verify_token 2ms ───── 18ms
│ ├─ [User] get_user 20ms ──────────── 45ms
│ │ └─ [Postgres] SELECT 21ms ──── 43ms ← slow query!
│ └─ [Product] get_cart 22ms ─────────────────────── 240ms
│ ├─ [Redis] GET 23ms ─ 24ms
│ └─ [Inventory] check 25ms ──────────────────── 238ms ← bottleneck
Without tracing, you’d see elevated API Gateway p99 latency but have no way to attribute it to Inventory Service.
OpenTelemetry
OpenTelemetry (OTel) is the CNCF standard for instrumentation. It replaces vendor-specific SDKs (Jaeger client, Zipkin client, AWS X-Ray SDK) with a single unified API.
Architecture:
Application SDK
│
▼
OTel Collector (agent or gateway)
│
├──► Jaeger / Grafana Tempo (traces)
├──► Prometheus (metrics)
└──► Loki / Elasticsearch (logs)
W3C Trace Context propagation: Spans are linked across services via HTTP headers:
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01- Format:
version-traceId-parentSpanId-flags
- Format:
tracestate: vendor-specific key-value pairs
// Node.js OpenTelemetry setup
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { Resource } from '@opentelemetry/resources';
import { SEMRESATTRS_SERVICE_NAME } from '@opentelemetry/semantic-conventions';
const sdk = new NodeSDK({
resource: new Resource({
[SEMRESATTRS_SERVICE_NAME]: 'order-service',
}),
traceExporter: new OTLPTraceExporter({
url: 'http://otel-collector:4318/v1/traces',
}),
metricReader: new PeriodicExportingMetricReader({
exporter: new OTLPMetricExporter({
url: 'http://otel-collector:4318/v1/metrics',
}),
exportIntervalMillis: 15000,
}),
});
sdk.start();
// Manual span creation
import { trace, SpanStatusCode } from '@opentelemetry/api';
const tracer = trace.getTracer('order-service');
async function processOrder(orderId: string): Promise<void> {
const span = tracer.startSpan('processOrder', {
attributes: { 'order.id': orderId },
});
try {
await span.setAttribute('order.items', 3);
await chargePayment(orderId);
span.setStatus({ code: SpanStatusCode.OK });
} catch (err) {
span.setStatus({ code: SpanStatusCode.ERROR, message: String(err) });
span.recordException(err as Error);
throw err;
} finally {
span.end();
}
}
Sampling Strategies
A high-traffic service may process 100,000 requests/second. Tracing every one costs ~2% CPU and massive storage. You must sample.
Head-based sampling: The decision to sample is made at trace entry (the first service). All downstream services inherit the decision. Simple to implement; configured by sampleratio (e.g., 1%).
Problem: 1% sampling means 99% of traces are discarded. A rare error affecting 0.01% of traffic may produce zero sampled traces.
Tail-based sampling: Collect all spans; make the sampling decision after the trace completes, based on outcomes. Keep 100% of traces with errors, 100% of traces with high latency, 1% of successful fast traces.
Tail-based sampling in OTel Collector:
- Buffer all spans for a trace for up to 10 seconds
- When trace completes (or timeout): apply policies
- policy: latency > 1000ms → keep
- policy: status_code == ERROR → keep
- policy: random_sample 1% → keep
- else: drop
Tail-based sampling requires buffering all spans in memory, which is expensive. Google Dapper and Jaeger’s tail sampler do this.
Structured Logging
Unstructured logs are for humans. Structured logs are for machines.
# Unstructured (hard to query):
2024-01-15 10:23:45 ERROR failed to process order 12345 for user 67890: payment declined
# Structured JSON (queryable):
{
"timestamp": "2024-01-15T10:23:45.123Z",
"level": "error",
"service": "order-service",
"trace_id": "4bf92f3577b34da6a",
"span_id": "00f067aa0ba902b7",
"user_id": "67890",
"order_id": "12345",
"error": "payment_declined",
"payment_gateway": "stripe",
"amount_cents": 4999,
"message": "Failed to process order"
}
The trace_id / span_id fields are the critical link: from a trace in Jaeger, you can jump directly to all logs for that request in Kibana or Loki.
Log levels and when to use them:
| Level | Use case | Example |
|---|---|---|
TRACE | Extremely verbose, dev only | Function entry/exit |
DEBUG | Useful for debugging, off in prod | SQL query text |
INFO | Normal operation milestones | Order created, user logged in |
WARN | Unexpected but recoverable | Cache miss, retry attempt |
ERROR | Operation failed, human attention needed | Payment failed |
FATAL | Process cannot continue | DB connection lost at startup |
Dynamic log level adjustment: Production systems should support changing log levels without restart (via config reload or admin API). Raise to DEBUG only when investigating; lower to INFO/WARN normally.
Log Aggregation: ELK and Loki
Elasticsearch + Kibana (ELK):
- Logstash or Filebeat ships logs to Elasticsearch
- Elasticsearch indexes every field — full-text search and structured queries
- Expensive at scale (high storage, high CPU for indexing)
- Powerful: query
user_id:67890 AND level:erroracross all services
Grafana Loki:
- Stores logs as compressed chunks; only indexes labels (service, level, env)
- Query language (LogQL) similar to PromQL
- Much cheaper than Elasticsearch for high-volume logs
- Trade-off: label-based filtering only; full-text grep is slower
# Loki: find all errors from order-service in last 5 minutes
{service="order-service", level="error"} |= "payment"
# Extract latency from log line and create metric
{service="api-gateway"}
| json
| latency_ms > 1000
| line_format "{{.user_id}} {{.path}} {{.latency_ms}}"
# Rate of errors per service (LogQL metric query)
sum by (service) (
rate({level="error"}[5m])
)
eBPF: Zero-Code Instrumentation
Extended Berkeley Packet Filter (eBPF) allows running sandboxed programs in the Linux kernel in response to events — without modifying application code or restarting services.
Tools like Cilium (networking), Pixie (observability), and Parca (continuous profiling) use eBPF to:
- Intercept syscalls to capture latency of disk I/O, network calls
- Map network connections between services automatically (no service mesh sidecar needed)
- Profile CPU usage at the function level with <1% overhead
- Trace HTTP/gRPC requests at the kernel level (no SDK changes)
eBPF probes attach to kernel functions (kprobe), userspace functions (uprobe), or tracepoints. The BPF program runs in a verified sandbox — it cannot crash the kernel.
# bpftrace: count syscalls by process
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
# Trace slow disk reads (>10ms)
bpftrace -e 'kretprobe:blk_account_io_done /args->rq->io_start_time_ns > 0/ {
$lat = (nsecs - args->rq->io_start_time_ns) / 1000000;
if ($lat > 10) { printf("slow IO: %dms\n", $lat); }
}'
Alerting Architecture
Prometheus → Alertmanager → PagerDuty/Slack
# Prometheus alert rule
groups:
- name: order-service
rules:
- alert: HighErrorRate
expr: |
sum(rate(http_requests_total{service="order-service",status=~"5.."}[5m]))
/
sum(rate(http_requests_total{service="order-service"}[5m]))
> 0.01
for: 5m # must be true for 5 consecutive minutes (avoids flapping)
labels:
severity: critical
team: platform
annotations:
summary: "Order service error rate > 1%"
runbook_url: "https://wiki/runbooks/order-service-errors"
description: "Error rate is {{ $value | humanizePercentage }}"
Alert fatigue is the #1 failure mode of alerting systems. When alerts are too noisy, on-call engineers begin ignoring them — including real incidents. Solutions:
- Alerts should be actionable — every alert needs a runbook
- Use
for:duration to avoid flapping alerts - Inhibition rules: suppress low-severity alerts when high-severity fires (don’t alert “high latency” when “service down”)
- Routing: critical → pager, warning → Slack, info → log only
Severity levels:
| Level | Meaning | Response |
|---|---|---|
| P1/Critical | Customer-impacting, SLO breaching | Page on-call immediately |
| P2/High | Degraded but functional, budget burning | Page within 30 min |
| P3/Medium | Non-urgent, investigate next day | Ticket |
| P4/Low | Informational | Dashboard only |
Real-World Observability Architectures
Uber: At 100M+ trips/day, Uber runs their own time-series database (M3) capable of ingesting billions of metrics per minute. They maintain separate Kafka topics per metric category, with consumers writing to M3 and alerting pipelines. Distributed tracing uses Jaeger (which Uber open-sourced). Their observability tier handles petabytes of data per day.
Cloudflare: Processes ~60 million HTTP requests/second. Uses ClickHouse (columnar OLAP database) for log analytics — queries that would take hours in Elasticsearch complete in seconds. Metrics are aggregated in-network using IPFIX/NetFlow before reaching Prometheus, reducing cardinality.
Interview: Debugging a Latency Spike in Production
The canonical system design interview question. A structured answer:
Step 1: Quantify the scope. Is it all users or a subset? All endpoints or specific ones? One region or global? Check RED metrics dashboards first.
Step 2: Check USE metrics for resources. Is any resource saturated? CPU pegged? Memory pressure causing GC pauses? Network bandwidth? Disk I/O wait?
Step 3: Examine traces. Find representative slow traces. Which service in the call graph is slow? Is it the service itself or a downstream dependency?
Step 4: Check recent deployments. Did anything deploy in the last 30 minutes? Correlate the latency spike timestamp with deploy timestamps.
Step 5: Inspect logs at the bottleneck service. Look for errors, warnings, unusual patterns. Check for lock contention, connection pool exhaustion, cache miss spikes.
Step 6: Check external dependencies. Is the database slow? Are downstream APIs returning 5xx? Is a third-party service degraded?
Step 7: Mitigate. Can you roll back? Can you shed load (circuit breaker)? Can you scale horizontally?
Common culprits in order of frequency:
- Database slow query or lock contention
- Connection pool exhaustion
- GC pause (Java/Go)
- Memory pressure causing swap
- External API degradation
- Bad deployment (config change, OOM)
- Traffic spike beyond capacity