AI Engineering · Topic 8 of 8

AI Infrastructure & LLM Serving

Why LLM serving is hard

LLMs are unlike any other web service workload:

  • Stateful generation — autoregressive decoding produces one token at a time; each step depends on all previous steps
  • Memory-hungry — a single 70B model in fp16 occupies ~140 GB of VRAM; the largest A100 has only 80 GB
  • Variable output length — you cannot know in advance how many tokens the model will generate, so capacity planning is non-trivial
  • Low throughput per GPU — LLM decode is memory-bandwidth-bound, not compute-bound; GPUs sit at 20–30% utilisation during normal serving without advanced batching

The KV cache

During autoregressive generation, the attention computation for token t needs the Key and Value vectors for all previous tokens 1..t-1. Recomputing them from scratch at each step would be O(n²) work. Instead, the KV cache stores and reuses them:

Step 1: token "The"  → compute K,V → cache
Step 2: token "cat"  → compute K,V → cache + load all previous K,V from cache
Step 3: token "sat"  → compute K,V → cache + load all previous K,V from cache
...

Why KV cache dominates GPU memory:

KV cache per request = 2 × n_layers × n_kv_heads × d_head × seq_len × bytes_per_element
# For Llama-3 70B (bf16), 4k context:
= 2 × 80 × 8 × 128 × 4096 × 2 bytes = ~1.3 GB per request

At 4k context, ~1.3 GB per request adds up quickly. A single 80 GB A100 holding ~70 GB of weights (for example, an INT8-quantised 70B model; the fp16 version needs two GPUs) has only ~10 GB to spare, room for roughly 7 concurrent requests. This is why context length drives cost: doubling context halves concurrent request capacity, roughly doubling cost-per-request.
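The formula above can be checked in a few lines of Python. The 10 GB of free VRAM is the illustrative figure from the text, not a measurement:

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, bytes_per_element=2):
    """KV cache size for one request: K and V (factor 2) per layer, per KV head."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per_element

# Llama-3 70B in bf16: 80 layers, 8 KV heads (GQA), head dim 128, 4k context
per_request = kv_cache_bytes(80, 8, 128, 4096)
print(f"{per_request / 1e9:.2f} GB per request")   # 1.34 GB per request

# Free VRAM after weights determines concurrent capacity
free_vram = 10e9   # ~10 GB left on an 80 GB GPU holding ~70 GB of weights
print(int(free_vram // per_request))               # 7
```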

Prefix caching (automatic prefix caching in vLLM; prompt caching in the OpenAI API): if multiple requests share the same prefix (system prompt, RAG preamble), that prefix’s KV cache is computed once and reused. A stable 2k-token system prompt saves significant VRAM and prefill latency.
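A rough sense of what prefix sharing buys, using the same KV-cache formula as above (the request count is an illustrative assumption):

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, bytes_per_element=2):
    """KV cache size for one sequence span (K and V per layer, per KV head)."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per_element

# Shared 2k-token system prompt, 32 concurrent requests (Llama-3 70B shapes)
prefix = kv_cache_bytes(80, 8, 128, 2048)
n_requests = 32
saved = prefix * (n_requests - 1)   # the prefix is stored once, not 32 times
print(f"{saved / 1e9:.1f} GB of KV cache saved")   # 20.8 GB of KV cache saved
```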

Continuous batching

Traditional static batching waits for a full batch before processing. This wastes GPU time when some sequences finish early — the GPU idles waiting for the slowest sequence.

Continuous batching (also called in-flight batching or iteration-level scheduling) processes one token step at a time across all active sequences. When a sequence finishes (hits EOS), a new request immediately takes its slot:

Step t:   [seq A token 5] [seq B token 12] [seq C token 3]
Step t+1: [seq A token 6] [seq B token 13] [seq D token 1]  ← seq C finished, seq D joins

vLLM implements continuous batching with PagedAttention — KV cache is allocated in fixed-size pages (like OS virtual memory), eliminating fragmentation and enabling fine-grained sharing. This typically delivers 2–5× throughput improvement over static batching.
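A toy simulation of iteration-level scheduling (not vLLM’s implementation, just the scheduling idea): sequences of different lengths share a fixed number of slots, and a finished sequence’s slot is refilled on the very next step:

```python
from collections import deque

def continuous_batch(requests, max_slots):
    """Toy iteration-level scheduler. Each step decodes one token for every
    active sequence; finished sequences free their slot immediately.
    `requests` maps request id -> number of tokens it will generate."""
    waiting = deque(requests.items())
    active = {}          # id -> tokens remaining
    steps = 0
    while waiting or active:
        # Refill free slots from the waiting queue (no batch-boundary stall)
        while waiting and len(active) < max_slots:
            rid, n = waiting.popleft()
            active[rid] = n
        steps += 1       # one decode step across all active sequences
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]   # slot freed for the next step
    return steps

print(continuous_batch({"A": 8, "B": 3, "C": 8, "D": 5}, max_slots=2))  # 13
```

With static batches of two, the same four sequences would take max(8, 3) + max(8, 5) = 16 steps; the continuous scheduler finishes in 13 because B’s slot is handed to C as soon as B hits EOS.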

Quantisation

Full-precision (fp32) inference is unnecessary for serving: modern quantisation maintains accuracy while dramatically reducing memory:

| Format | Bits per weight | Memory (7B model) | Accuracy vs fp16 | Throughput gain |
|---|---|---|---|---|
| fp32 | 32 | ~28 GB | baseline | — |
| fp16 / bf16 | 16 | ~14 GB | ~100% | — |
| INT8 (SmoothQuant) | 8 | ~7 GB | ~99% | 1.3–1.8× |
| GPTQ (INT4) | 4 | ~3.5 GB | ~97–99% | 1.5–2× |
| AWQ (INT4) | 4 | ~3.5 GB | ~98–99% | 1.5–2× |
| GGUF Q4_K_M | ~4.5 | ~4 GB | ~98% | CPU-runnable |

GPTQ quantises by minimising per-layer reconstruction error — offline calibration step required. AWQ (Activation-aware Weight Quantisation) protects the 1% of weights with the highest activation magnitudes, achieving better accuracy at the same bit width. Both are supported by vLLM and TGI.
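For intuition, here is the naive round-to-nearest baseline that methods like GPTQ and AWQ improve on: symmetric per-tensor INT8 scaling (real kernels quantise per-channel or per-group, and INT4 formats pack two weights per byte):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8: scale so the largest |w| maps to 127."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=(1024, 1024)).astype(np.float32)  # a weight matrix
q, scale = quantize_int8(w)
err = np.abs(dequantize(q, scale) - w).max()
print(q.nbytes / w.nbytes)        # 0.25 — 4x smaller than fp32
print(err <= scale / 2 + 1e-6)    # True: error bounded by half a quantisation step
```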

Speculative decoding

LLM decode is sequential — you cannot parallelise generating token 5 before token 4 is done. Speculative decoding breaks this constraint:

  1. A small draft model (e.g., 1B params) generates k candidate tokens in one pass — fast
  2. The target model (e.g., 70B) verifies all k tokens in a single parallel forward pass
  3. Accept tokens greedily until the first mismatch; discard from there

If the draft model is accurate, you get up to k + 1 tokens for the price of one target-model pass, since the verification pass also yields the target’s next token after the last accepted one. Typical speedups: 2–3× latency reduction on low-temperature/deterministic outputs (code, structured data); the gain shrinks on creative/high-temperature outputs, where draft and target agree less often.
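The accept/discard loop can be sketched with stand-in models under greedy decoding. In the real algorithm the target scores all k positions in one parallel forward pass, and sampling-based decoding uses a probabilistic acceptance rule; the loop below only shows the greedy control flow:

```python
def speculative_step(draft_next, target_next, context, k):
    """One speculative step: draft proposes k tokens, target verifies them,
    tokens are accepted until the first mismatch. `draft_next`/`target_next`
    map a context tuple to the next token (stand-ins for real models)."""
    # 1. Draft model proposes k tokens autoregressively (cheap)
    proposed, ctx = [], list(context)
    for _ in range(k):
        t = draft_next(tuple(ctx))
        proposed.append(t)
        ctx.append(t)
    # 2. Target verifies each position (on a GPU this is one parallel pass)
    accepted, ctx = [], list(context)
    for t in proposed:
        if target_next(tuple(ctx)) != t:
            break                     # first mismatch: discard the rest
        accepted.append(t)
        ctx.append(t)
    # 3. The target's own next token comes free from the verification pass
    accepted.append(target_next(tuple(ctx)))
    return accepted

# Toy models that agree on the first two tokens, then diverge
draft  = lambda ctx: {0: 1, 1: 2, 2: 9}.get(len(ctx), 0)
target = lambda ctx: {0: 1, 1: 2, 2: 3}.get(len(ctx), 0)
print(speculative_step(draft, target, (), k=3))  # [1, 2, 3]
```

Three tokens emerge from a single target pass here: two accepted draft tokens plus the target’s own token at the mismatch position.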

Latency vs throughput

These are fundamentally at odds in LLM serving:

| Optimise for | Strategy | Cost |
|---|---|---|
| Latency (TTFT) | Small batch size, reserved capacity, prefix caching | High $/request |
| Throughput | Large batch size, continuous batching, quantisation | Low $/request, higher latency |
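A toy bandwidth-bound cost model shows why this tradeoff exists: each decode step must stream the weights plus the batch’s KV cache from VRAM, so larger batches amortise the weight reads at the cost of slower individual steps. All numbers here are illustrative assumptions, not measurements:

```python
# Illustrative figures: 70 GB of weights, 1.3 GB of KV cache per sequence,
# 2 TB/s of memory bandwidth (roughly A100-class)
weights_gb, kv_per_seq_gb, bandwidth_gbs = 70, 1.3, 2000

for batch in (1, 8, 64):
    step_s = (weights_gb + batch * kv_per_seq_gb) / bandwidth_gbs
    tok_per_s_per_seq = 1 / step_s      # per-request latency view
    tok_per_s_total = batch / step_s    # aggregate throughput view
    print(f"batch={batch:3d}  per-seq {tok_per_s_per_seq:5.1f} tok/s"
          f"  total {tok_per_s_total:6.1f} tok/s")
```

Per-sequence speed falls as the batch grows, while total throughput (and so $/token) improves by more than an order of magnitude.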

Deployment patterns

| Pattern | When to use | Notes |
|---|---|---|
| Dedicated GPU instance | Consistent high traffic | Best cost/throughput ratio; no cold starts |
| Serverless GPU (Modal, Replicate, RunPod) | Spiky or low traffic | Cold starts 5–30 s; zero cost at zero traffic |
| Spot / preemptible GPU | Batch jobs, eval runs | 60–80% cheaper; can be interrupted |
| API (OpenAI, Anthropic) | Early stage, uncertain scale | No ops burden; pay-per-token |
| Managed serving (Vertex AI, SageMaker) | Enterprise, compliance | Higher cost, managed scaling |

A rule of thumb: below ~10 000 requests/day, a managed API almost always beats self-hosting on total cost once you account for engineering time. Self-hosting starts to pay off above roughly 100 000 requests/day, and only for models available as open weights.
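The break-even point can be sanity-checked with a back-of-the-envelope calculator. Every parameter below is an illustrative assumption (token volume per request, API pricing, GPU rental rate, batched throughput), and engineering time is deliberately excluded, which is exactly why the true crossover sits well above the naive one:

```python
def self_host_vs_api(requests_per_day, tokens_per_request=1000,
                     api_cost_per_mtok=2.0,        # assumed blended $/M tokens
                     gpu_cost_per_hour=4.0,        # assumed on-demand GPU rate
                     gpu_tokens_per_second=2500):  # assumed batched throughput
    """Rough daily cost of a pay-per-token API vs always-on GPUs."""
    tokens = requests_per_day * tokens_per_request
    api = tokens / 1e6 * api_cost_per_mtok
    gpus = max(1, -(-tokens // (gpu_tokens_per_second * 86400)))  # ceil division
    self_host = gpus * gpu_cost_per_hour * 24
    return api, self_host

for rpd in (10_000, 100_000, 1_000_000):
    api, hosted = self_host_vs_api(rpd)
    print(f"{rpd:>9} req/day: API ${api:8.2f}/day  self-host ${hosted:8.2f}/day")
```

Under these assumptions the API wins at 10 000 requests/day and self-hosting wins at 100 000, matching the rule of thumb in direction if not in exact numbers.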