Why LLM serving is hard
LLMs are unlike any other web service workload:
- Stateful generation — autoregressive decoding produces one token at a time; each step depends on all previous steps
- Memory-hungry — a single 70B model in fp16 occupies 140 GB VRAM; most A100 GPUs have 80 GB
- Variable output length — you cannot know in advance how many tokens the model will generate, so capacity planning is non-trivial
- Low throughput per GPU — LLM decode is memory-bandwidth-bound, not compute-bound; GPUs sit at 20–30% utilisation during normal serving without advanced batching
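The bandwidth-bound point falls out of a back-of-envelope bound: each generated token must stream every weight from HBM at least once, so memory bandwidth divided by model size caps single-sequence decode speed. The model size and bandwidth figures below are illustrative assumptions, not measurements:

```python
def max_decode_tokens_per_sec(weight_bytes: float, hbm_bytes_per_sec: float) -> float:
    # Upper bound for batch size 1: one full weight read per generated token
    return hbm_bytes_per_sec / weight_bytes

# Illustrative: 70B params in fp16 (~140 GB) on an A100 (~2 TB/s HBM)
tps = max_decode_tokens_per_sec(140e9, 2e12)
print(f"~{tps:.0f} tokens/s at batch size 1")  # ~14 tokens/s
```

Batching is what recovers utilisation: the same weight read is amortised across every sequence in the batch.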
The KV cache
During autoregressive generation, the attention computation for token t needs the Key and Value vectors of all previous tokens 1..t-1. Recomputing them from scratch at every step would repeat O(n) projection work per step, O(n²) in total over an n-token generation. Instead, the KV cache stores and reuses them:
Step 1: token "The" → compute K,V → cache
Step 2: token "cat" → compute K,V → cache + load all previous K,V from cache
Step 3: token "sat" → compute K,V → cache + load all previous K,V from cache
...
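The steps above can be sketched as a toy single-head attention loop — identity "projections" and random hidden states stand in for a real model's learned W_q, W_k, W_v:

```python
import numpy as np

def attend(q, k_cache, v_cache):
    # Attention of the new token's query over ALL cached Keys/Values
    scores = (np.stack(k_cache) @ q) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ np.stack(v_cache)

d_head = 4
rng = np.random.default_rng(0)
k_cache, v_cache = [], []

for token in ["The", "cat", "sat"]:
    x = rng.normal(size=d_head)  # stand-in for the token's hidden state
    # K,V are computed for the NEW token only and appended;
    # earlier entries are loaded from the cache, never recomputed.
    k_cache.append(x)
    v_cache.append(x)
    out = attend(x, k_cache, v_cache)
```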
Why KV cache dominates GPU memory:
KV cache per request = 2 × n_layers × n_kv_heads × d_head × seq_len × bytes_per_element
# For Llama-3 70B (bf16), 4k context:
= 2 × 80 × 8 × 128 × 4096 × 2 bytes = ~1.3 GB per request
At 4k context, a single A100 (80 GB) with the 70B weights quantised to 8-bit (~70 GB) has ~10 GB left for KV cache — room for roughly 7 concurrent requests. This is why context length drives cost: doubling the context halves concurrent request capacity, roughly doubling cost-per-request.
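The per-request formula translates directly into code, reproducing the ~1.3 GB figure for Llama-3 70B:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, d_head: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    # Factor 2 = one Key and one Value vector per token, per layer, per KV head
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per_elem

# Llama-3 70B (bf16, grouped-query attention: 8 KV heads), 4k context
per_request = kv_cache_bytes(n_layers=80, n_kv_heads=8, d_head=128, seq_len=4096)
print(f"{per_request / 1e9:.2f} GB per request")  # 1.34 GB
```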
Prefix caching (available in vLLM and the OpenAI API): if multiple requests share the same prefix (system prompt, RAG preamble), that prefix’s KV cache is computed once and shared. A stable 2k-token system prompt saves significant VRAM and latency.
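The mechanism can be sketched as a store keyed by a hash of the token prefix — `kv_store` and `get_prefix_kv` are illustrative names for this sketch, not vLLM's API:

```python
import hashlib

kv_store = {}  # prefix hash -> prefilled KV (placeholder values in this sketch)

def prefix_key(tokens: list) -> str:
    # Identical system prompts hash to the same entry
    return hashlib.sha256(repr(tokens).encode()).hexdigest()

def get_prefix_kv(tokens: list, compute_kv):
    # The first request with this prefix pays the prefill cost;
    # later requests sharing the prefix reuse the cached KV.
    key = prefix_key(tokens)
    if key not in kv_store:
        kv_store[key] = compute_kv(tokens)
    return kv_store[key]
```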
Continuous batching
Traditional static batching waits for a full batch before processing. This wastes GPU time when some sequences finish early — the GPU idles waiting for the slowest sequence.
Continuous batching (also called in-flight batching or iteration-level scheduling) processes one token step at a time across all active sequences. When a sequence finishes (hits EOS), a new request immediately takes its slot:
Step t: [seq A token 5] [seq B token 12] [seq C token 3]
Step t+1: [seq A token 6] [seq B token 13] [seq D token 1] ← seq C finished, seq D joins
vLLM implements continuous batching with PagedAttention — KV cache is allocated in fixed-size pages (like OS virtual memory), eliminating fragmentation and enabling fine-grained sharing. This typically delivers 2–5× throughput improvement over static batching.
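The scheduling loop can be simulated in a few lines — `finish_prob` is a stand-in for a sequence hitting EOS, and each loop iteration models one batched forward pass:

```python
from collections import deque
import random

def serve(requests, max_batch=3, finish_prob=0.3, seed=0):
    # Iteration-level scheduling: one token step per loop across all
    # active sequences; a finished sequence's slot is refilled immediately.
    rng = random.Random(seed)
    waiting, active = deque(requests), []
    steps = 0
    while waiting or active:
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())  # new request joins mid-flight
        steps += 1  # one batched decode step for every active sequence
        active = [s for s in active if rng.random() > finish_prob]  # EOS check
    return steps
```

Static batching would instead hold every slot until the slowest sequence in the batch finished.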
Quantisation
Full precision inference (fp32) is unnecessary for serving. Modern quantisation maintains accuracy while dramatically reducing memory:
| Format | Bits per weight | Memory (7B model) | Accuracy vs fp16 | Throughput gain |
|---|---|---|---|---|
| fp32 | 32 | ~28 GB | Baseline | 1× |
| fp16 / bf16 | 16 | ~14 GB | ~100% | 1× |
| INT8 (SmoothQuant) | 8 | ~7 GB | ~99% | 1.3–1.8× |
| GPTQ (INT4) | 4 | ~3.5 GB | ~97–99% | 1.5–2× |
| AWQ (INT4) | 4 | ~3.5 GB | ~98–99% | 1.5–2× |
| GGUF Q4_K_M | ~4.5 | ~4 GB | ~98% | CPU-runnable |
GPTQ quantises by minimising per-layer reconstruction error — offline calibration step required. AWQ (Activation-aware Weight Quantisation) protects the 1% of weights with the highest activation magnitudes, achieving better accuracy at the same bit width. Both are supported by vLLM and TGI.
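The simplest version of the idea — symmetric per-tensor int8, with no calibration — shows the memory arithmetic; SmoothQuant, GPTQ and AWQ layer calibration and per-group scales on top of this:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    # Symmetric quantisation: w ≈ scale * q, q in [-127, 127]
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = q.astype(np.float32) * scale  # dequantise
print(q.nbytes / w.nbytes)            # 0.25 -> 4x memory reduction vs fp32
```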
Speculative decoding
LLM decode is sequential — token 5 cannot be generated until token 4 is done. Speculative decoding breaks this constraint:
- A small draft model (e.g., 1B params) generates k candidate tokens in one pass — fast
- The target model (e.g., 70B) verifies all k tokens in a single parallel forward pass
- Accept tokens greedily until the first mismatch; discard from there
If the draft model is accurate, you get k tokens for the price of one target model pass. Typical speedups: 2–3× latency reduction on low-temperature/deterministic outputs (code, structured data), less improvement on creative/high-temperature outputs.
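The draft/verify loop can be sketched with greedy acceptance — `draft_next` and `target_next` are stand-ins for a model's next-token function, and the target's "single parallel pass" is simulated position by position:

```python
def speculative_step(draft_next, target_next, prefix, k):
    # Draft phase: k cheap sequential forward passes
    draft, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)
    # Verify phase: the target scores all k positions in one pass
    # (simulated sequentially here). Greedy rule: accept until the first
    # mismatch; sampled decoding uses a rejection-sampling correction instead.
    accepted, ctx = [], list(prefix)
    for t in draft:
        if target_next(ctx) == t:
            accepted.append(t)
            ctx.append(t)
        else:
            break
    # The target always contributes one token of its own at the cut point
    accepted.append(target_next(ctx))
    return accepted
```

When draft and target agree, one target pass yields k+1 tokens; when they disagree immediately, it yields one, so the target pass is never wasted.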
Latency vs throughput
These are fundamentally at odds in LLM serving:
| Optimise for | Strategy | Cost |
|---|---|---|
| Latency (TTFT) | Small batch size, reserve capacity, prefix caching | High $/request |
| Throughput | Large batch size, continuous batching, quantisation | Low $/request, higher latency |
Deployment patterns
| Pattern | When to use | Notes |
|---|---|---|
| Dedicated GPU instance | Consistent high traffic | Best cost/throughput ratio; no cold starts |
| Serverless GPU (Modal, Replicate, RunPod) | Spiky or low traffic | Cold starts 5–30s; 0 cost at 0 traffic |
| Spot / preemptible GPU | Batch jobs, eval runs | 60–80% cheaper; can be interrupted |
| API (OpenAI, Anthropic) | Early stage, uncertain scale | No ops burden; pay-per-token |
| Managed serving (Vertex AI, SageMaker) | Enterprise, compliance | Higher cost, managed scaling |
The rule of thumb: below ~10 000 requests/day, a managed API almost always beats self-hosting on total cost once engineering time is counted. Self-hosting starts to pay off above ~100 000 requests/day for models available as open weights.
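A back-of-envelope comparison makes the break-even concrete — every dollar figure below is an illustrative assumption, not a quoted price:

```python
def monthly_cost_api(requests_per_day, tokens_per_request, usd_per_1k_tokens):
    return requests_per_day * 30 * tokens_per_request / 1000 * usd_per_1k_tokens

def monthly_cost_selfhost(gpu_usd_per_hour, n_gpus, eng_hours, eng_usd_per_hour):
    # GPU running 24/7 plus the engineering time to operate it
    return gpu_usd_per_hour * 24 * 30 * n_gpus + eng_hours * eng_usd_per_hour

# Assumed: 10k req/day, 1k tokens/req, $0.002/1k tokens vs
# one $2/hr GPU plus 20 engineer-hours/month at $150/hr
api = monthly_cost_api(10_000, 1_000, 0.002)       # $600/month
hosted = monthly_cost_selfhost(2.0, 1, 20, 150)    # $4,440/month
```

Under these assumptions the API wins by ~7× at 10k requests/day; scaling request volume 10× multiplies the API bill but barely moves the self-hosting cost, which is where the crossover comes from.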