Why LLM serving is hard
LLMs are unlike any other web service workload:
- Stateful generation — autoregressive decoding produces one token at a time; each step depends on all previous steps
- Memory-hungry — a single 70B model in fp16 occupies 140 GB VRAM; most A100 GPUs have 80 GB
- Variable output length — you cannot know in advance how many tokens the model will generate, so capacity planning is non-trivial
- Low throughput per GPU — LLM decode is memory-bandwidth-bound, not compute-bound; GPUs sit at 20–30% utilisation during normal serving without advanced batching
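The bandwidth-bound point falls out of a back-of-envelope bound: each generated token must stream every weight from HBM at least once, so memory bandwidth divided by model size caps single-sequence decode speed. The model size and bandwidth figures below are illustrative assumptions, not measurements:

```python
def max_decode_tokens_per_sec(weight_bytes: float, hbm_bytes_per_sec: float) -> float:
    # Upper bound for batch size 1: one full weight read per generated token
    return hbm_bytes_per_sec / weight_bytes

# Illustrative: 70B params in fp16 (~140 GB) on an A100 (~2 TB/s HBM)
tps = max_decode_tokens_per_sec(140e9, 2e12)
print(f"~{tps:.0f} tokens/s at batch size 1")  # ~14 tokens/s
```

Batching is what recovers utilisation: the same weight read is amortised across every sequence in the batch.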
The KV cache
During autoregressive generation, the attention computation for token t needs the Key and Value vectors of all previous tokens 1..t-1. Recomputing them from scratch at every step would repeat O(n) projection work per step, O(n²) in total over an n-token generation. Instead, the KV cache stores and reuses them:
Step 1: token "The" → compute K,V → cache
Step 2: token "cat" → compute K,V → cache + load all previous K,V from cache
Step 3: token "sat" → compute K,V → cache + load all previous K,V from cache
...
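The steps above can be sketched as a toy single-head attention loop — identity "projections" and random hidden states stand in for a real model's learned W_q, W_k, W_v:

```python
import numpy as np

def attend(q, k_cache, v_cache):
    # Attention of the new token's query over ALL cached Keys/Values
    scores = (np.stack(k_cache) @ q) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ np.stack(v_cache)

d_head = 4
rng = np.random.default_rng(0)
k_cache, v_cache = [], []

for token in ["The", "cat", "sat"]:
    x = rng.normal(size=d_head)  # stand-in for the token's hidden state
    # K,V are computed for the NEW token only and appended;
    # earlier entries are loaded from the cache, never recomputed.
    k_cache.append(x)
    v_cache.append(x)
    out = attend(x, k_cache, v_cache)
```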
Why KV cache dominates GPU memory:
KV cache per request = 2 × n_layers × n_kv_heads × d_head × seq_len × bytes_per_element
# For Llama-3 70B (bf16), 4k context:
= 2 × 80 × 8 × 128 × 4096 × 2 bytes = ~1.3 GB per request
At 4k context, a single A100 (80 GB) with the 70B weights quantised to 8-bit (~70 GB) has ~10 GB left for KV cache — room for roughly 7 concurrent requests. This is why context length drives cost: doubling the context halves concurrent request capacity, roughly doubling cost-per-request.
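The per-request formula translates directly into code, reproducing the ~1.3 GB figure for Llama-3 70B:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, d_head: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    # Factor 2 = one Key and one Value vector per token, per layer, per KV head
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per_elem

# Llama-3 70B (bf16, grouped-query attention: 8 KV heads), 4k context
per_request = kv_cache_bytes(n_layers=80, n_kv_heads=8, d_head=128, seq_len=4096)
print(f"{per_request / 1e9:.2f} GB per request")  # 1.34 GB
```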
Prefix caching (available in vLLM and the OpenAI API): if multiple requests share the same prefix (system prompt, RAG preamble), that prefix’s KV cache is computed once and shared. A stable 2k-token system prompt saves significant VRAM and latency.
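The mechanism can be sketched as a store keyed by a hash of the token prefix — `kv_store` and `get_prefix_kv` are illustrative names for this sketch, not vLLM's API:

```python
import hashlib

kv_store = {}  # prefix hash -> prefilled KV (placeholder values in this sketch)

def prefix_key(tokens: list) -> str:
    # Identical system prompts hash to the same entry
    return hashlib.sha256(repr(tokens).encode()).hexdigest()

def get_prefix_kv(tokens: list, compute_kv):
    # The first request with this prefix pays the prefill cost;
    # later requests sharing the prefix reuse the cached KV.
    key = prefix_key(tokens)
    if key not in kv_store:
        kv_store[key] = compute_kv(tokens)
    return kv_store[key]
```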
Continuous batching
Traditional static batching waits for a full batch before processing. This wastes GPU time when some sequences finish early — the GPU idles waiting for the slowest sequence.
Continuous batching (also called in-flight batching or iteration-level scheduling) processes one token step at a time across all active sequences. When a sequence finishes (hits EOS), a new request immediately takes its slot:
Step t: [seq A token 5] [seq B token 12] [seq C token 3]
Step t+1: [seq A token 6] [seq B token 13] [seq D token 1] ← seq C finished, seq D joins
vLLM implements continuous batching with PagedAttention — KV cache is allocated in fixed-size pages (like OS virtual memory), eliminating fragmentation and enabling fine-grained sharing. This typically delivers 2–5× throughput improvement over static batching.
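The scheduling loop can be simulated in a few lines — `finish_prob` is a stand-in for a sequence hitting EOS, and each loop iteration models one batched forward pass:

```python
from collections import deque
import random

def serve(requests, max_batch=3, finish_prob=0.3, seed=0):
    # Iteration-level scheduling: one token step per loop across all
    # active sequences; a finished sequence's slot is refilled immediately.
    rng = random.Random(seed)
    waiting, active = deque(requests), []
    steps = 0
    while waiting or active:
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())  # new request joins mid-flight
        steps += 1  # one batched decode step for every active sequence
        active = [s for s in active if rng.random() > finish_prob]  # EOS check
    return steps
```

Static batching would instead hold every slot until the slowest sequence in the batch finished.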
Quantisation
Full precision inference (fp32) is unnecessary for serving. Modern quantisation maintains accuracy while dramatically reducing memory:
| Format | Bits per weight | Memory (7B model) | Accuracy vs fp16 | Throughput gain |
|---|---|---|---|---|
| fp32 | 32 | ~28 GB | Baseline | 1× |
| fp16 / bf16 | 16 | ~14 GB | ~100% | 1× |
| INT8 (SmoothQuant) | 8 | ~7 GB | ~99% | 1.3–1.8× |
| GPTQ (INT4) | 4 | ~3.5 GB | ~97–99% | 1.5–2× |
| AWQ (INT4) | 4 | ~3.5 GB | ~98–99% | 1.5–2× |
| GGUF Q4_K_M | ~4.5 | ~4 GB | ~98% | CPU-runnable |
GPTQ quantises by minimising per-layer reconstruction error — offline calibration step required. AWQ (Activation-aware Weight Quantisation) protects the 1% of weights with the highest activation magnitudes, achieving better accuracy at the same bit width. Both are supported by vLLM and TGI.
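The simplest version of the idea — symmetric per-tensor int8, with no calibration — shows the memory arithmetic; SmoothQuant, GPTQ and AWQ layer calibration and per-group scales on top of this:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    # Symmetric quantisation: w ≈ scale * q, q in [-127, 127]
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = q.astype(np.float32) * scale  # dequantise
print(q.nbytes / w.nbytes)            # 0.25 -> 4x memory reduction vs fp32
```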
Speculative decoding
LLM decode is sequential — token 5 cannot be generated until token 4 is done. Speculative decoding breaks this constraint:
- A small draft model (e.g., 1B params) generates k candidate tokens in one pass — fast
- The target model (e.g., 70B) verifies all k tokens in a single parallel forward pass
- Accept tokens greedily until the first mismatch; discard from there
If the draft model is accurate, you get k tokens for the price of one target model pass. Typical speedups: 2–3× latency reduction on low-temperature/deterministic outputs (code, structured data), less improvement on creative/high-temperature outputs.
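The draft/verify loop can be sketched with greedy acceptance — `draft_next` and `target_next` are stand-ins for a model's next-token function, and the target's "single parallel pass" is simulated position by position:

```python
def speculative_step(draft_next, target_next, prefix, k):
    # Draft phase: k cheap sequential forward passes
    draft, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)
    # Verify phase: the target scores all k positions in one pass
    # (simulated sequentially here). Greedy rule: accept until the first
    # mismatch; sampled decoding uses a rejection-sampling correction instead.
    accepted, ctx = [], list(prefix)
    for t in draft:
        if target_next(ctx) == t:
            accepted.append(t)
            ctx.append(t)
        else:
            break
    # The target always contributes one token of its own at the cut point
    accepted.append(target_next(ctx))
    return accepted
```

When draft and target agree, one target pass yields k+1 tokens; when they disagree immediately, it yields one, so the target pass is never wasted.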
Latency vs throughput
These are fundamentally at odds in LLM serving:
| Optimise for | Strategy | Cost |
|---|---|---|
| Latency (TTFT) | Small batch size, reserve capacity, prefix caching | High $/request |
| Throughput | Large batch size, continuous batching, quantisation | Low $/request, higher latency |
Deployment patterns
| Pattern | When to use | Notes |
|---|---|---|
| Dedicated GPU instance | Consistent high traffic | Best cost/throughput ratio; no cold starts |
| Serverless GPU (Modal, Replicate, RunPod) | Spiky or low traffic | Cold starts 5–30s; 0 cost at 0 traffic |
| Spot / preemptible GPU | Batch jobs, eval runs | 60–80% cheaper; can be interrupted |
| API (OpenAI, Anthropic) | Early stage, uncertain scale | No ops burden; pay-per-token |
| Managed serving (Vertex AI, SageMaker) | Enterprise, compliance | Higher cost, managed scaling |
The rule of thumb: below ~10 000 requests/day, a managed API almost always beats self-hosting on total cost once engineering time is counted. Self-hosting starts to pay off above ~100 000 requests/day for models available as open weights.
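A back-of-envelope comparison makes the break-even concrete — every dollar figure below is an illustrative assumption, not a quoted price:

```python
def monthly_cost_api(requests_per_day, tokens_per_request, usd_per_1k_tokens):
    return requests_per_day * 30 * tokens_per_request / 1000 * usd_per_1k_tokens

def monthly_cost_selfhost(gpu_usd_per_hour, n_gpus, eng_hours, eng_usd_per_hour):
    # GPU running 24/7 plus the engineering time to operate it
    return gpu_usd_per_hour * 24 * 30 * n_gpus + eng_hours * eng_usd_per_hour

# Assumed: 10k req/day, 1k tokens/req, $0.002/1k tokens vs
# one $2/hr GPU plus 20 engineer-hours/month at $150/hr
api = monthly_cost_api(10_000, 1_000, 0.002)       # $600/month
hosted = monthly_cost_selfhost(2.0, 1, 20, 150)    # $4,440/month
```

Under these assumptions the API wins by ~7× at 10k requests/day; scaling request volume 10× multiplies the API bill but barely moves the self-hosting cost, which is where the crossover comes from.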