Why RNNs lost
Recurrent Neural Networks processed tokens one at a time, left to right. Two fatal flaws came from that sequential design:
- No parallelism during training. Token at step t depends on the hidden state from step t-1. You cannot process the sequence in parallel, so GPU utilisation is terrible.
- Vanishing gradients over long sequences. A signal must survive hundreds of repeated hidden-state updates, so information from a token 500 steps back is heavily diluted by the time it reaches the current position. RNNs forget long-range dependencies — exactly what language understanding needs.
Transformers solved both. Self-attention computes relationships between all pairs of tokens simultaneously, enabling full parallelism and direct long-range connections with no information decay.
Self-attention: Q, K, V
Every token is projected into three vectors via learned weight matrices:
- Query (Q) — “What am I looking for?”
- Key (K) — “What do I contain?”
- Value (V) — “What do I contribute if selected?”
The attention score between token i and token j is the dot product of query i with key j, scaled and softmaxed:
Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V
The √d_k scaling prevents dot products from growing so large that softmax saturates into near-zero gradients. The output for each token is a weighted sum of all Value vectors — tokens that are semantically relevant get higher weight.
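The formula above can be sketched in a few lines of NumPy — a toy single-head version with no batching or masking, just to make the shapes concrete:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # (n, n) pairwise scores
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over keys
    return weights @ V                                 # weighted sum of Values

# Toy example: 4 tokens, d_k = 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)            # shape (4, 8)
```

Each output row is a convex combination of the Value rows — the softmax weights for a token always sum to 1.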
Multi-head attention runs this computation in parallel with different projection matrices, allowing the model to attend to different aspects (syntax, coreference, semantics) simultaneously. Outputs are concatenated and linearly projected back.
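The "split into heads, attend, concatenate, project" pipeline looks roughly like this — a self-contained NumPy sketch (projection matrices and sizes are made up for illustration; real implementations batch and mask):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Split d_model into n_heads, attend per head, concat, project back."""
    n, d_model = X.shape
    d_head = d_model // n_heads
    # Project once, then reshape so each head gets its own (n, d_head) slice
    Q = (X @ Wq).reshape(n, n_heads, d_head).transpose(1, 0, 2)
    K = (X @ Wk).reshape(n, n_heads, d_head).transpose(1, 0, 2)
    V = (X @ Wv).reshape(n, n_heads, d_head).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)    # (heads, n, n)
    heads = softmax(scores) @ V                            # (heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # concatenate heads
    return concat @ Wo                                     # output projection

rng = np.random.default_rng(0)
d_model, n = 16, 5
Wq, Wk, Wv, Wo = (0.1 * rng.normal(size=(d_model, d_model)) for _ in range(4))
X = rng.normal(size=(n, d_model))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads=4)   # shape (5, 16)
```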
Positional encoding
Self-attention is permutation-invariant — it sees a set of tokens, not a sequence. Positional encoding injects order. The original paper used sinusoidal functions of different frequencies, which the model learns to interpret. Modern models (GPT, Llama) use Rotary Position Embedding (RoPE), which rotates each Query and Key vector by a position-dependent angle so that attention scores depend on relative position, enabling better length generalisation.
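The original sinusoidal scheme is easy to write down: even dimensions get sin, odd dimensions get cos, at geometrically spaced frequencies. A short NumPy sketch:

```python
import numpy as np

def sinusoidal_positions(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same)."""
    pos = np.arange(max_len)[:, None]          # (max_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]      # even dimension indices
    angles = pos / 10000 ** (i / d_model)      # one frequency per dim pair
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positions(max_len=128, d_model=64)
# Each row is a unique positional "fingerprint" added to the token embedding.
```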
Encoder vs decoder vs encoder-decoder
| Architecture | Attention type | Training task | Examples |
|---|---|---|---|
| Encoder-only | Bidirectional | Masked language model | BERT, RoBERTa |
| Decoder-only | Causal (left-to-right) | Next-token prediction | GPT, Llama, Mistral |
| Encoder-decoder | Enc: bidirectional, Dec: causal | Seq2seq | T5, BART, Whisper |
Decoder-only models dominate production LLM deployments. Causal attention masks future tokens during training — each token can only attend to preceding tokens. At inference time this means autoregressive generation: the model samples one token, appends it, and runs again.
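Causal masking is typically implemented by adding −∞ to the score of every future position before the softmax, which zeroes its weight. A minimal sketch with uniform scores:

```python
import numpy as np

def causal_mask(n):
    """Upper-triangular -inf mask: position i may not attend to j > i."""
    return np.triu(np.full((n, n), -np.inf), k=1)

def masked_softmax(scores):
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n = 4
scores = np.zeros((n, n)) + causal_mask(n)   # uniform scores, future masked
weights = masked_softmax(scores)
# Row 0 attends only to itself; row 3 attends equally to all four tokens.
```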
Context window: memory and cost
The context window is the maximum number of tokens the model can attend to in one forward pass. Attention scores are O(n²) in sequence length — doubling the context quadruples the attention computation.
More importantly, the KV cache (Key and Value tensors for every token in every layer) lives in GPU VRAM. For a 70B model with a 128k context window:
KV cache ≈ 2 × n_layers × n_kv_heads × d_head × context_length × bytes_per_element
         ≈ 2 × 80 × 8 × 128 × 128_000 × 2 bytes ≈ ~42 GB per request
(The leading 2 covers Key plus Value; n_kv_heads is 8, not the full head count, because Llama-70B uses grouped-query attention; fp16 is 2 bytes per element.)
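The arithmetic is worth being able to do on a whiteboard. A small calculator for the formula above (the 70B config values match the example):

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, context_len, bytes_per_elem=2):
    """2x for Key + Value, cached at every layer for every token."""
    return 2 * n_layers * n_kv_heads * d_head * context_len * bytes_per_elem

# Llama-70B-style config: 80 layers, GQA with 8 KV heads, d_head = 128, fp16
gb = kv_cache_bytes(80, 8, 128, 128_000) / 1e9
print(f"{gb:.1f} GB")  # → 41.9 GB
```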
This is why context window size drives infra cost, not just latency. A single long-context request can consume more VRAM than dozens of short requests. Serving systems like vLLM use paged attention to share and reclaim KV cache memory across concurrent requests.
Interview angles
“Design an LLM-backed customer support system.” The interviewer is testing whether you understand that:
- Long conversation histories balloon context → per-request token cost grows linearly with context length, and attention compute grows quadratically
- You may need a conversation summarisation or sliding window strategy to cap context
- KV cache reuse across turns (prefix caching) can cut cost dramatically if system prompts are stable
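One way to cap context is a token-budget sliding window that always keeps the system prompt first (a stable prefix is what makes prefix caching effective). A hypothetical sketch — the function name and whitespace token counting are illustrative; a real system would count with the model's tokenizer:

```python
def windowed_context(system_prompt, turns, max_tokens,
                     count_tokens=lambda s: len(s.split())):
    """Keep the system prompt plus as many recent turns as fit the budget.
    Whitespace splitting stands in for a real tokenizer here."""
    budget = max_tokens - count_tokens(system_prompt)
    kept = []
    for turn in reversed(turns):          # walk newest-first
        cost = count_tokens(turn)
        if cost > budget:
            break                          # oldest turns fall out of window
        kept.append(turn)
        budget -= cost
    return [system_prompt] + list(reversed(kept))

ctx = windowed_context("You are a support agent.",
                       ["hi", "hello, how can I help?", "my order is late"],
                       max_tokens=12)
# Only the newest turn fits the remaining budget alongside the system prompt.
```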
“Why is GPT-4 slower than GPT-3.5?” More layers, more heads, larger d_model — attention compute scales with model size, and the KV cache is proportionally larger, reducing batch sizes.
Key numbers to know:
| Model class | Typical context | KV cache (fp16, GQA, approx per req) |
|---|---|---|
| 7B model, 8k ctx | 8 192 tokens | ~1 GB |
| 70B model, 32k ctx | 32 768 tokens | ~11 GB |
| 70B model, 128k ctx | 131 072 tokens | ~42 GB |