Why RNNs lost
Recurrent Neural Networks processed tokens one at a time, left to right. Two fatal flaws came from that sequential design:
- No parallelism during training. Token at step t depends on the hidden state from step t-1. You cannot process the sequence in parallel, so GPU utilisation is terrible.
- Vanishing gradients over long sequences. A signal must survive hundreds of repeated hidden-state updates, so information from a token 500 steps back is heavily diluted by the time it reaches the current position. RNNs forget long-range dependencies — exactly what language understanding needs.
Transformers solved both. Self-attention computes relationships between all pairs of tokens simultaneously, enabling full parallelism and direct long-range connections with no information decay.
Self-attention: Q, K, V
Every token is projected into three vectors via learned weight matrices:
- Query (Q) — “What am I looking for?”
- Key (K) — “What do I contain?”
- Value (V) — “What do I contribute if selected?”
The attention score between token i and token j is the dot product of query i with key j, scaled and softmaxed:
Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V
The √d_k scaling prevents dot products from growing so large that softmax saturates into near-zero gradients. The output for each token is a weighted sum of all Value vectors — tokens that are semantically relevant get higher weight.
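The formula above can be sketched in a few lines of NumPy — a toy single-head version with no batching or masking, just to make the shapes concrete:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # (n, n) pairwise scores
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over keys
    return weights @ V                                 # weighted sum of Values

# Toy example: 4 tokens, d_k = 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)            # shape (4, 8)
```

Each output row is a convex combination of the Value rows — the softmax weights for a token always sum to 1.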
Multi-head attention runs this computation in parallel with different projection matrices, allowing the model to attend to different aspects (syntax, coreference, semantics) simultaneously. Outputs are concatenated and linearly projected back.
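The "split into heads, attend, concatenate, project" pipeline looks roughly like this — a self-contained NumPy sketch (projection matrices and sizes are made up for illustration; real implementations batch and mask):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Split d_model into n_heads, attend per head, concat, project back."""
    n, d_model = X.shape
    d_head = d_model // n_heads
    # Project once, then reshape so each head gets its own (n, d_head) slice
    Q = (X @ Wq).reshape(n, n_heads, d_head).transpose(1, 0, 2)
    K = (X @ Wk).reshape(n, n_heads, d_head).transpose(1, 0, 2)
    V = (X @ Wv).reshape(n, n_heads, d_head).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)    # (heads, n, n)
    heads = softmax(scores) @ V                            # (heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # concatenate heads
    return concat @ Wo                                     # output projection

rng = np.random.default_rng(0)
d_model, n = 16, 5
Wq, Wk, Wv, Wo = (0.1 * rng.normal(size=(d_model, d_model)) for _ in range(4))
X = rng.normal(size=(n, d_model))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads=4)   # shape (5, 16)
```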
Positional encoding
Self-attention is permutation-invariant — it sees a set of tokens, not a sequence. Positional encoding injects order. The original paper used sinusoidal functions of different frequencies, which the model learns to interpret. Modern models (GPT, Llama) use Rotary Position Embedding (RoPE), which rotates each Query and Key vector by a position-dependent angle so that attention scores depend on relative position, enabling better length generalisation.
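The original sinusoidal scheme is easy to write down: even dimensions get sin, odd dimensions get cos, at geometrically spaced frequencies. A short NumPy sketch:

```python
import numpy as np

def sinusoidal_positions(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same)."""
    pos = np.arange(max_len)[:, None]          # (max_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]      # even dimension indices
    angles = pos / 10000 ** (i / d_model)      # one frequency per dim pair
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positions(max_len=128, d_model=64)
# Each row is a unique positional "fingerprint" added to the token embedding.
```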
Encoder vs decoder vs encoder-decoder
| Architecture | Attention type | Training task | Examples |
|---|---|---|---|
| Encoder-only | Bidirectional | Masked language model | BERT, RoBERTa |
| Decoder-only | Causal (left-to-right) | Next-token prediction | GPT, Llama, Mistral |
| Encoder-decoder | Enc: bidirectional, Dec: causal | Seq2seq | T5, BART, Whisper |
Decoder-only models dominate production LLM deployments. Causal attention masks future tokens during training — each token can only attend to preceding tokens. At inference time this means autoregressive generation: the model samples one token, appends it, and runs again.
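Causal masking is typically implemented by adding −∞ to the score of every future position before the softmax, which zeroes its weight. A minimal sketch with uniform scores:

```python
import numpy as np

def causal_mask(n):
    """Upper-triangular -inf mask: position i may not attend to j > i."""
    return np.triu(np.full((n, n), -np.inf), k=1)

def masked_softmax(scores):
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n = 4
scores = np.zeros((n, n)) + causal_mask(n)   # uniform scores, future masked
weights = masked_softmax(scores)
# Row 0 attends only to itself; row 3 attends equally to all four tokens.
```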
Context window: memory and cost
The context window is the maximum number of tokens the model can attend to in one forward pass. Attention scores are O(n²) in sequence length — doubling the context quadruples the attention computation.
More importantly, the KV cache (Key and Value tensors for every token in every layer) lives in GPU VRAM. For a 70B model with a 128k context window:
KV cache ≈ 2 × n_layers × n_kv_heads × d_head × context_length × bytes_per_element
         ≈ 2 × 80 × 8 × 128 × 128_000 × 2 bytes ≈ ~42 GB per request
(The leading 2 covers Key plus Value; n_kv_heads is 8, not the full head count, because Llama-70B uses grouped-query attention; fp16 is 2 bytes per element.)
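The arithmetic is worth being able to do on a whiteboard. A small calculator for the formula above (the 70B config values match the example):

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, context_len, bytes_per_elem=2):
    """2x for Key + Value, cached at every layer for every token."""
    return 2 * n_layers * n_kv_heads * d_head * context_len * bytes_per_elem

# Llama-70B-style config: 80 layers, GQA with 8 KV heads, d_head = 128, fp16
gb = kv_cache_bytes(80, 8, 128, 128_000) / 1e9
print(f"{gb:.1f} GB")  # → 41.9 GB
```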
This is why context window size drives infra cost, not just latency. A single long-context request can consume more VRAM than dozens of short requests. Serving systems like vLLM use paged attention to share and reclaim KV cache memory across concurrent requests.
Interview angles
“Design an LLM-backed customer support system.” The interviewer is testing whether you understand that:
- Long conversation histories balloon context → per-request token cost grows linearly with context length, and attention compute grows quadratically
- You may need a conversation summarisation or sliding window strategy to cap context
- KV cache reuse across turns (prefix caching) can cut cost dramatically if system prompts are stable
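One way to cap context is a token-budget sliding window that always keeps the system prompt first (a stable prefix is what makes prefix caching effective). A hypothetical sketch — the function name and whitespace token counting are illustrative; a real system would count with the model's tokenizer:

```python
def windowed_context(system_prompt, turns, max_tokens,
                     count_tokens=lambda s: len(s.split())):
    """Keep the system prompt plus as many recent turns as fit the budget.
    Whitespace splitting stands in for a real tokenizer here."""
    budget = max_tokens - count_tokens(system_prompt)
    kept = []
    for turn in reversed(turns):          # walk newest-first
        cost = count_tokens(turn)
        if cost > budget:
            break                          # oldest turns fall out of window
        kept.append(turn)
        budget -= cost
    return [system_prompt] + list(reversed(kept))

ctx = windowed_context("You are a support agent.",
                       ["hi", "hello, how can I help?", "my order is late"],
                       max_tokens=12)
# Only the newest turn fits the remaining budget alongside the system prompt.
```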
“Why is GPT-4 slower than GPT-3.5?” More layers, more heads, larger d_model — attention compute scales with model size, and the KV cache is proportionally larger, reducing batch sizes.
Key numbers to know:
| Model class | Typical context | KV cache (fp16, GQA, approx per req) |
|---|---|---|
| 7B model, 8k ctx | 8 192 tokens | ~1 GB |
| 70B model, 32k ctx | 32 768 tokens | ~11 GB |
| 70B model, 128k ctx | 131 072 tokens | ~42 GB |