AI Engineering · Topic 2 of 8

Tokenisation & Embeddings


Text is not tokens

The first thing that trips up engineers new to LLMs: tokens are not words. They are subword units produced by a compression algorithm. Understanding this is non-negotiable for cost estimation and debugging.

"hello world"       → ["hello", " world"]           # 2 tokens
"unbelievably"      → ["un", "believ", "ably"]       # 3 tokens
"ChatGPT"           → ["Chat", "G", "PT"]            # 3 tokens
"17/11/2024"        → ["17", "/", "11", "/", "2024"] # 5 tokens

A rough rule of thumb: 1 token ≈ 4 characters ≈ ¾ of an English word. But code, numbers, and non-English text are far less efficient — Python code can be 2–3× more tokens per character than prose.
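The rule of thumb can be turned into a quick pre-flight estimator. This is a sketch only — the ratios are the approximations from above, not measured values, and the real count always requires the model's tokeniser:

```python
def rough_token_estimate(text: str, is_code: bool = False) -> int:
    """Very rough token estimate: ~4 chars per token for English prose.
    Code and non-English text tokenise less efficiently, so we apply
    a crude 2 chars-per-token assumption for code."""
    chars_per_token = 2.0 if is_code else 4.0
    return max(1, round(len(text) / chars_per_token))

rough_token_estimate("hello world")  # → 3 (actual tiktoken count: 2)
```

Good enough for a dashboard estimate; never good enough for checking against a hard context limit.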

Byte-Pair Encoding (BPE)

BPE is the dominant tokenisation algorithm (used by GPT, Llama, Mistral). It is a bottom-up compression scheme:

  1. Start with a vocabulary of individual bytes (256 entries).
  2. Count the most frequent adjacent pair of tokens in the training corpus.
  3. Merge that pair into a new token. Add it to the vocabulary.
  4. Repeat until the vocabulary reaches the target size (commonly 32k–100k tokens).

The result is that common words like “the” become single tokens, while rare or compound words are split. The vocabulary is fixed after training — the tokeniser cannot learn new tokens without retraining.

# Conceptual BPE merge step
corpus = ["l o w", "l o w e r", "n e w e r"]
# Merge pair ("e", "r") → "er"
corpus = ["l o w", "l o w er", "n e w er"]
# Next merge, e.g. ("l", "o") → "lo", then ("lo", "w") → "low"
# ... continues until vocab size reached
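The merge loop above can be written in a few lines of Python. This is a toy illustration of the counting-and-merging step, not a production tokeniser — real BPE implementations operate on bytes, record the ordered merge list, and handle ties deterministically:

```python
from collections import Counter

def most_frequent_pair(corpus):
    """Count adjacent token pairs across all words; return the most common."""
    pairs = Counter()
    for word in corpus:
        toks = word.split()
        pairs.update(zip(toks, toks[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    """Replace every occurrence of the adjacent pair with its concatenation."""
    merged = []
    for word in corpus:
        toks = word.split()
        out, i = [], 0
        while i < len(toks):
            if i + 1 < len(toks) and (toks[i], toks[i + 1]) == pair:
                out.append(toks[i] + toks[i + 1])
                i += 2
            else:
                out.append(toks[i])
                i += 1
        merged.append(" ".join(out))
    return merged

corpus = ["l o w", "l o w e r", "n e w e r"]
for _ in range(3):
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)
    print(pair, corpus)
```

Note that several pairs tie for most frequent in this tiny corpus, so the exact merge order depends on tie-breaking — which is why real tokenisers ship their learned merge list alongside the vocabulary.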

Why this matters for engineers: You cannot cheaply count tokens without the tokeniser. “Is my 5 000-word document within the 8k token limit?” — you must run it through the tokeniser. OpenAI’s tiktoken library does this client-side in microseconds.

Token economics

import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")
tokens = enc.encode("Your prompt text here")
cost = len(tokens) * 5e-6  # $5 per 1M input tokens (gpt-4o, May 2025)

At scale, token count becomes a primary cost lever:

  • Trim unnecessary system prompt boilerplate
  • Use shorter output formats (JSON with abbreviations vs verbose prose)
  • Cache repeated prefixes (some providers charge 0% for cache hits)

Dense vector embeddings

An embedding model maps text to a fixed-length float vector. Semantically similar text maps to nearby vectors in that space. This is not word frequency — it is a learned geometric representation.

from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")  # 384 dimensions

vecs = model.encode([
    "The bank raised interest rates.",
    "The financial institution increased borrowing costs.",
    "She sat on the river bank."
])
# vecs[0] ≈ vecs[1]  (cosine similarity ~0.92)
# vecs[0] vs vecs[2] (cosine similarity ~0.14)

The model disambiguates “bank” (finance vs geography) through context. This is impossible with keyword search.
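"Nearby" in embedding space is usually measured with cosine similarity — the cosine of the angle between two vectors. A minimal sketch (the `vecs` references in the comments assume the sentence_transformers example above):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """1.0 = same direction, 0.0 = orthogonal (unrelated), -1.0 = opposite."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# With the vectors from the example above:
# cosine_similarity(vecs[0], vecs[1])  → high (paraphrases)
# cosine_similarity(vecs[0], vecs[2])  → low  (different sense of "bank")
```

Many embedding models emit unit-length vectors, in which case cosine similarity reduces to a plain dot product — a cheap optimisation worth checking for.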

Common embedding models

Model                     Dimensions   Max tokens   Best for
text-embedding-3-small    1 536        8 191        Cost-sensitive RAG
text-embedding-3-large    3 072        8 191        Higher accuracy needs
all-MiniLM-L6-v2          384          256          On-device, low latency
bge-large-en-v1.5         1 024        512          Open-source, strong perf
nomic-embed-text          768          8 192        Long-document open-source

Dimensionality tradeoffs

Higher dimensions capture finer-grained semantic distinctions but cost more:

Dim     Storage per million vectors   ANN index build time   Retrieval latency
384     ~1.5 GB (float32)             Fast                   Very low
768     ~3 GB                         Moderate               Low
1 536   ~6 GB                         Slow                   Moderate
3 072   ~12 GB                        Very slow              Higher

OpenAI’s text-embedding-3 models support Matryoshka truncation — you can truncate the vector to fewer dimensions (e.g., 256) and retain most accuracy. Useful when storage or latency dominates.
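Mechanically, truncation is just slicing and re-normalising. A sketch — note this only preserves accuracy for models trained with a Matryoshka-style objective; truncating an ordinary embedding this way loses far more:

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dim: int = 256) -> np.ndarray:
    """Keep the first `dim` components, then re-normalise to unit length
    so cosine similarity remains meaningful."""
    v = vec[:dim]
    return v / np.linalg.norm(v)

full = np.random.default_rng(0).normal(size=3072)
short = truncate_embedding(full, 256)
print(short.shape)  # (256,)
```

The OpenAI API also accepts a `dimensions` parameter that does this server-side, so you never pay to transfer or store the full vector.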

Interview angle

“How do you build a semantic search feature?” The expected answer walks through: choose an embedding model matching your latency/cost constraints → embed all documents at index time → store in a vector DB → at query time embed the user query → ANN search → return top-k results. The follow-up will be about what happens when the document exceeds the model’s max token limit — you chunk it (covered in the RAG topic).