Text is not tokens
The first thing that trips up engineers new to LLMs: tokens are not words. They are subword units produced by a compression algorithm. Understanding this is non-negotiable for cost estimation and debugging.
"hello world" → ["hello", " world"] # 2 tokens
"unbelievably" → ["un", "believ", "ably"] # 3 tokens
"ChatGPT" → ["Chat", "G", "PT"] # 3 tokens
"17/11/2024" → ["17", "/", "11", "/", "2024"] # 5 tokens
A rough rule of thumb: 1 token ≈ 4 characters ≈ ¾ of an English word. But code, numbers, and non-English text are far less efficient — Python code can be 2–3× more tokens per character than prose.
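That rule of thumb can be wrapped into a quick pre-flight estimator (a sketch; the 4-characters-per-token ratio and the 2.5× code multiplier are rough assumptions, not measured constants):

```python
def estimate_tokens(text: str, is_code: bool = False) -> int:
    """Rough token estimate: ~4 characters per token for English prose.
    Code tokenises less efficiently, so apply a pessimistic multiplier."""
    base = len(text) / 4
    return int(base * 2.5) if is_code else int(base)

print(estimate_tokens("The quick brown fox jumps over the lazy dog."))  # → 11
```

For billing-accurate counts, always run the real tokeniser; the heuristic is only for quick sanity checks.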
Byte-Pair Encoding (BPE)
BPE is the dominant tokenisation algorithm (used by GPT, Llama, Mistral). It is a bottom-up compression scheme:
- Start with a vocabulary of individual bytes (256 entries).
- Count the most frequent adjacent pair of tokens in the training corpus.
- Merge that pair into a new token. Add it to the vocabulary.
- Repeat until the vocabulary reaches the target size (commonly 32k–100k tokens).
The result is that common words like “the” become single tokens, while rare or compound words are split. The vocabulary is fixed after training — the tokeniser cannot learn new tokens without retraining.
# Conceptual BPE merge step
corpus = ["l o w", "l o w e r", "n e w e r"]
# Merge ("e", "r") → "er"  (several pairs tie at frequency 2; ties are broken arbitrarily)
corpus = ["l o w", "l o w er", "n e w er"]
# Merge ("l", "o") → "lo"
corpus = ["lo w", "lo w er", "n e w er"]
# ... continues until the target vocabulary size is reached
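The merge loop can be made runnable as a toy BPE trainer (a minimal sketch over space-separated symbols; real tokenisers operate on bytes and weight words by corpus frequency, and here ties are broken by first occurrence rather than arbitrarily):

```python
from collections import Counter

def bpe_train(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merges from words written as space-separated symbols."""
    words = [w.split() for w in corpus]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus
        pairs = Counter()
        for word in words:
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair (ties → first seen)
        merges.append(best)
        # Replace every occurrence of the pair with the merged symbol
        new_words = []
        for word in words:
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words.append(out)
        words = new_words
    return merges

print(bpe_train(["l o w", "l o w e r", "n e w e r"], num_merges=3))
# → [('l', 'o'), ('lo', 'w'), ('e', 'r')]
```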
Why this matters for engineers: You cannot cheaply count tokens without the tokeniser. “Is my 5 000-word document within the 8k token limit?” — you must run it through the tokeniser. OpenAI’s tiktoken library does this client-side in microseconds.
Token economics
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")
tokens = enc.encode("Your prompt text here")
cost = len(tokens) * 5e-6 # $5 per 1M input tokens (gpt-4o, May 2025)
At scale, token count becomes a primary cost lever:
- Trim unnecessary system prompt boilerplate
- Use shorter output formats (JSON with abbreviations vs verbose prose)
- Cache repeated prefixes (most providers bill cached input tokens at a steep discount)
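The economics can be made concrete with a per-request cost model (a sketch; the default prices are illustrative — the input rate matches the gpt-4o figure quoted above, the $15/1M output rate is an assumption — check your provider's current rate card):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float = 5.00,
                 output_price_per_m: float = 15.00) -> float:
    """Dollar cost of one request, given per-million-token prices."""
    return (input_tokens * input_price_per_m +
            output_tokens * output_price_per_m) / 1_000_000

# 2k-token prompt, 500-token reply, at the illustrative prices above
print(round(request_cost(2_000, 500), 4))  # → 0.0175
```

Note that output tokens typically cost several times more than input tokens, which is why trimming verbose model output often saves more than trimming the prompt.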
Dense vector embeddings
An embedding model maps text to a fixed-length float vector. Semantically similar text maps to nearby vectors in that space. This is not word frequency — it is a learned geometric representation.
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2") # 384 dimensions
vecs = model.encode([
    "The bank raised interest rates.",
    "The financial institution increased borrowing costs.",
    "She sat on the river bank."
])
# vecs[0] ≈ vecs[1] (cosine similarity ~0.92)
# vecs[0] vs vecs[2] (cosine similarity ~0.14)
The model disambiguates “bank” (finance vs geography) through context. This is impossible with keyword search.
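The similarity scores quoted above are cosine similarities, which can be computed directly (a minimal pure-Python version; in practice NumPy or the model library's own similarity helpers do the same thing faster):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-d vectors: same direction → 1.0, orthogonal → 0.0
print(round(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]), 6))  # → 1.0
print(round(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]), 6))  # → 0.0
```

Because embedding models usually emit unit-length vectors, cosine similarity reduces to a plain dot product in most production code.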
Common embedding models
| Model | Dimensions | Max tokens | Best for |
|---|---|---|---|
| text-embedding-3-small | 1 536 | 8 191 | Cost-sensitive RAG |
| text-embedding-3-large | 3 072 | 8 191 | Higher accuracy needs |
| all-MiniLM-L6-v2 | 384 | 256 | On-device, low latency |
| bge-large-en-v1.5 | 1 024 | 512 | Open-source, strong perf |
| nomic-embed-text | 768 | 8 192 | Long-document open-source |
Dimensionality tradeoffs
Higher dimensions capture finer-grained semantic distinctions but cost more:
| Dim | Storage per million vectors | ANN index build time | Retrieval latency |
|---|---|---|---|
| 384 | ~1.5 GB (float32) | Fast | Very low |
| 768 | ~3 GB | Moderate | Low |
| 1 536 | ~6 GB | Slow | Moderate |
| 3 072 | ~12 GB | Very slow | Higher |
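The storage column follows directly from the arithmetic: dimensions × 4 bytes (float32) × vector count (a quick check that ignores ANN index overhead, which adds more on top):

```python
def storage_gb(dims: int, num_vectors: int, bytes_per_float: int = 4) -> float:
    """Raw float32 storage for a set of embeddings, in GB (1 GB = 1e9 bytes)."""
    return dims * bytes_per_float * num_vectors / 1e9

for dims in (384, 768, 1536, 3072):
    print(dims, round(storage_gb(dims, 1_000_000), 2), "GB")
```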
OpenAI’s text-embedding-3 models support Matryoshka truncation — you can truncate the vector to fewer dimensions (e.g., 256) and retain most accuracy. Useful when storage or latency dominates.
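Matryoshka truncation is just slicing and renormalising (a sketch; this only preserves accuracy when the model was trained with Matryoshka representation learning, as the text-embedding-3 models were — truncating an ordinary embedding destroys it):

```python
import math

def truncate_embedding(vec: list[float], dims: int) -> list[float]:
    """Keep the first `dims` components, then renormalise to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

vec = [0.5, 0.5, 0.5, 0.5]           # toy 4-d "embedding"
short = truncate_embedding(vec, 2)    # ≈ [0.7071, 0.7071]
print(short)
```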
Interview angle
“How do you build a semantic search feature?” The expected answer walks through: choose an embedding model matching your latency/cost constraints → embed all documents at index time → store in a vector DB → at query time embed the user query → ANN search → return top-k results. The follow-up will be about what happens when the document exceeds the model’s max token limit — you chunk it (covered in the RAG topic).
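That pipeline can be sketched end to end with a brute-force in-memory search (the `embed` function here is a hypothetical trigram-hashing toy standing in for a real embedding model, and a linear scan stands in for an ANN index — both substitutions are mine, not part of the answer above):

```python
import math
import zlib

def embed(text: str, dims: int = 32) -> list[float]:
    """Toy deterministic 'embedding': hash character trigrams into buckets.
    Captures surface overlap, not meaning -- a stand-in for a real model."""
    vec = [0.0] * dims
    for i in range(len(text) - 2):
        bucket = zlib.crc32(text[i:i + 3].encode()) % dims
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]     # unit length → cosine = dot product

def search(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Index time: embed every document. Query time: embed the query,
    rank documents by cosine similarity, return the top k."""
    doc_vecs = {d: embed(d) for d in docs}   # built once, reused per query
    q = embed(query)
    ranked = sorted(docs, key=lambda d: -sum(a * b for a, b in zip(q, doc_vecs[d])))
    return ranked[:k]

docs = ["the cat sat on the mat", "dogs bark loudly", "the cat purred softly"]
print(search("the cat", docs))
```

Swapping `embed` for a real model and the `sorted` scan for an HNSW or IVF index turns this toy into the production architecture the interviewer is asking about.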