Fine-tune vs prompt engineer
This is the most important decision. Prompt engineering first, always. Fine-tuning has real costs and risks — do it only when you have a concrete reason.
| Signal | Recommendation |
|---|---|
| < 100 examples | Prompt engineer with few-shot |
| 100–1 000 examples | Evaluate both; fine-tune if accuracy still insufficient |
| > 1 000 high-quality examples | Fine-tuning is worth evaluating |
| Need consistent output format | Fine-tuning wins (reduces token waste on format instructions) |
| Need domain-specific style/tone | Fine-tuning wins |
| Need new factual knowledge | Fine-tuning is unreliable — use RAG instead |
| Latency sensitive (smaller model must match larger model) | Fine-tuning is the path |
The dangerous misconception: fine-tuning teaches the model new facts. It does not — or does so unreliably. It adjusts behaviour, style, and output format. For factual grounding, use RAG.
Full fine-tuning
All model weights are updated. This requires the weights, gradients, fp32 master copy, and optimiser state in VRAM: with mixed-precision Adam, roughly 16 bytes per parameter (2 for bf16 weights, 2 for gradients, 4 for the fp32 master copy, 8 for Adam's two moment buffers), before activations. A 7B model in fp16 is ~14 GB to serve, but full fine-tuning needs on the order of 112 GB, i.e. at minimum 2× A100 80GB GPUs.
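The memory arithmetic can be sketched as a quick estimator. This is a rule of thumb only, assuming the common 16-bytes-per-parameter figure for mixed-precision Adam; activations and framework overhead come on top:

```python
def full_ft_memory_gb(n_params_billion: float, bytes_per_param: float = 16.0) -> float:
    """Estimate VRAM for full fine-tuning with mixed-precision Adam.
    16 bytes/param = 2 (bf16 weights) + 2 (grads) + 4 (fp32 master) + 8 (Adam m and v).
    Activations and CUDA overhead are extra."""
    return n_params_billion * bytes_per_param  # 1e9 params x N bytes == N GB

print(full_ft_memory_gb(7))   # 112.0 -> at least 2x A100 80GB
print(full_ft_memory_gb(13))  # 208.0
```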
Full fine-tuning risks catastrophic forgetting: the model loses general capabilities as it overfits to the training distribution.
PEFT: Parameter-Efficient Fine-Tuning
PEFT methods update a small fraction of weights while freezing the base model. The dominant technique is LoRA.
LoRA — Low-Rank Adaptation
The key insight: weight updates during fine-tuning have low intrinsic rank. Instead of updating the full weight matrix W (shape d × k), LoRA injects two small matrices B (d × r) and A (r × k) where r << min(d, k):
W_new = W_frozen + B × A
At rank r=8 on a weight matrix of shape 4096×4096: original parameters = 16.7M, LoRA parameters = 2 × 4096 × 8 = 65 536. That is 99.6% parameter reduction.
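The arithmetic generalises to any weight shape. A small helper (the function name is mine, not part of any library) reproduces the numbers above:

```python
def lora_param_counts(d: int, k: int, r: int) -> tuple[int, int, float]:
    """Trainable params of a full d x k update vs a rank-r pair B (d x r), A (r x k)."""
    full = d * k            # parameters in the full weight matrix
    lora = r * (d + k)      # parameters in B plus A
    return full, lora, 100.0 * (1.0 - lora / full)

full, lora, reduction = lora_param_counts(4096, 4096, 8)
print(full, lora, round(reduction, 1))  # 16777216 65536 99.6
```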
```python
from peft import get_peft_model, LoraConfig, TaskType

config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                # rank: higher = more capacity, more params, more overfitting risk
    lora_alpha=32,      # scaling factor, typically 2x to 4x r
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],  # which weight matrices to adapt
)

model = get_peft_model(base_model, config)  # base_model: an already-loaded causal LM
model.print_trainable_parameters()
# trainable params: 4,194,304 || all params: 6,742,609,920 || trainable: 0.06%
```
Rank sensitivity: r=4 is often sufficient for style/format tasks. r=16–64 for complex domain adaptation. Higher rank raises the risk of overfitting on small datasets.
QLoRA — Quantised LoRA
QLoRA (Dettmers et al., 2023) loads the base model in 4-bit NF4 quantisation (reducing memory by ~4×), then trains LoRA adapters in bf16. A 7B model that normally needs 14 GB VRAM fits in ~5 GB, enabling fine-tuning on a single consumer GPU.
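A back-of-the-envelope check on that figure, assuming NF4 stores roughly 0.5 bytes per parameter before quantisation metadata:

```python
def qlora_base_weights_gb(n_params_billion: float) -> float:
    """4-bit base weights only: ~0.5 bytes/param. LoRA adapters, activations,
    adapter optimiser state, and CUDA context push real usage to ~5 GB."""
    return n_params_billion * 0.5

print(qlora_base_weights_gb(7))  # 3.5
```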
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)
```
Dataset preparation for instruction tuning
Format: each example is an (instruction, input, output) triple, serialised as a chat-message sequence. Common formats: Alpaca, ShareGPT, ChatML.
```json
{"messages": [
  {"role": "system", "content": "You extract structured data from legal contracts."},
  {"role": "user", "content": "Contract text: ... Extract party names and dates."},
  {"role": "assistant", "content": "{\"party_a\": \"Acme Corp\", \"party_b\": \"XYZ Ltd\", \"effective_date\": \"2024-03-01\"}"}
]}
```
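A small sanity check over each JSONL line catches schema drift before it silently poisons a training run. `validate_example` is a hypothetical helper, not part of any library:

```python
import json

def validate_example(line: str) -> dict:
    """Check one JSONL line follows the chat-messages schema: known roles,
    non-empty string content, and the assistant turn (the target) last."""
    ex = json.loads(line)
    roles = [m["role"] for m in ex["messages"]]
    assert all(r in {"system", "user", "assistant"} for r in roles), "unknown role"
    assert roles[-1] == "assistant", "last message must be the target output"
    for m in ex["messages"]:
        assert isinstance(m["content"], str) and m["content"].strip(), "empty content"
    return ex

line = '{"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello"}]}'
print(validate_example(line)["messages"][-1]["role"])  # assistant
```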
Quality over quantity. 500 carefully curated, diverse, correctly labelled examples beat 10 000 noisy ones. Common quality issues: output inconsistency across similar inputs, outputs that contradict the base model’s capabilities, mislabelled examples. Spend 80% of your dataset effort on cleaning.
Catastrophic forgetting
Fine-tuning on a narrow domain degrades general capabilities. Mitigations:
- Mix a small amount (~10%) of general instruction-following data into your fine-tuning set
- Use PEFT (LoRA) rather than full fine-tuning — the frozen base retains general knowledge
- Evaluate on a held-out general benchmark before and after training
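The first mitigation is simple to implement but easy to get wrong at the margins (the ~10% should be of the final mixed set, not of the domain set). A sketch, using a hypothetical helper over plain lists of examples:

```python
import random

def mix_datasets(domain, general, general_frac=0.10, seed=0):
    """Blend general instruction data into a domain set so that general
    examples make up ~general_frac of the final mixture."""
    rng = random.Random(seed)
    n_general = round(len(domain) * general_frac / (1 - general_frac))
    mixed = domain + rng.sample(general, min(n_general, len(general)))
    rng.shuffle(mixed)
    return mixed

mixed = mix_datasets(["domain"] * 900, ["general"] * 500)
print(len(mixed), mixed.count("general"))  # 1000 100
```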
Cost reality check
| Scenario | Compute | Time | Approx cost (cloud) |
|---|---|---|---|
| QLoRA 7B, 1k examples | 1× A10G 24GB | ~2 hours | ~$3 |
| LoRA 7B, 10k examples | 1× A100 80GB | ~4 hours | ~$12 |
| Full FT 13B, 50k examples | 4× A100 80GB | ~24 hours | ~$300 |
| Full FT 70B, 50k examples | 8× A100 80GB | ~5 days | ~$3 000 |
Serving adds cost: LoRA adapters are typically merged back into base weights for serving (no runtime overhead), or loaded dynamically with adapter servers (adds latency, enables adapter hot-swapping per tenant).
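The "no runtime overhead" claim is easy to verify numerically: folding B × A into W once, before deployment, yields the same layer output from a single matmul, with no extra serve-time compute. A NumPy sketch at toy sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8
W = rng.standard_normal((d, d))          # frozen base weight
B = rng.standard_normal((d, r)) * 0.01   # LoRA down/up pair
A = rng.standard_normal((r, d)) * 0.01
x = rng.standard_normal(d)

adapter_out = W @ x + B @ (A @ x)   # serving with a separate adapter: two extra matmuls
W_merged = W + B @ A                # one-off merge before deployment
merged_out = W_merged @ x           # single matmul at serve time

print(np.allclose(adapter_out, merged_out))  # True
```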