AI Engineering · Topic 7 of 8

Fine-Tuning & PEFT

200 XP

Fine-tune vs prompt engineer

This is the most important decision. Prompt engineering first, always. Fine-tuning has real costs and risks — do it only when you have a concrete reason.

| Signal | Recommendation |
|---|---|
| < 100 examples | Prompt engineer with few-shot |
| 100–1 000 examples | Evaluate both; fine-tune if accuracy still insufficient |
| > 1 000 high-quality examples | Fine-tuning is worth evaluating |
| Need consistent output format | Fine-tuning wins (reduces token waste on format instructions) |
| Need domain-specific style/tone | Fine-tuning wins |
| Need new factual knowledge | Fine-tuning is unreliable — use RAG instead |
| Latency sensitive (smaller model must match larger model) | Fine-tuning is the path |

The dangerous misconception: fine-tuning teaches the model new facts. It does not — or does so unreliably. It adjusts behaviour, style, and output format. For factual grounding, use RAG.

Full fine-tuning

All model weights are updated. That requires the full model in VRAM plus gradients and optimiser state: with Adam in mixed precision, roughly 16 bytes per parameter (2 B bf16 weights + 2 B bf16 gradients + 4 B fp32 master weights + 8 B fp32 Adam moments). A 7B model is ~14 GB in fp16 for inference, but full fine-tuning needs ~112 GB before activations — at minimum 2× A100 80GB GPUs. Gradient checkpointing and 8-bit optimisers reduce this, but it remains multi-GPU territory.
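The per-parameter arithmetic above can be sketched as a small calculator. The function name and byte-count defaults are illustrative, matching the mixed-precision Adam breakdown described here; activations and framework overhead are ignored.

```python
# Rough VRAM estimate for full fine-tuning with Adam in mixed precision.
# ~16 bytes/parameter: weights + gradients + fp32 master copy + Adam moments.
def full_ft_vram_gb(n_params: float,
                    weight_bytes: int = 2,   # bf16 weights
                    grad_bytes: int = 2,     # bf16 gradients
                    master_bytes: int = 4,   # fp32 master copy of weights
                    optim_bytes: int = 8) -> float:  # Adam m and v in fp32
    per_param = weight_bytes + grad_bytes + master_bytes + optim_bytes
    return n_params * per_param / 1e9

print(full_ft_vram_gb(7e9))  # ~112 GB for a 7B model, before activations
```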

Full fine-tuning risks catastrophic forgetting: the model loses general capabilities as it overfits to the training distribution.

PEFT: Parameter-Efficient Fine-Tuning

PEFT methods update a small fraction of weights while freezing the base model. The dominant technique is LoRA.

LoRA — Low-Rank Adaptation

The key insight: weight updates during fine-tuning are low-rank. Instead of updating the full weight matrix W (shape d × d), LoRA injects two small matrices B (d × r) and A (r × d) where r << d:

W_new = W_frozen + B × A

At rank r=8 on a weight matrix of shape 4096×4096: original parameters = 16.7M, LoRA parameters = 2 × 4096 × 8 = 65 536. That is 99.6% parameter reduction.
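The parameter count above is simple arithmetic, and worth verifying directly:

```python
# Parameter count for adapting a 4096x4096 weight matrix at rank r=8.
d, r = 4096, 8
full_params = d * d             # 16,777,216 parameters in W
lora_params = d * r + r * d     # B (d x r) plus A (r x d) = 65,536
reduction = 1 - lora_params / full_params
print(lora_params, f"{reduction:.1%}")  # 65536, 99.6%
```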

from peft import get_peft_model, LoraConfig, TaskType

config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,               # rank — higher = more capacity, more params, potential overfitting
    lora_alpha=32,     # scaling factor, typically 2× to 4× r
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"]  # which weight matrices to adapt
)
model = get_peft_model(base_model, config)
model.print_trainable_parameters()
# trainable params: 4,194,304 || all params: 6,742,609,920 || trainable: 0.06%

Rank sensitivity: r=4 is often sufficient for style/format tasks. r=16–64 for complex domain adaptation. Higher rank raises the risk of overfitting on small datasets.

QLoRA — Quantised LoRA

QLoRA (Dettmers et al., 2023) loads the base model in 4-bit NF4 quantisation (reducing memory by ~4×), then trains LoRA adapters in bf16. A 7B model that normally needs 14 GB VRAM fits in ~5 GB, enabling fine-tuning on a single consumer GPU.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)
# model_id: the base model to quantise and adapt
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)

Dataset preparation for instruction tuning

Format: each example is an (instruction, input, output) triple, serialised as a chat message sequence. Common formats: Alpaca, ShareGPT, ChatML.

{"messages": [
  {"role": "system", "content": "You extract structured data from legal contracts."},
  {"role": "user",   "content": "Contract text: ... Extract party names and dates."},
  {"role": "assistant", "content": "{\"party_a\": \"Acme Corp\", \"party_b\": \"XYZ Ltd\", \"effective_date\": \"2024-03-01\"}"}
]}
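Before training, it pays to validate every line of the JSONL file against the schema above. A minimal sketch — the function name is illustrative, and the checks assume the OpenAI-style "messages" layout shown in the example:

```python
import json

VALID_ROLES = {"system", "user", "assistant"}

def validate_example(line: str) -> bool:
    """Check one JSONL line: known roles, non-empty content,
    and the example must end with the assistant turn to be trained on."""
    ex = json.loads(line)
    msgs = ex.get("messages", [])
    if not msgs or msgs[-1].get("role") != "assistant":
        return False
    return all(m.get("role") in VALID_ROLES and m.get("content") for m in msgs)

good = '{"messages": [{"role": "user", "content": "hi"}, {"role": "assistant", "content": "hello"}]}'
bad = '{"messages": [{"role": "user", "content": "hi"}]}'  # no target completion
print(validate_example(good), validate_example(bad))  # True False
```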

Quality over quantity. 500 carefully curated, diverse, correctly labelled examples beat 10 000 noisy ones. Common quality issues: output inconsistency across similar inputs, outputs that contradict the base model’s capabilities, mislabelled examples. Spend 80% of your dataset effort on cleaning.

Catastrophic forgetting

Fine-tuning on a narrow domain degrades general capabilities. Mitigations:

  • Mix a small amount (~10%) of general instruction-following data into your fine-tuning set
  • Use PEFT (LoRA) rather than full fine-tuning — the frozen base retains general knowledge
  • Evaluate on a held-out general benchmark before and after training
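The first mitigation — blending in ~10% general data — can be sketched as follows. The function and dataset names are illustrative, not from any specific library:

```python
import random

def mix_datasets(domain, general, general_frac=0.10, seed=0):
    """Blend general instruction data into a domain set so that
    general examples make up ~general_frac of the final mix."""
    n_general = round(len(domain) * general_frac / (1 - general_frac))
    rng = random.Random(seed)
    mixed = domain + rng.sample(general, min(n_general, len(general)))
    rng.shuffle(mixed)
    return mixed

domain = [f"d{i}" for i in range(900)]    # domain-specific examples
general = [f"g{i}" for i in range(500)]   # general instruction data
mixed = mix_datasets(domain, general)
print(len(mixed))  # 900 domain + 100 general = 1000
```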

Cost reality check

| Scenario | Compute | Time | Approx cost (cloud) |
|---|---|---|---|
| QLoRA 7B, 1k examples | 1× A10G 24GB | ~2 hours | ~$3 |
| LoRA 7B, 10k examples | 1× A100 80GB | ~4 hours | ~$12 |
| Full FT 13B, 50k examples | 4× A100 80GB | ~24 hours | ~$300 |
| Full FT 70B, 50k examples | 8× A100 80GB | ~5 days | ~$3 000 |
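The figures above are just GPUs × hours × hourly rate. A back-of-envelope helper, with hourly rates that are illustrative assumptions rather than quoted cloud prices:

```python
# Assumed on-demand hourly rates (illustrative, not quoted prices).
HOURLY_USD = {"A10G": 1.50, "A100-80GB": 3.00}

def training_cost(gpu: str, n_gpus: int, hours: float) -> float:
    return n_gpus * hours * HOURLY_USD[gpu]

print(training_cost("A10G", 1, 2))        # $3 — matches the QLoRA 7B row
print(training_cost("A100-80GB", 4, 24))  # $288 — close to the table's ~$300
```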

Serving adds cost: LoRA adapters are typically merged back into base weights for serving (no runtime overhead), or loaded dynamically with adapter servers (adds latency, enables adapter hot-swapping per tenant).