LLM Training Pipeline

The Short Answer

Modern LLMs are built in stages: pretraining (predict next token on internet-scale text) → supervised fine-tuning (learn to follow instructions) → alignment (learn human preferences via RLHF or DPO).

The Pipeline

Stage 1: Pretraining

  • Train on trillions of tokens (web, books, code)
  • Objective: next-token prediction (autoregressive)
  • This is where the model learns language, facts, and reasoning patterns
  • Cost: millions of dollars, weeks of GPU time
  • Output: a "base model" that can complete text but can't follow instructions well
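The next-token objective above can be sketched in a few lines: the training pair is just the sequence shifted by one position, and the loss is cross-entropy between the model's predicted distribution and the actual next token. This is a toy illustration with hand-written logits, not a real model:

```python
import math

def next_token_pairs(tokens):
    """Autoregressive setup: the model sees tokens[:i] and must
    predict tokens[i], so inputs/targets are the sequence shifted by one."""
    return tokens[:-1], tokens[1:]

def cross_entropy(logits, target_id):
    """Next-token loss: -log softmax(logits)[target_id],
    computed stably via the log-sum-exp trick."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target_id]

tokens = [5, 9, 2, 7]                     # toy token ids
inputs, targets = next_token_pairs(tokens)
print(inputs, targets)                    # → [5, 9, 2] [9, 2, 7]
```

With uniform logits over a vocabulary of size V, this loss equals log(V) — the "knows nothing" baseline that pretraining drives down.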

Stage 2: Supervised Fine-Tuning (SFT)

  • Train on curated (instruction, response) pairs
  • Teaches the model to follow instructions, answer questions, refuse harmful requests
  • Much smaller dataset (tens of thousands of examples)
  • Output: an "instruct model" that follows directions
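A minimal sketch of how an (instruction, response) pair becomes a training example. The prompt template and character-level tokenizer here are purely illustrative; the common trick shown is masking the prompt tokens (label `-100` by convention) so loss is computed only on the response:

```python
def build_sft_example(instruction, response, tokenize):
    """Concatenate prompt + response; mask prompt positions so the
    model is trained to produce the response, not to echo the prompt."""
    prompt = f"### Instruction:\n{instruction}\n### Response:\n"
    prompt_ids = tokenize(prompt)
    response_ids = tokenize(response)
    input_ids = prompt_ids + response_ids
    # -100 is the conventional "ignore this position in the loss" label
    labels = [-100] * len(prompt_ids) + response_ids
    return input_ids, labels

# hypothetical toy tokenizer: one id per character
tok = lambda s: [ord(c) for c in s]
ids, labels = build_sft_example("Add 2+2", "4", tok)
```

Whether to mask the prompt is a design choice; masking is common because it focuses the small SFT dataset entirely on response quality.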

Stage 3: Alignment

RLHF (Reinforcement Learning from Human Feedback):

  1. Collect human preference data (which of two responses is better?)
  2. Train a reward model on these preferences
  3. Use PPO to optimize the LLM against the reward model
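Step 2 above typically uses a Bradley-Terry-style objective: the reward model is trained so that the preferred response gets the higher score. A minimal sketch of that loss (scalar rewards stand in for the model's outputs):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reward_model_loss(r_chosen, r_rejected):
    """Bradley-Terry preference loss: maximize the probability
    sigmoid(r_chosen - r_rejected) that the human-preferred
    response receives the higher reward."""
    return -math.log(sigmoid(r_chosen - r_rejected))
```

When the rewards are tied the loss is log(2); it shrinks as the reward gap on the preferred response grows.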

DPO (Direct Preference Optimization):

  • Skip the reward model entirely
  • Directly optimize the LLM on preference pairs
  • Simpler, cheaper, and increasingly preferred
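The DPO loss itself fits in one function. It pushes the policy's log-probability margin between chosen and rejected responses above a frozen reference model's margin, scaled by a temperature beta — no reward model or RL loop. Scalars stand in for sequence log-probs here:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO objective: -log sigmoid(beta * margin), where margin is how
    much more the policy prefers the chosen response than the frozen
    reference model does (all arguments are log-probabilities)."""
    margin = (pi_chosen - pi_rejected) - (ref_chosen - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

At zero margin the loss is log(2), and it falls monotonically as the policy separates chosen from rejected beyond the reference model's preference.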

Key Concepts

  • Scaling laws: more compute yields predictably better performance (Kaplan et al., 2020)
  • Chinchilla optimal: for a given compute budget there is an optimal ratio of model size to training data (Hoffmann et al., 2022)
  • Emergent abilities: capabilities that appear suddenly at scale (debated; they may be metric artifacts)
  • Mixture of Experts (MoE): only a subset of parameters is activated per token, giving a larger model at similar inference cost
  • Distillation: train a small model to mimic a large one; this is how the small Qwen/Llama variants are made
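The Chinchilla rule of thumb is concrete enough to compute with. Using the common approximations that training compute is C ≈ 6·N·D FLOPs (N parameters, D tokens) and that the optimal ratio is D ≈ 20·N, the compute-optimal sizes follow directly:

```python
def chinchilla_optimal(compute_flops):
    """Compute-optimal model/data sizes under the approximations
    C ≈ 6·N·D and D ≈ 20·N (Hoffmann et al., 2022).
    Solving: N = sqrt(C / 120), D = 20·N."""
    n_params = (compute_flops / 120) ** 0.5
    n_tokens = 20 * n_params
    return n_params, n_tokens

# e.g. a 1e23 FLOP budget suggests roughly a 29B-parameter model
# trained on roughly 580B tokens
n, d = chinchilla_optimal(1e23)
```

The "20 tokens per parameter" ratio is itself an approximation; the key point is that many earlier models were undertrained relative to their size.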

What I Use

  • Ollama for running distilled models locally (Qwen 2.5 14B)
  • Anthropic API for Claude (full-scale, no local option)
  • Understanding this pipeline helps evaluate model quality claims: "fine-tuned on X" means SFT, not pretraining