LLM Training Pipeline

The Short Answer

Modern LLMs are built in stages: pretraining (predict next token on internet-scale text) → supervised fine-tuning (learn to follow instructions) → alignment (learn human preferences via RLHF or DPO).

The Pipeline

Stage 1: Pretraining

  • Train on trillions of tokens (web, books, code)
  • Objective: next-token prediction (autoregressive)
  • This is where the model learns language, facts, and reasoning patterns
  • Cost: millions of dollars, weeks of GPU time
  • Output: a "base model" that can complete text but can't follow instructions well
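The next-token objective above can be sketched in a few lines: the training pair is just the sequence shifted by one position, and the loss is cross-entropy between the model's predicted distribution and the actual next token. This is a toy illustration with hand-written logits, not a real model:

```python
import math

def next_token_pairs(tokens):
    """Autoregressive setup: the model sees tokens[:i] and must
    predict tokens[i], so inputs/targets are the sequence shifted by one."""
    return tokens[:-1], tokens[1:]

def cross_entropy(logits, target_id):
    """Next-token loss: -log softmax(logits)[target_id],
    computed stably via the log-sum-exp trick."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target_id]

tokens = [5, 9, 2, 7]                     # toy token ids
inputs, targets = next_token_pairs(tokens)
print(inputs, targets)                    # → [5, 9, 2] [9, 2, 7]
```

With uniform logits over a vocabulary of size V, this loss equals log(V) — the "knows nothing" baseline that pretraining drives down.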

Stage 2: Supervised Fine-Tuning (SFT)

  • Train on curated (instruction, response) pairs
  • Teaches the model to follow instructions, answer questions, refuse harmful requests
  • Much smaller dataset (tens of thousands of examples)
  • Output: an "instruct model" that follows directions
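A minimal sketch of how an (instruction, response) pair becomes a training example. The prompt template and character-level tokenizer here are purely illustrative; the common trick shown is masking the prompt tokens (label `-100` by convention) so loss is computed only on the response:

```python
def build_sft_example(instruction, response, tokenize):
    """Concatenate prompt + response; mask prompt positions so the
    model is trained to produce the response, not to echo the prompt."""
    prompt = f"### Instruction:\n{instruction}\n### Response:\n"
    prompt_ids = tokenize(prompt)
    response_ids = tokenize(response)
    input_ids = prompt_ids + response_ids
    # -100 is the conventional "ignore this position in the loss" label
    labels = [-100] * len(prompt_ids) + response_ids
    return input_ids, labels

# hypothetical toy tokenizer: one id per character
tok = lambda s: [ord(c) for c in s]
ids, labels = build_sft_example("Add 2+2", "4", tok)
```

Whether to mask the prompt is a design choice; masking is common because it focuses the small SFT dataset entirely on response quality.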

Stage 3: Alignment

RLHF (Reinforcement Learning from Human Feedback):

  1. Collect human preference data (which of two responses is better?)
  2. Train a reward model on these preferences
  3. Use PPO to optimize the LLM against the reward model
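Step 2 above typically uses a Bradley-Terry-style objective: the reward model is trained so that the preferred response gets the higher score. A minimal sketch of that loss (scalar rewards stand in for the model's outputs):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reward_model_loss(r_chosen, r_rejected):
    """Bradley-Terry preference loss: maximize the probability
    sigmoid(r_chosen - r_rejected) that the human-preferred
    response receives the higher reward."""
    return -math.log(sigmoid(r_chosen - r_rejected))
```

When the rewards are tied the loss is log(2); it shrinks as the reward gap on the preferred response grows.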

DPO (Direct Preference Optimization):

  • Skip the reward model entirely
  • Directly optimize the LLM on preference pairs
  • Simpler, cheaper, and increasingly preferred
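The DPO loss itself fits in one function. It pushes the policy's log-probability margin between chosen and rejected responses above a frozen reference model's margin, scaled by a temperature beta — no reward model or RL loop. Scalars stand in for sequence log-probs here:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO objective: -log sigmoid(beta * margin), where margin is how
    much more the policy prefers the chosen response than the frozen
    reference model does (all arguments are log-probabilities)."""
    margin = (pi_chosen - pi_rejected) - (ref_chosen - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

At zero margin the loss is log(2), and it falls monotonically as the policy separates chosen from rejected beyond the reference model's preference.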

Key Concepts

  • Scaling laws: more compute yields predictably better performance (Kaplan et al., 2020)
  • Chinchilla optimal: for a given compute budget there is an optimal ratio of model size to training data (Hoffmann et al., 2022)
  • Emergent abilities: capabilities that appear suddenly at scale (debated; they may be metric artifacts)
  • Mixture of Experts (MoE): only a subset of parameters is activated per token, giving a larger model at similar inference cost
  • Distillation: train a small model to mimic a large one; this is how the small Qwen/Llama variants are made
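The Chinchilla rule of thumb is concrete enough to compute with. Using the common approximations that training compute is C ≈ 6·N·D FLOPs (N parameters, D tokens) and that the optimal ratio is D ≈ 20·N, the compute-optimal sizes follow directly:

```python
def chinchilla_optimal(compute_flops):
    """Compute-optimal model/data sizes under the approximations
    C ≈ 6·N·D and D ≈ 20·N (Hoffmann et al., 2022).
    Solving: N = sqrt(C / 120), D = 20·N."""
    n_params = (compute_flops / 120) ** 0.5
    n_tokens = 20 * n_params
    return n_params, n_tokens

# e.g. a 1e23 FLOP budget suggests roughly a 29B-parameter model
# trained on roughly 580B tokens
n, d = chinchilla_optimal(1e23)
```

The "20 tokens per parameter" ratio is itself an approximation; the key point is that many earlier models were undertrained relative to their size.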

What I Use

  • Ollama for running distilled models locally (Qwen 2.5 14B)
  • Anthropic API for Claude (full-scale, no local option)
  • Understanding this pipeline helps evaluate model quality claims: "fine-tuned on X" means SFT, not pretraining