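Scaling and Distillation
Scaling Laws
The key insight from Kaplan et al. (2020): test loss falls as a power law in compute, dataset size, and parameter count, as long as none of the three is the bottleneck. Doubling compute therefore multiplies the loss by a roughly constant factor, so the improvement is predictable in advance.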
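To make "predictable improvement" concrete, here is a minimal sketch of a power law in compute. The exponent `ALPHA` and the constant `C_C` are placeholder values chosen for illustration, not fitted constants from the paper, and the function names are hypothetical.

```python
# Illustrative power law: loss(C) = (C_c / C) ** alpha.
# ALPHA and C_C are placeholders for illustration, not fitted values.
ALPHA = 0.05
C_C = 1.0


def loss(compute: float, alpha: float = ALPHA, c_c: float = C_C) -> float:
    """Power-law loss as a function of training compute (arbitrary units)."""
    return (c_c / compute) ** alpha


def doubling_ratio(alpha: float = ALPHA) -> float:
    """Factor the loss is multiplied by each time compute doubles."""
    return 2.0 ** (-alpha)


if __name__ == "__main__":
    for c in (1e3, 2e3, 4e3, 8e3):
        print(f"compute={c:.0e}  loss={loss(c):.4f}")
    print(f"loss ratio per doubling: {doubling_ratio():.4f}")
```

The ratio between consecutive rows is the same at every scale, which is what makes extrapolation possible. With the placeholder exponent of 0.05, each doubling trims the loss by roughly 3.4%.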
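Chinchilla Scaling (Hoffmann et al., 2022)
The previous approach (GPT-3 era) was to make models as big as possible. Chinchilla showed that, for a fixed compute budget, a smaller model trained on more data does better: parameters and training tokens should grow in roughly equal proportion, which works out to about 20 tokens per parameter.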
| Model | Parameters | Training Tokens | Approach |
|---|---|---|---|
| GPT-3 | 175B | 300B | Over-parameterized |
| Chinchilla | 70B | 1.4T | Compute-optimal |
| Llama 3 | 70B | 15T | Over-trained (inference-optimal) |
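Current trend: over-train smaller models well past the Chinchilla-optimal token count. Llama 3 70B saw roughly 214 tokens per parameter, against 20 for Chinchilla. The extra training cost is paid once, while the smaller model is cheaper on every inference request.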
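The three regimes in the table can be compared with a little arithmetic. The sketch below assumes two common approximations that are not stated in the table itself: training compute of about `6 * N * D` FLOPs for `N` parameters and `D` tokens, and a compute-optimal ratio of about 20 tokens per parameter. Function and constant names are hypothetical.

```python
import math

# Assumption: widely cited rule of thumb for the compute-optimal ratio.
TOKENS_PER_PARAM = 20.0


def train_flops(params: float, tokens: float) -> float:
    """Approximate training compute using the C ~= 6 * N * D estimate."""
    return 6.0 * params * tokens


def compute_optimal(flops: float, ratio: float = TOKENS_PER_PARAM):
    """Split a FLOP budget into (params, tokens), assuming tokens = ratio * params."""
    params = math.sqrt(flops / (6.0 * ratio))
    return params, ratio * params


# (parameters, training tokens) taken from the table above.
MODELS = {
    "GPT-3": (175e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
    "Llama 3": (70e9, 15e12),
}

if __name__ == "__main__":
    for name, (n, d) in MODELS.items():
        budget = train_flops(n, d)
        n_opt, d_opt = compute_optimal(budget)
        print(
            f"{name}: {d / n:.1f} tokens/param, ~{budget:.2e} FLOPs; "
            f"compute-optimal split at that budget: "
            f"{n_opt / 1e9:.0f}B params, {d_opt / 1e12:.2f}T tokens"
        )
```

Under these assumptions, GPT-3 sits far below 20 tokens per parameter, Chinchilla sits on it, and Llama 3 sits far above it. The last case is the deliberate trade: spend more on training to get a model that is cheaper to serve.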
Distillation
Compress knowledge from a large "teacher" model into a smaller "student" model.
How It Works
- Run teacher model on a large dataset, collect output distributions
- Train student to match teacher's output distributions (not just hard labels)
- Student learns the teacher's "dark knowledge" — which wrong answers are almost right
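The steps above can be sketched as a loss function. This is a minimal, framework-free version of the classic soft-target objective (KL divergence between temperature-softened teacher and student distributions, scaled by the squared temperature). The logits, class labels, and temperature are made-up illustration values, and real training would compute this per token inside an autodiff framework.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    log_p = log_softmax(teacher_logits, temperature)  # teacher
    log_q = log_softmax(student_logits, temperature)  # student
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return kl * temperature ** 2


if __name__ == "__main__":
    # Hypothetical 3-class example: classes are ("cat", "dog", "car").
    teacher = [5.0, 3.5, -2.0]   # "dog" is wrong but almost right
    student = [2.0, 0.0, 0.5]    # student has not learned that yet
    for t in (1.0, 4.0):
        probs = [round(math.exp(lp), 3) for lp in log_softmax(teacher, t)]
        print(f"T={t}: teacher probs={probs}")
    print("loss:", distillation_loss(student, teacher))
```

Raising the temperature flattens the teacher distribution, so the "almost right" class carries more of the signal than it would in a one-hot label. The loss is zero only when the student reproduces the teacher's softened distribution exactly.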
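Why It Matters
- Qwen 2.5 14B performs well for its size, plausibly in part because its training data includes outputs from larger Qwen models
- DeepSeek-R1 distilled reasoning chains into smaller models by fine-tuning Qwen and Llama checkpoints on R1-generated reasoning traces
- Distillation is a large part of how capable local models end up fitting on consumer hardware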
Quantization
Reduce model precision to save memory and speed up inference.
| Format | Bits per weight | Approx. quality vs FP16 | Use Case |
|---|---|---|---|
| FP16 | 16 | Baseline | Training, high-end serving |
| INT8 | 8 | ~99% | Production serving |
| INT4 (GPTQ/AWQ) | 4 | ~95% | Consumer GPUs |
| GGUF Q4_K_M | ~4.8 | ~96% | CPU/Mac inference via llama.cpp |
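As a rough illustration of what "reducing precision" means, the sketch below applies the simplest scheme, symmetric absmax rounding with one scale per tensor. It is not what GPTQ, AWQ, or the GGUF K-quants actually do (those use per-group scales and other refinements); the weights and function names are hypothetical.

```python
def quantize_absmax(weights, bits=8):
    """Map floats to signed integers using a single absmax scale."""
    qmax = 2 ** (bits - 1) - 1           # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the integers and the scale."""
    return [v * scale for v in q]


def max_error(weights, bits):
    """Largest absolute round-trip error at a given bit width."""
    q, scale = quantize_absmax(weights, bits)
    restored = dequantize(q, scale)
    return max(abs(w - r) for w, r in zip(weights, restored))


if __name__ == "__main__":
    weights = [0.12, -0.5, 0.33, 0.0, 0.49]  # made-up values
    for bits in (8, 4):
        q, scale = quantize_absmax(weights, bits)
        print(f"{bits}-bit: ints={q} scale={scale:.5f} "
              f"max error={max_error(weights, bits):.5f}")
```

Fewer bits means a coarser grid and a larger round-trip error, which is the source of the quality loss in the table. Production formats claw much of that back by choosing scales per small group of weights rather than per tensor.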
My setup: Qwen 2.5 14B Q4_K_M via Ollama on M-series Mac — 9GB, fast enough for issue screening.
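The 9GB figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch assumes a parameter count of about 14.7B for Qwen 2.5 14B and counts weights only, ignoring the KV cache and runtime overhead. Function names are hypothetical.

```python
# Assumption: approximate parameter count for Qwen 2.5 14B.
PARAMS = 14.7e9


def weight_memory_gb(params: float, bits_per_weight: float) -> float:
    """Memory needed for the weights alone, in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


def effective_bits(file_size_gb: float, params: float) -> float:
    """Average bits per weight implied by a model file of the given size."""
    return file_size_gb * 1e9 * 8 / params


if __name__ == "__main__":
    for name, bits in (("FP16", 16), ("INT8", 8), ("INT4", 4)):
        print(f"{name}: {weight_memory_gb(PARAMS, bits):.1f} GB")
    print(f"9 GB file -> {effective_bits(9.0, PARAMS):.1f} bits/weight")
```

Under these assumptions the FP16 weights alone would need close to 30GB, which is why a 4-bit quantization is what makes a 14B model practical on a laptop. The implied average lands somewhat above 4 bits because the file mixes precisions and carries some metadata.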