Local Inference¶

Why Run Models Locally¶

Cost: Zero per-token cost after download
Privacy: Data never leaves your machine
Latency: No network round-trip for small models
Offline: Works without internet
Experimentation: Try different models freely

The Stack¶

┌─────────────────────┐
│   Application       │  (your script, CLI tool)
├─────────────────────┤
│   Ollama / LM Studio│  (model management + API)
├─────────────────────┤
│   llama.cpp         │  (inference engine)
├─────────────────────┤
│   GGUF model file   │  (quantized weights)
├─────────────────────┤
│   Hardware           │  (CPU / GPU / Apple Silicon)
└─────────────────────┘

Ollama¶

The easiest way to run models locally.

# Install
brew install ollama

# Run a model
ollama run qwen2.5:14b

# List installed models
ollama list

# Use in scripts (REST API)
curl http://localhost:11434/api/generate -d '{"model": "qwen2.5:14b", "prompt": "hello"}'

# Or via subprocess
ollama run qwen2.5:14b --nowordwrap <<< "your prompt here"

Model Selection Guide¶

Model	Size	Good For	Notes
qwen2.5:7b	4.7GB	Quick tasks, classification	Fast, fits on any Mac
qwen2.5:14b	9GB	Issue screening, code review	My daily driver
qwen2.5:32b	20GB	Complex reasoning	Needs 32GB+ RAM
llama3.1:8b	4.7GB	General chat, summarization	Meta's standard
codellama:13b	7.4GB	Code-specific tasks	Specialized for code
deepseek-coder-v2:16b	8.9GB	Code generation	Strong on code

Performance on Apple Silicon¶

Chip	RAM	Practical Max Model	Tokens/sec (14B)
M1	16GB	14B Q4	~15
M1 Pro	32GB	32B Q4	~20
M2 Pro	32GB	32B Q4	~25
M3 Max	64GB	70B Q4	~20

llama.cpp¶

The engine under Ollama. Use directly when you need: - Custom quantization - Specific sampling parameters - Server mode with OpenAI-compatible API - Batch processing

# Build from source
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp && make -j

# Run with specific parameters
./llama-cli -m model.gguf -p "prompt" --temp 0.0 -n 256

My Setup¶

Hardware: Apple Silicon Mac
Runtime: Ollama
Daily model: qwen2.5:14b (issue screening, code review triage)
Use case: Pre-screening GitHub issues before spending Claude tokens on detailed analysis