Serving at Scale¶

When Local Isn't Enough¶

Local inference works for personal use. For production or team use, you need: - Concurrent requests: Multiple users hitting the same model - High throughput: Batch processing thousands of requests - Low latency: Sub-second response times - GPU efficiency: Maximizing expensive hardware utilization

Key Concepts¶

Continuous Batching¶

Instead of processing one request at a time, batch multiple requests and process them together. As requests finish, new ones slot in immediately.

KV Cache Management¶

During generation, attention key-value pairs are cached. This cache grows with sequence length and batch size — it's often the memory bottleneck.

PagedAttention (vLLM): Manages KV cache like OS virtual memory — pages of cache can be non-contiguous, reducing fragmentation from ~60% to ~4%.

Speculative Decoding¶

Use a small "draft" model to generate candidate tokens, then verify with the large model in parallel. Speeds up inference 2-3x when the draft model is accurate.

Tools Compared¶

Tool	Key Innovation	Best For
vLLM	PagedAttention	High-throughput serving, production
TGI	Continuous batching + HF integration	HF ecosystem, easy deployment
SGLang	RadixAttention + structured output	Structured generation, multi-turn
TensorRT-LLM	NVIDIA kernel optimization	Maximum perf on NVIDIA GPUs

API Providers¶

When you don't want to manage infrastructure:

Provider	Models	Pricing Model	Notes
Anthropic	Claude family	Per-token	Best for complex reasoning
OpenAI	GPT family	Per-token	Broadest ecosystem
OpenRouter	100+ models	Per-token, routing	Unified API, model switching
Together AI	Open models	Per-token	Good for fine-tuned open models
Fireworks AI	Open models	Per-token	Fast inference, competitive pricing