| Term | Definition |
|------|------------|
| ANN | Approximate Nearest Neighbor — find similar vectors without exact search |
| Agent | AI system that autonomously decides what actions to take |
| Alignment | Training models to follow human preferences (RLHF, DPO) |
| Chunking | Splitting documents into retrievable units for RAG |
| Context window | Maximum tokens a model can process at once |
| Distillation | Training a small model to mimic a large one |
| DPO | Direct Preference Optimization — alignment without a reward model |
| Embedding | Dense vector representation of text/images for similarity search |
| Fine-tuning | Adapting a pretrained model to a specific task or domain |
| Function calling | Model outputs structured tool invocations for the host to execute |
| GGUF | File format for quantized models, used by llama.cpp and Ollama |
| HNSW | Hierarchical Navigable Small World — dominant ANN index algorithm |
| KV cache | Cached attention key-value pairs during autoregressive generation |
| LoRA | Low-Rank Adaptation — efficient fine-tuning by adding small trainable matrices |
| MCP | Model Context Protocol — open standard for AI tool integration |
| MoE | Mixture of Experts — activate subset of parameters per token |
| PagedAttention | vLLM's technique for efficient KV cache memory management |
| Pretraining | Initial training on large text corpus (next-token prediction) |
| Quantization | Reducing model precision (FP16→INT8→INT4) to save memory |
| RAG | Retrieval-Augmented Generation — augment prompts with retrieved documents |
| ReAct | Reason + Act pattern for agentic AI |
| RLHF | Reinforcement Learning from Human Feedback |
| SFT | Supervised Fine-Tuning on instruction-response pairs |
| Structured output | Constraining model to output valid JSON matching a schema |
| Tool use | Pattern where model outputs structured calls, host executes |
| Transformer | Neural network architecture based on self-attention (Vaswani et al., 2017) |
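To make the quantization entry concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization — the idea behind the FP16→INT8 step above. It is an illustration of the technique, not the implementation used by any particular library; the function names and the toy weight values are made up for the example.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one per-tensor scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from the INT8 values."""
    return [v * scale for v in q]

# Toy example: 4 weights stored in 1 byte each instead of 2 (FP16).
weights = [0.52, -1.3, 0.07, 0.9]
q, scale = quantize_int8(weights)
approx = dequantize_int8(q, scale)

# Each dequantized value is within one quantization step of the original.
assert all(abs(a - w) <= scale for a, w in zip(approx, weights))
```

The memory saving comes from storing one small integer per weight plus a single shared scale; finer-grained schemes (per-channel or per-group scales, as in many GGUF quant formats) trade a little extra metadata for lower error.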