
Glossary

| Term | Definition |
|------|------------|
| ANN | Approximate Nearest Neighbor — finds similar vectors without exhaustive exact search |
| Agent | AI system that autonomously decides which actions to take |
| Alignment | Training models to follow human preferences (e.g., RLHF, DPO) |
| Chunking | Splitting documents into retrievable units for RAG |
| Context window | Maximum number of tokens a model can process at once |
| Distillation | Training a small model to mimic a larger one |
| DPO | Direct Preference Optimization — alignment without a separate reward model |
| Embedding | Dense vector representation of text or images, used for similarity search |
| Fine-tuning | Adapting a pretrained model to a specific task or domain |
| Function calling | The model outputs structured tool invocations for the host to execute |
| GGUF | File format for quantized models, used by llama.cpp and Ollama |
| HNSW | Hierarchical Navigable Small World — the dominant ANN index algorithm |
| KV cache | Cached attention key-value pairs reused during autoregressive generation |
| LoRA | Low-Rank Adaptation — efficient fine-tuning that adds small trainable matrices |
| MCP | Model Context Protocol — open standard for AI tool integration |
| MoE | Mixture of Experts — activates only a subset of parameters per token |
| PagedAttention | vLLM's technique for efficient KV cache memory management |
| Pretraining | Initial training on a large text corpus via next-token prediction |
| Quantization | Reducing model precision (FP16 → INT8 → INT4) to save memory |
| RAG | Retrieval-Augmented Generation — augmenting prompts with retrieved documents |
| ReAct | Reason + Act — an interleaved reasoning-and-action pattern for agentic AI |
| RLHF | Reinforcement Learning from Human Feedback |
| SFT | Supervised Fine-Tuning on instruction-response pairs |
| Structured output | Constraining the model to emit valid JSON matching a schema |
| Tool use | Pattern where the model outputs structured calls and the host executes them |
| Transformer | Neural network architecture based on self-attention (Vaswani et al., 2017) |
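Several of the retrieval terms above (embedding, ANN, HNSW, RAG) rest on one operation: comparing dense vectors by similarity. A minimal sketch in plain Python, with invented toy vectors — real systems use learned embeddings and an ANN index such as HNSW rather than this brute-force scan:

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings"; real ones have hundreds of dimensions.
query = [0.9, 0.1, 0.0]
docs = {"cats": [0.8, 0.2, 0.1], "stocks": [0.1, 0.0, 0.95]}

# Exact nearest neighbor: score every document (what ANN approximates at scale).
best = max(docs, key=lambda name: cosine(query, docs[name]))
# best == "cats"
```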
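The quantization entry can be made concrete with a toy symmetric INT8 round trip; the single-scale scheme below is a simplified illustration, not any particular library's implementation:

```python
def quantize_int8(weights):
    """Map floats to int8 range [-127, 127] with one shared scale factor."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats; the error per weight is at most scale / 2."""
    return [x * scale for x in q]

weights = [0.12, -0.5, 0.33, 1.0]   # toy FP weights, invented for illustration
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
```

Each weight now costs 1 byte instead of 2 (FP16) or 4 (FP32), at the price of a small, bounded rounding error — the trade formats like GGUF are built around.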
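Function calling, tool use, and structured output describe the same contract: the model emits structured JSON, and the host parses, validates, and executes it. A minimal sketch with a hypothetical `get_weather` tool (the tool name, arguments, and payload are invented for illustration):

```python
import json

# What a model might emit under function calling: a structured tool invocation.
raw = '{"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}'

# Host-side tool registry; the model never executes code itself.
TOOLS = {
    "get_weather": lambda city, unit="celsius": f"18°{unit[0].upper()} in {city}",
}

call = json.loads(raw)                 # parse the structured output
fn = TOOLS[call["name"]]               # look up the requested tool by name
result = fn(**call["arguments"])       # host executes and returns the result
```

In practice the host also validates the arguments against the tool's JSON schema before executing, then feeds `result` back to the model for the next turn.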