| Term | Definition |
|------|------------|
| ANN | Approximate Nearest Neighbor — find similar vectors without exact search |
| Agent | AI system that autonomously decides what actions to take |
| Alignment | Training models to follow human preferences (RLHF, DPO) |
| Chunking | Splitting documents into retrievable units for RAG |
| Context window | Maximum tokens a model can process at once |
| Distillation | Training a small model to mimic a large one |
| DPO | Direct Preference Optimization — alignment without a reward model |
| Embedding | Dense vector representation of text/images for similarity search |
| Fine-tuning | Adapting a pretrained model to a specific task or domain |
| Function calling | Model outputs structured tool invocations for the host to execute |
| GGUF | File format for quantized models, used by llama.cpp and Ollama |
| HNSW | Hierarchical Navigable Small World — dominant ANN index algorithm |
| KV cache | Cached attention key-value pairs during autoregressive generation |
| LoRA | Low-Rank Adaptation — efficient fine-tuning by adding small trainable matrices |
| MCP | Model Context Protocol — open standard for AI tool integration |
| MoE | Mixture of Experts — activate subset of parameters per token |
| PagedAttention | vLLM's technique for efficient KV cache memory management |
| Pretraining | Initial training on large text corpus (next-token prediction) |
| Quantization | Reducing model precision (FP16→INT8→INT4) to save memory |
| RAG | Retrieval-Augmented Generation — augment prompts with retrieved documents |
| ReAct | Reason + Act pattern for agentic AI |
| RLHF | Reinforcement Learning from Human Feedback |
| SFT | Supervised Fine-Tuning on instruction-response pairs |
| Structured output | Constraining model to output valid JSON matching a schema |
| Tool use | Pattern where model outputs structured calls, host executes |
| Transformer | Neural network architecture based on self-attention (Vaswani et al., 2017) |
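To make the quantization entry concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization — the idea behind the FP16→INT8 step above. It is an illustration of the technique, not the implementation used by any particular library; the function names and the toy weight values are made up for the example.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one per-tensor scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from the INT8 values."""
    return [v * scale for v in q]

# Toy example: 4 weights stored in 1 byte each instead of 2 (FP16).
weights = [0.52, -1.3, 0.07, 0.9]
q, scale = quantize_int8(weights)
approx = dequantize_int8(q, scale)

# Each dequantized value is within one quantization step of the original.
assert all(abs(a - w) <= scale for a, w in zip(approx, weights))
```

The memory saving comes from storing one small integer per weight plus a single shared scale; finer-grained schemes (per-channel or per-group scales, as in many GGUF quant formats) trade a little extra metadata for lower error.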