
RAG Architectures

What Is RAG?

Retrieval-Augmented Generation: instead of relying only on the model's training data, retrieve relevant documents at query time and include them in the prompt.

User query
    ↓
Retrieve relevant chunks from a knowledge base
    ↓
Stuff retrieved chunks into prompt context
    ↓
Model generates answer grounded in retrieved text
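
The "stuff retrieved chunks into prompt context" step can be sketched as plain string assembly. This is a minimal illustration, not a prescribed template: the function name `build_prompt`, the numbered `[i]` labels, and the instruction wording are all assumptions made for the example.

```python
def build_prompt(query: str, chunks: list[str]) -> str:
    """Place numbered chunks ahead of the question so the answer can refer to them."""
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


prompt = build_prompt(
    "What is RAG?",
    ["RAG retrieves documents.", "It adds them to the prompt."],
)
print(prompt)
```

The resulting string is what gets sent to the model; everything before this step (retrieval) only decides which chunks are passed in.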

Why RAG?

  • Freshness: Knowledge base can be updated without retraining
  • Grounding: Reduces hallucination by providing source text
  • Domain specificity: Add your own docs, papers, codebase
  • Cost: Cheaper than fine-tuning for most use cases

Architecture Variants

Naive RAG

Query → Embed → Vector search → Top-K chunks → LLM → Answer

Simple, works surprisingly well for many cases. Fails when:

  • Query and answer use different vocabulary
  • Answer requires synthesizing across multiple documents
  • Chunks lose context from surrounding text
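
The embed, search, top-K path can be sketched without any external service. In this toy version the `embed` function is a stand-in that counts words, which is an assumption made so the example runs on its own; a real pipeline would call an embedding model and a vector store instead.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Rank chunks by similarity to the query and keep the best k."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


chunks = [
    "the cat sat on the mat",
    "vector search finds similar text",
    "graphs capture relationships between entities",
]
print(top_k("vector search", chunks, k=1))
```

Because the toy embedding only matches identical words, it exaggerates the vocabulary-mismatch failure listed above; learned embeddings soften that problem but do not remove it.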

Graph RAG (LightRAG, Microsoft GraphRAG)

Documents → Extract entities + relationships → Build knowledge graph
Query → Graph traversal + vector search → Subgraph context → LLM → Answer

LightRAG (HKUDS): Lightweight graph-based RAG. It extracts entities and relations, then stores them in a graph plus a vector index (for example, PostgreSQL with Apache AGE for the graph and Qdrant for the vectors). Dual retrieval: graph traversal for structural queries, vector search for semantic queries.

Key insight: Graphs capture relationships between concepts that vector similarity misses.
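
A toy illustration of that insight: once facts are stored as a graph, a query entity can reach related facts that share no wording with it. The triples, `build_graph`, and `subgraph_facts` below are hypothetical names and data for this sketch only; they are not LightRAG's or GraphRAG's actual API, and a real system would extract the triples with an LLM.

```python
from collections import defaultdict, deque

# Hypothetical (subject, relation, object) triples, as an extraction step might emit.
triples = [
    ("LightRAG", "developed_by", "HKUDS"),
    ("LightRAG", "uses", "knowledge graph"),
    ("knowledge graph", "stores", "entities"),
    ("GraphRAG", "developed_by", "Microsoft"),
]


def build_graph(triples):
    """Index each fact under both of its endpoints so traversal works in either direction."""
    graph = defaultdict(list)
    for subj, rel, obj in triples:
        fact = f"{subj} {rel} {obj}"
        graph[subj].append((fact, obj))
        graph[obj].append((fact, subj))
    return graph


def subgraph_facts(graph, start, max_hops=2):
    """Breadth-first traversal collecting facts on edges within max_hops of start."""
    seen = {start}
    queue = deque([(start, 0)])
    facts = []
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand nodes at the hop limit
        for fact, neighbor in graph.get(node, []):
            if fact not in facts:
                facts.append(fact)
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return facts


graph = build_graph(triples)
facts = subgraph_facts(graph, "HKUDS", max_hops=2)
print(facts)
```

Starting from "HKUDS", the traversal also surfaces the fact that LightRAG uses a knowledge graph, even though that sentence never mentions HKUDS. The collected facts would then be passed to the LLM as subgraph context.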

Agentic RAG

Query → Agent decides retrieval strategy → Multiple retrieval calls →
Agent synthesizes → May retrieve more → Answer

The agent decides how to retrieve, not just what. Can:

  • Reformulate queries
  • Retrieve from multiple sources
  • Verify answers against sources
  • Iterate until satisfied
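
The control flow of such a loop can be sketched with the LLM decisions replaced by simple rules. Everything here is an assumption for illustration: the `SOURCES` data, the keyword-overlap `retrieve`, the caller-supplied `rewrites` (a real agent would generate reformulations itself), and the "any evidence found" stopping rule standing in for the agent's judgment.

```python
# Hypothetical knowledge sources; a real agent would call search tools or APIs.
SOURCES = {
    "docs": ["LightRAG stores entities in a graph."],
    "wiki": ["BM25 is a keyword ranking function."],
}


def retrieve(source: str, query: str) -> list[str]:
    """Return passages from one source that share at least one word with the query."""
    terms = set(query.lower().split())
    return [
        passage
        for passage in SOURCES[source]
        if terms & set(passage.lower().rstrip(".").split())
    ]


def agentic_retrieve(query: str, rewrites: list[str], max_steps: int = 4) -> list[str]:
    """Try the query, then each rewrite, across all sources until evidence turns up."""
    evidence: list[str] = []
    for candidate in [query, *rewrites][:max_steps]:
        for source in SOURCES:
            for passage in retrieve(source, candidate):
                if passage not in evidence:
                    evidence.append(passage)
        if evidence:  # stand-in for the agent deciding it has enough to answer
            break
    return evidence


print(agentic_retrieve("lexical scoring", ["keyword ranking"]))
```

The first query finds nothing, the reformulated query hits the "wiki" source, and the loop stops. The `max_steps` cap matters in practice because an agent that iterates "until satisfied" needs a bound on how long it may keep retrieving.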

Hybrid Search

Combine vector (semantic) search with keyword (BM25) search:

Method         | Finds                | Misses
Vector search  | Semantically similar | Exact terms, rare words
Keyword search | Exact matches        | Paraphrased content
Hybrid         | Both                 | Less than either alone
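
One common way to merge the two result lists is reciprocal rank fusion (RRF). The notes above do not say which fusion method is used, so treat this as one possible choice; the document IDs are made up, and `k = 60` is a conventional default, not a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Score each document by the sum of 1 / (k + rank) over every list it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical result lists, best match first.
vector_hits = ["doc_a", "doc_b", "doc_c"]
keyword_hits = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

RRF only uses rank positions, so it sidesteps the problem that BM25 scores and cosine similarities live on different scales. Documents found by both methods rise to the top, while documents found by only one are still kept.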

Chunking Strategies

How you split documents into retrievable units matters enormously:

Strategy     | How                                               | Best For
Fixed-size   | Split every N tokens                              | Simple, predictable
Sentence     | Split on sentence boundaries                      | Preserving meaning
Semantic     | Split when topic changes                          | Long documents
Recursive    | Split by headers, then paragraphs, then sentences | Structured documents
Parent-child | Retrieve child chunks, include parent for context | Maintaining context
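
Two of these strategies are small enough to sketch directly. The version below makes several simplifying assumptions: whitespace-separated words stand in for tokens, sentences are split naively on ". ", and the optional `overlap` parameter is an addition that the table does not mention. The function names are hypothetical.

```python
def fixed_size_chunks(text: str, size: int, overlap: int = 0) -> list[str]:
    """Split into windows of `size` words; consecutive windows share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    # Stop early enough that the last window is not fully contained in the previous one.
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, last_start, step)]


def parent_child_index(sections: list[str]) -> dict[str, str]:
    """Map each sentence (child) to the full section (parent) it came from."""
    index: dict[str, str] = {}
    for section in sections:
        for sentence in section.split(". "):
            sentence = sentence.strip().rstrip(".")
            if sentence:
                index[sentence] = section
    return index


print(fixed_size_chunks("a b c d e f g", size=3))
print(fixed_size_chunks("a b c d e f g", size=3, overlap=1))

index = parent_child_index(["Cats purr. Cats nap.", "Dogs bark."])
print(index["Cats nap"])
```

With parent-child chunking, the small child is what gets embedded and matched, while the parent is what goes into the prompt, so the match stays precise without losing the surrounding context.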

My Experience

  • Used LightRAG (investigated for OSS contribution) — graph extraction is powerful but adds complexity
  • Ollama + manual prompting for simple retrieval tasks
  • For code: tool-native search (Cursor's embedding-based codebase index, Claude Code's Grep-style text search) works better than document RAG