RAG From First Principles: Architecture of a Private Retrieval Pipeline

Why RAG instead of fine-tuning?

Fine-tuning teaches a model behavior; retrieval gives it knowledge at answer time. When the answer lives in 40,000 PDFs that change weekly, you don't retrain — you retrieve. Production systems usually combine both: a small model fine-tuned for tone and format, fed by a retrieval pipeline over your documents.

The five stages of a private RAG pipeline

1. Ingestion & chunking. Split documents into 300–800-token chunks along structural boundaries (headings, paragraphs), and prefix each chunk with its document and section titles as context. Bad chunking is the #1 cause of bad RAG.

2. Embedding. Each chunk becomes a vector with an on-prem model like nomic-embed-text or BGE-M3 — never a cloud embedding API, or your documents just left the building one paragraph at a time.

3. Hybrid retrieval. Vector similarity finds meaning; BM25 keyword search finds exact part numbers and clause references. Fuse both (reciprocal rank fusion) — German compound nouns make pure vector search miss precise terms surprisingly often.

4. Re-ranking. A small cross-encoder re-scores the merged candidates (up to 30 from each retriever) and keeps the best 5. This single stage typically adds more answer quality than any model upgrade.

5. Grounded generation. The LLM answers only from supplied chunks, citing them, with an explicit "not in the documents" escape hatch.
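
As a rough illustration of stage 1, the sketch below packs paragraphs into chunks under a token budget and prefixes each chunk with the document title. It is a minimal sketch rather than a production chunker: chunk_document and count_tokens are hypothetical names, and token counts are approximated by whitespace-separated words instead of a real tokenizer.

```python
def count_tokens(text: str) -> int:
    # Assumption: approximate tokens by whitespace-separated words.
    # Swap in the tokenizer of your embedding model for real counts.
    return len(text.split())


def chunk_document(title: str, text: str, max_tokens: int = 800) -> list[str]:
    """Pack paragraphs into chunks of at most max_tokens, each prefixed with the title."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0

    for para in paragraphs:
        n = count_tokens(para)
        # Close the current chunk when this paragraph would exceed the budget.
        if current and current_len + n > max_tokens:
            chunks.append(f"{title}\n\n" + "\n\n".join(current))
            current, current_len = [], 0
        # Note: a single paragraph longer than max_tokens is kept whole
        # and ends up as its own oversized chunk.
        current.append(para)
        current_len += n

    if current:
        chunks.append(f"{title}\n\n" + "\n\n".join(current))
    return chunks
```

Splitting only on blank lines keeps the example short; a real ingester would also respect headings and handle paragraphs that exceed the budget on their own.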

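def answer(question: str) -> dict:
    """Retrieve, re-rank, and answer from the supplied sources only.

    embed, vector_search, bm25_search, dedupe, rerank, format_sources,
    and llm stand in for your own components.
    """
    q_vec = embed(question)                         # local embedding model
    cands = vector_search(q_vec, k=30)              # pgvector / Qdrant
    cands = cands + bm25_search(question, k=30)     # exact terms (part numbers, clauses)
    top   = rerank(question, dedupe(cands))[:5]     # cross-encoder keeps the best 5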

    prompt = f"""Answer ONLY from the sources below. Cite as [n].
If the sources don't contain the answer, say so.

{format_sources(top)}

Question: {question}"""
    return {"answer": llm(prompt), "sources": top}
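
The function above simply concatenates both result lists and leaves the ordering to the re-ranker. If you want the reciprocal rank fusion named in stage 3 as an explicit step, one way to sketch it is shown below. It assumes each retriever returns an ordered list of chunk IDs; the function name and the constant k=60 (a commonly used default) are choices made for this illustration, not part of the pipeline above.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk IDs into a single ranking.

    Each list contributes 1 / (k + rank) to a chunk's score, with rank
    starting at 1, so chunks ranked well by several retrievers rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties keep first-seen order (sorted is stable).
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)


# Hypothetical chunk IDs, for illustration only.
vector_hits = ["a", "b", "c"]
bm25_hits = ["c", "a", "d"]
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

Because the fusion uses ranks rather than raw scores, it needs no calibration between cosine similarity and BM25 values, which live on different scales.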

What we measure before go-live

We measure three things: retrieval recall on a golden question set (target: correct chunk in the top 5 for at least 90% of questions), groundedness (no claims outside the sources), and refusal correctness (the system must say "not documented" when the answer is not in the documents). A RAG system that confidently invents clause numbers is worse than no system, so evaluation is not optional.
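
The recall target can be checked with a few lines of code. The sketch below assumes the golden set is a list of (question, expected chunk ID) pairs and that retrieve is any callable returning ranked chunk IDs; both shapes, and the name recall_at_k, are assumptions made for this example.

```python
from typing import Callable


def recall_at_k(
    golden: list[tuple[str, str]],
    retrieve: Callable[[str], list[str]],
    k: int = 5,
) -> float:
    """Fraction of golden questions whose expected chunk appears in the top k results."""
    if not golden:
        raise ValueError("golden set is empty")
    hits = 0
    for question, expected_chunk_id in golden:
        # Only the first k retrieved IDs count toward a hit.
        if expected_chunk_id in retrieve(question)[:k]:
            hits += 1
    return hits / len(golden)
```

Comparing the returned value against 0.90 gives a simple go/no-go check for the retrieval stage; groundedness and refusal correctness need their own labeled examples.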

Want this running inside your own VPN?

Localized AI fine-tunes small open models on your data and deploys them on your hardware — GDPR by architecture, zero per-token costs. Average setup: 72 hours.
