Embeddings Explained: Building Semantic Search Over 100,000 Documents

What an embedding encodes

An embedding model maps text to a point in high-dimensional space (768 or 1024 numbers) where distance means meaning. "Kündigungsfrist" and "notice period" land close together; "delivery delay" lands far away. The model learned this geometry from hundreds of millions of text pairs — you inherit it for free.

Why cosine similarity works

Embedding vectors are normalized to length 1, so the cosine of the angle between them is just the dot product — one multiply-accumulate per dimension. Similar meaning → small angle → cosine near 1. The entire "semantic" part of semantic search is this single number.

A complete index in 60 lines

import numpy as np, requests, json, glob

def embed(texts: list[str]) -> np.ndarray:
    r = requests.post("http://127.0.0.1:11434/api/embed",
        json={"model": "nomic-embed-text", "input": texts})
    v = np.array(r.json()["embeddings"], dtype=np.float32)
    return v / np.linalg.norm(v, axis=1, keepdims=True)

# --- build ---
chunks, meta = [], []
for path in glob.glob("docs/**/*.txt", recursive=True):
    text = open(path, encoding="utf-8").read()
    for i in range(0, len(text), 1500):           # naive chunking
        chunks.append(text[i:i+1800])
        meta.append({"file": path, "offset": i})

index = np.vstack([embed(chunks[i:i+64])
                   for i in range(0, len(chunks), 64)])
np.save("index.npy", index)
json.dump(meta, open("meta.json", "w"))

# --- search ---
def search(query: str, k: int = 5):
    q = embed([query])[0]
    scores = index @ q                  # cosine, all docs at once
    top = np.argsort(-scores)[:k]
    return [(float(scores[i]), meta[i], chunks[i][:200]) for i in top]

for s, m, preview in search("Wie lange ist die Gewährleistung?"):
    print(f"{s:.3f}  {m['file']}  {preview!r}")

When NumPy is enough — and when it isn't

Brute-force dot products handle ~500k chunks in tens of milliseconds on a CPU. Past that, or when you need filters, persistence and concurrent writers, graduate to pgvector or Qdrant. The embedding model stays the same — and it stays on your server, because every query your employees type is itself confidential data.

Want this running inside your own VPN?

Localized AI fine-tunes small open models on your data and deploys them on your hardware — GDPR by architecture, zero per-token costs. Average setup: 72 hours.

Plan my deployment