Text becomes numbers: tokenization
An LLM never sees words. A tokenizer splits text into subword units — Maschinenbau might become Masch + inen + bau — and maps each to an integer. A 7B model typically has a vocabulary of 32k–150k tokens; everything it will ever say is a sequence of these integers.
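To make the split concrete, here is a minimal sketch of a subword lookup. It assumes a tiny hand-written vocabulary (`toy_vocab`, purely hypothetical) and uses greedy longest-match; production tokenizers learn their vocabulary from data, for example with byte-pair encoding, so the actual pieces and IDs will differ from model to model.

```python
def tokenize(text, vocab):
    """Greedy longest-match subword split; returns (pieces, ids)."""
    pieces, ids = [], []
    i = 0
    while i < len(text):
        # Try the longest remaining candidate first, then shrink it
        # one character at a time until something is in the vocabulary.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                pieces.append(piece)
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary entry covers {text[i]!r}")
    return pieces, ids

# Hypothetical five-entry vocabulary, for illustration only.
toy_vocab = {"Masch": 0, "inen": 1, "bau": 2, "Ma": 3, "sch": 4}

pieces, ids = tokenize("Maschinenbau", toy_vocab)
print(pieces)  # ['Masch', 'inen', 'bau']
print(ids)     # [0, 1, 2]
```

The model only ever receives the list of integers; the string pieces exist for humans reading the output.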
Attention is a lookup, not magic
The transformer's core trick: for every token, compute a query and compare it against the keys of that token and all tokens before it. High similarity means "this earlier token is relevant right now," and that token's value gets a larger weight in the blended representation. That is the whole mechanism, repeated across 30+ layers and many heads in parallel:
import numpy as np

def attention(Q, K, V):
    # scores: how relevant is each past token to each current token?
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # causal mask: a token may not look into the future
    mask = np.triu(np.ones_like(scores), k=1) * -1e9
    scores = scores + mask
    # softmax over each row; subtracting the row max avoids overflow in exp
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # weighted blend of past values

# 4 tokens, 8-dim heads: the real thing is just this, bigger
Q = K = V = np.random.randn(4, 8)
print(attention(Q, K, V).shape)  # (4, 8)
Generation: one token at a time, with a cache
Inference is a loop: predict a probability distribution over the vocabulary, sample one token, append, repeat. The KV cache stores the keys and values of all previous tokens, so each new step only computes attention for the newest one. The cache grows linearly with context length, and at long contexts it can take more VRAM than the weights themselves.
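The caching idea and its memory cost can be sketched in a few lines. The snippet below is an illustration under stated assumptions, not a real inference engine: random vectors stand in for the model's actual queries, keys, and values, and the layer count, head count, head size, and 16-bit precision passed to `kv_cache_bytes` are hypothetical round numbers in the range of a 7B-class model.

```python
import numpy as np

def attend_newest(q, K_cache, V_cache):
    """Attention for the newest token only, against all cached keys/values."""
    scores = K_cache @ q / np.sqrt(q.shape[-1])  # one score per cached token
    weights = np.exp(scores - scores.max())      # stable softmax
    weights /= weights.sum()
    return weights @ V_cache

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Cache size: keys and values (the factor 2) for every layer and position."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

rng = np.random.default_rng(0)
d = 8
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
outputs = []
for step in range(4):
    # Stand-ins for the newest token's query, key, and value.
    q, k, v = rng.standard_normal((3, d))
    # Append once; earlier rows are never recomputed.
    K_cache = np.vstack([K_cache, k])
    V_cache = np.vstack([V_cache, v])
    outputs.append(attend_newest(q, K_cache, V_cache))

print(K_cache.shape)  # (4, 8)

# Assumed config: 32 layers, 32 KV heads, head_dim 128, 16-bit values, 32k context.
print(kv_cache_bytes(32, 32, 128, 32_768) / 2**30)  # 16.0 (GiB)
```

Under these assumed numbers the cache for a single 32k-token sequence is larger than the roughly 14 GB that 7 billion 16-bit weights occupy. Models that share keys and values across heads shrink the `n_kv_heads` factor and with it the cache.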
Why small models got good
Three shifts between 2023 and 2026: training on far more tokens per parameter (Chinchilla-style and beyond), distillation from large teachers, and much better data curation. The result: a modern 3–7B model fine-tuned on your domain can beat a 2023-era 70B generalist on your tasks, while running on one workstation GPU inside your own building. That asymmetry is the entire economic basis of on-premises AI.
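Of the three shifts, distillation is the easiest to show in code. The following is a minimal sketch of the classic soft-label formulation, in which the student is trained to match the teacher's temperature-softened output distribution. The logits, the temperature of 2.0, and the function names are illustrative assumptions; real pipelines usually combine this term with an ordinary next-token loss, and many distill by training on teacher-generated text instead.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # stability: avoid overflow in exp
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Mean KL(teacher || student) over temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)  # teacher's soft targets
    q = softmax(student_logits, temperature)  # student's predictions
    kl = (p * (np.log(p) - np.log(q))).sum(axis=-1)
    # The temperature**2 factor is the conventional rescaling that keeps
    # gradient magnitudes comparable across temperatures.
    return kl.mean() * temperature ** 2

# Made-up logits for two positions over a three-token vocabulary.
teacher = np.array([[4.0, 1.0, 0.5], [0.2, 3.0, 0.1]])
student_matching = teacher.copy()
student_uniform = np.zeros_like(teacher)

loss_matching = distillation_loss(student_matching, teacher)
loss_uniform = distillation_loss(student_uniform, teacher)
print(loss_matching < loss_uniform)  # True
```

The soft targets carry more signal than a single correct token: they also tell the student which wrong answers the teacher considers nearly right.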
Localized AI fine-tunes small open models on your data and deploys them on your hardware — GDPR by architecture, zero per-token costs. Average setup: 72 hours.