How LLMs Actually Work: Tokens, Attention and Why Small Models Got Good

Text becomes numbers: tokenization

An LLM never sees words. A tokenizer splits text into subword units (Maschinenbau might become Masch + inen + bau, depending on the learned vocabulary) and maps each unit to an integer ID. Current open models typically use a vocabulary of roughly 32k–150k tokens; everything the model reads or writes is a sequence of these integers.
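To make the text-to-integers step concrete, here is a minimal sketch. The vocabulary is a hypothetical three-and-a-bit-piece toy, and the greedy longest-match rule is a simplification: production tokenizers learn their pieces from data (BPE or unigram models) and fall back to bytes instead of failing on unknown text.

```python
# Hypothetical toy vocabulary; real tokenizers learn tens of thousands of pieces from data.
VOCAB = {"Masch": 0, "inen": 1, "bau": 2, "Haus": 3}
ID_TO_PIECE = {i: piece for piece, i in VOCAB.items()}

def encode(text, vocab=VOCAB):
    """Greedy longest-match split of text into vocabulary pieces, returned as IDs."""
    ids, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                ids.append(vocab[text[i:j]])
                i = j
                break
        else:
            # Real tokenizers avoid this case with single-character or byte-level pieces.
            raise ValueError(f"no vocabulary piece matches at position {i}")
    return ids

def decode(ids):
    """Map IDs back to their pieces and join them into text."""
    return "".join(ID_TO_PIECE[i] for i in ids)

ids = encode("Maschinenbau")
print(ids)          # [0, 1, 2]
print(decode(ids))  # Maschinenbau
```

The model itself only ever handles the list [0, 1, 2]; the strings exist purely at the edges, in encode and decode.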

Attention is a lookup, not magic

The transformer's core trick: for every token, compute a query and compare it against the keys of that token and all tokens before it. High similarity means "this earlier token is relevant right now," and that token's value gets blended into the representation in proportion to the similarity. That is the whole attention mechanism, repeated across 30+ layers and many heads in parallel:

import numpy as np

def attention(Q, K, V):
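    # scores: how relevant is each token (itself and earlier ones) to each current token?
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # causal mask: a token may not look into the future
    mask = np.triu(np.ones_like(scores), k=1) * -1e9
    scores = scores + mask
    # subtract each row's max before exponentiating so np.exp cannot overflow
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # each row now sums to 1 (softmax)
    return weights @ V  # weighted blend of current and past values

# 4 tokens, 8-dim heads: the real thing is just this, bigger
# (a real model derives Q, K and V from separate learned projections; one array is reused here for brevity)
Q = K = V = np.random.randn(4, 8)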
print(attention(Q, K, V).shape)  # (4, 8)
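The "many heads in parallel" part can be sketched the same way. This is an illustrative toy, not any particular model's implementation: the projection matrices Wq, Wk, Wv and Wo are random stand-ins for learned weights, and the sizes (16-dim model, 2 heads) are picked only to keep the output readable.

```python
import numpy as np

def causal_attention(Q, K, V):
    """Single-head causal attention with a numerically stable softmax."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    allowed = np.tril(np.ones_like(scores, dtype=bool))  # token i may see tokens 0..i
    scores = np.where(allowed, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    """Project x to queries/keys/values, attend per head, concatenate, project back."""
    d_model = x.shape[-1]
    d_head = d_model // n_heads
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    heads = []
    for h in range(n_heads):
        cols = slice(h * d_head, (h + 1) * d_head)  # this head's slice of the features
        heads.append(causal_attention(Q[:, cols], K[:, cols], V[:, cols]))
    return np.concatenate(heads, axis=-1) @ Wo

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16))  # 4 tokens, 16-dim model
Wq, Wk, Wv, Wo = (rng.standard_normal((16, 16)) * 0.1 for _ in range(4))
out = multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads=2)
print(out.shape)  # (4, 16)
```

Each head sees only its own slice of the features, so different heads are free to track different kinds of relevance; the final projection mixes their results back together.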

Generation: one token at a time, with a cache

Inference is a loop: predict a probability distribution over the vocabulary, sample one token, append it, repeat. The KV cache stores the keys and values of all previous tokens, so each new step only computes attention for the newest token. The cache grows linearly with context length, and at long contexts it, not the weights, is what eats your VRAM.
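A minimal sketch of what the cache buys you, plus the memory arithmetic behind the VRAM point. The KVCache class and kv_cache_bytes helper are hypothetical names for illustration, and the model shape in the size estimate (32 layers, 128-dim heads, 16-bit values) is an assumed 7B-class configuration, not a specific product.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class KVCache:
    """Keys and values of every token seen so far, for one attention head."""
    def __init__(self, head_dim):
        self.K = np.empty((0, head_dim))
        self.V = np.empty((0, head_dim))

    def step(self, q, k, v):
        # Append the newest token's key and value, then attend from its query only.
        self.K = np.vstack([self.K, k])
        self.V = np.vstack([self.V, v])
        weights = softmax(q @ self.K.T / np.sqrt(q.shape[-1]))
        return weights @ self.V

rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 5, 8))  # 5 tokens, 8-dim head

# Incremental decoding: one query per step against the growing cache.
cache = KVCache(head_dim=8)
cached_out = np.stack([cache.step(Q[t], K[t], V[t]) for t in range(5)])

# Reference: recompute full causal attention over all 5 tokens at once.
scores = Q @ K.T / np.sqrt(8)
scores = np.where(np.tril(np.ones((5, 5), dtype=bool)), scores, -np.inf)
full_out = softmax(scores) @ V
print(np.allclose(cached_out, full_out))  # True

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

GIB = 1024 ** 3
print(kv_cache_bytes(32, 32, 128, 32_768) / GIB)  # 16.0 (every head keeps its own K/V)
print(kv_cache_bytes(32, 8, 128, 32_768) / GIB)   # 4.0  (8 shared KV heads, grouped-query style)
```

For scale: 7 billion weights at 16 bits take about 13 GiB, so under these assumptions a single 32k-token sequence without shared KV heads already needs more memory for its cache than for the model.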

Why small models got good

Three shifts between 2023 and 2026: training on far more tokens per parameter (well past the Chinchilla-optimal ratio of roughly 20), distillation from large teachers, and much better data curation. The result: a modern 3–7B model fine-tuned on your domain can match or beat a 2023-era 70B generalist on your tasks, while running on one workstation GPU inside your own building. That asymmetry is the entire economic basis of on-premises AI.
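Of the three shifts, distillation is the easiest to show in a few lines. The sketch below uses the classic soft-label formulation (temperature-softened teacher and student distributions compared with KL divergence, scaled by the squared temperature); whether a given small model was trained exactly this way is an assumption, since recipes differ and many are unpublished. The logits are random placeholders, not real model outputs.

```python
import numpy as np

def log_softmax(logits, temperature=1.0):
    """Log-probabilities of temperature-scaled logits, computed stably."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Mean KL(teacher || student) over positions, on softened distributions."""
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    p_teacher = np.exp(log_p_teacher)
    kl = (p_teacher * (log_p_teacher - log_p_student)).sum(axis=-1)
    # The temperature**2 factor keeps gradient magnitudes comparable across temperatures.
    return float(kl.mean() * temperature ** 2)

rng = np.random.default_rng(0)
teacher_logits = rng.standard_normal((4, 10))  # 4 positions, 10-token toy vocabulary
student_logits = teacher_logits + 0.5 * rng.standard_normal((4, 10))

print(distillation_loss(student_logits, teacher_logits))  # positive: student disagrees with teacher
print(distillation_loss(teacher_logits, teacher_logits))  # 0.0: identical distributions
```

The point of the soft targets is that the student learns from the teacher's whole distribution at every position, which carries far more signal per token than the single correct next token.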

Want this running inside your own VPN?

Localized AI fine-tunes small open models on your data and deploys them on your hardware — GDPR by architecture, zero per-token costs. Average setup: 72 hours.
