Primer · The AGI Scientist · June 20, 2026 · 12 min read
Transformers from first principles
What attention actually computes, built up from nothing — no prior deep-learning background assumed.

If you can multiply matrices and you're willing to think carefully, you can understand a transformer. This primer builds one up from the ground, skipping the jargon until you've seen the machinery.
The one idea: attention
At its core, attention is a weighted lookup. Each token asks a question (a query), and every token offers a label (a key) and carries some content (a value). How well a query matches each key sets a weight for that token, and the answer is the weighted mix of all the values. That's it; the rest is bookkeeping.
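The lookup can be made concrete with a few lines of plain Python. This is a minimal sketch, not a reference implementation: it assumes the common scaled dot-product formulation (a dot product as the "match", division by the square root of the vector size, and a softmax to turn scores into weights), none of which the text above spells out. The names `attend`, `softmax`, and `dot`, and the toy vectors, are made up for illustration.

```python
import math

def softmax(scores):
    """Turn raw match scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys, values):
    """One token's lookup: score the query against every key, then mix the values."""
    scale = math.sqrt(len(query))  # assumption: the usual 1/sqrt(d) scaling
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    mixed = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return mixed, weights

# Toy data (made up): three tokens with 2-dimensional keys and values.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [4.0, 0.0]  # lines up with the first and third keys, not the second

output, weights = attend(query, keys, values)
print(weights)  # the second token gets the smallest weight
print(output)   # a blend dominated by the first and third values
```

The query scores equally against the first and third keys and poorly against the second, so the output is mostly a blend of the first and third values. In a trained model the queries, keys, and values would come from learned projections of each token rather than being written by hand.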
Why it works
- It's content-addressed. Tokens attend to what's relevant, not just to what's nearby. Long-range structure becomes reachable in one step.
- It's parallel. A recurrent network has to walk through a sequence one position at a time; attention computes every position of a given sequence at once, which is why transformers scale on modern hardware.
- It composes. Stack attention with simple feed-forward layers and repeat, and representations get richer at every layer.
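The three properties above can be seen in a small sketch of a stacked layer. Several simplifications here are assumptions made for brevity rather than anything stated in this primer: each token serves directly as its own query, key, and value (no learned projections), each sub-layer's output is added back to its input, normalization is left out, and the same toy feed-forward weights are reused in every layer. The names `self_attention`, `feed_forward`, `block`, and `stack` are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw match scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Every position attends to every position, based only on content."""
    scale = math.sqrt(len(tokens[0]))
    out = []
    for q in tokens:  # iterations are independent, so they could all run at once
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in tokens])
        out.append([sum(w * v[i] for w, v in zip(weights, tokens))
                    for i in range(len(q))])
    return out

def feed_forward(vec, w_in, w_out):
    """Position-wise two-layer network with a ReLU in between."""
    hidden = [max(0.0, sum(x * w for x, w in zip(vec, row))) for row in w_in]
    return [sum(h * w for h, w in zip(hidden, row)) for row in w_out]

def block(tokens, w_in, w_out):
    """One layer: attention, then feed-forward, each added back to its input."""
    attended = self_attention(tokens)
    mixed = [[t + a for t, a in zip(tok, att)] for tok, att in zip(tokens, attended)]
    return [[m + f for m, f in zip(vec, feed_forward(vec, w_in, w_out))]
            for vec in mixed]

def stack(tokens, w_in, w_out, layers):
    """Apply the same block repeatedly (real models use separate weights per layer)."""
    for _ in range(layers):
        tokens = block(tokens, w_in, w_out)
    return tokens

# Toy inputs and weights (made up): 3 tokens, 2 dimensions, 3 hidden units.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_IN = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]   # hidden units x input dims
W_OUT = [[0.5, 0.0, 0.1], [0.0, 0.5, -0.1]]    # output dims x hidden units

result = stack(tokens, W_IN, W_OUT, layers=2)
print(result)
```

Because nothing in this sketch looks at where a token sits, reordering the input tokens simply reorders the outputs: the lookup depends on content alone. The loop inside `self_attention` has no dependency between iterations, which is what lets real implementations replace it with a single matrix multiplication.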
What to read next
Once the mechanics click, the interesting questions are mechanistic: which heads do what, and how do circuits form? That's where interpretability begins — see the open experiments in the research feed.