Model Architecture · updated 2026-05-12
Transformers
Self-attention as a sequence operator — and why it generalised.
A working note on the transformer family — not a tutorial, more a record of what I've found useful to remember.
What it is
The transformer is the architectural primitive that replaced recurrence and convolution for sequence modelling. Its core idea is self-attention: every position in a sequence directly attends to every other position, producing a weighted mix that the next layer operates on. No memory state. No locality assumption.
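A minimal sketch of that "weighted mix", assuming a single head, no mask, and NumPy. The names `self_attention`, `w_q`, `w_k`, `w_v` and the toy shapes are mine, chosen for illustration, not taken from any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating, for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head, unmasked self-attention over x of shape (seq_len, d_model)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)       # (seq_len, seq_len): every position vs. every position
    weights = softmax(scores, axis=-1)  # each row is a distribution over positions
    return weights @ v, weights         # output row i is a weighted mix of all value rows

# Toy sizes, assumed for illustration only.
rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape, weights.shape)
print(weights.sum(axis=-1))  # every row sums to 1
```

Nothing in the function carries state from one position to the next, and nothing restricts which positions can interact. That is the "no memory state, no locality assumption" part.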
Why it worked
Three properties compound:
- Parallel training: within a layer, no position's output depends on another position's output, so the whole sequence is processed at once as a handful of batched matmuls. RNNs had to step through positions one at a time.
- Bandwidth between positions: every token sees every token. CNNs grow their receptive field layer by layer; transformers have the full receptive field from layer 1.
- Inductive bias removed, not added: transformers don't assume much about the data. That made them general. The price was needing more data to learn what CNNs and RNNs assumed for free.
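The parallel-training point can be checked numerically. The sketch below assumes a causal mask (decoder-style training, which this note doesn't otherwise discuss) and compares attention computed for all positions at once against the same thing computed one position at a time. Both function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(q, k, v):
    """All positions in one shot; position t may only attend to positions <= t."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True strictly above the diagonal
    scores = np.where(future, -np.inf, scores)           # masked entries get zero weight
    return softmax(scores) @ v

def causal_attention_stepwise(q, k, v):
    """Same result, one position at a time. No step uses a previous step's output."""
    n, d = q.shape
    rows = []
    for t in range(n):
        s = q[t] @ k[: t + 1].T / np.sqrt(d)  # scores against the prefix 0..t
        rows.append(softmax(s) @ v[: t + 1])
    return np.stack(rows)

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(6, 4)) for _ in range(3))
same = np.allclose(causal_attention(q, k, v), causal_attention_stepwise(q, k, v))
print(same)
```

The loop body only reads the inputs, which is why it collapses into a single masked matmul. An RNN's step t needs the hidden state from step t-1, so its loop can't be collapsed the same way.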
What I keep coming back to
- Attention is softmax(QKᵀ / √d) V. The √d keeps the softmax from saturating as d grows. Easy to forget; load-bearing.
- Positional information is added, not implicit. The architecture is permutation-equivariant otherwise.
- The FFN block (linear → GELU → linear) is where most of the parameters live. Attention does the routing; the FFN does the thinking.
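All three points above can be sanity-checked in a few lines. This is a rough sketch under my own assumptions: random Gaussian inputs, a single head with no positional encoding, and for the parameter count a d_ff = 4 × d_model FFN plus an output projection on the attention side, with biases, layer norms, and embeddings ignored. None of those sizes come from this note.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, scale=True):
    """softmax(QK^T / sqrt(d)) V, with the scaling switchable for comparison."""
    scores = q @ k.T
    if scale:
        scores = scores / np.sqrt(q.shape[-1])
    weights = softmax(scores)
    return weights @ v, weights

rng = np.random.default_rng(0)
n, d = 16, 512

# 1. Saturation: without the sqrt(d) scaling the scores have a spread of about
#    sqrt(d), so each row of weights collapses towards a single position.
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))
_, w_scaled = attention(q, k, v, scale=True)
_, w_raw = attention(q, k, v, scale=False)
peak_scaled = w_scaled.max(axis=-1).mean()  # average largest weight per row
peak_raw = w_raw.max(axis=-1).mean()
print(peak_scaled, peak_raw)

# 2. Permutation equivariance: with no positional information, shuffling the
#    input rows just shuffles the output rows the same way.
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

def self_attn(x):
    return attention(x @ w_q, x @ w_k, x @ w_v)[0]

x = rng.normal(size=(n, d))
perm = rng.permutation(n)
equivariant = np.allclose(self_attn(x[perm]), self_attn(x)[perm])
print(equivariant)

# 3. Where the parameters live, per block, under the assumptions stated above.
d_model = 512
d_ff = 4 * d_model                   # assumed expansion factor
attn_params = 4 * d_model * d_model  # Q, K, V and output projections
ffn_params = 2 * d_model * d_ff      # linear -> GELU -> linear
ffn_share = ffn_params / (attn_params + ffn_params)
print(ffn_share)                     # 2/3 under these assumptions
```

With a different expansion factor, or with a large embedding table counted in, the share shifts, so "most of the parameters" is a statement about the block rather than about the whole model.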
Open questions I haven't resolved
- Why does attention generalise so well at scale when it shouldn't, by classical statistical learning theory?
- What is the right way to think about depth vs. width for transformers?