Model Architecture (updated 2026-05-12)

Transformers

Self-attention as a sequence operator — and why it generalised.

A working note on the transformer family — not a tutorial, more a record of what I've found useful to remember.

What it is

The transformer is the architectural primitive that replaced recurrence and convolution for sequence modelling. Its core idea is self-attention: every position in a sequence directly attends to every other position, producing a weighted mix that the next layer operates on. No memory state. No locality assumption.
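
A minimal sketch of that weighted mix, in plain Python so nothing is hidden. The assumptions are mine: a single head, no learned projections (Q = K = V = the input), and the helper names `softmax` and `self_attention` are only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Single-head self-attention with Q = K = V = x (assumption: no projections)."""
    d = len(x[0])
    out, weights = [], []
    for q in x:
        # Score this position against every position, itself included.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        w = softmax(scores)
        weights.append(w)
        # The output is a weighted mix of all value vectors.
        out.append([sum(w_j * v[c] for w_j, v in zip(w, x)) for c in range(d)])
    return out, weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed, attn = self_attention(tokens)
```

Each row of `attn` sums to 1 and every entry is positive, so each row of `mixed` is a convex combination of all input rows, with no state carried from one position to the next.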

Why it worked

Three properties compound:

Parallel training: within a layer, no position waits on another position's output, so the whole sequence is processed at once as a handful of batched matmuls (with a causal mask when training autoregressively). RNNs had to step through time one position at a time.
Bandwidth between positions: every token sees every token. CNNs grow their receptive field layer by layer; transformers have the full receptive field from layer 1.
Inductive bias removed, not added: transformers assume little about the data, neither locality nor sequential order. That made them general. The price was needing more data to learn what CNNs/RNNs assumed for free.
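
The first property above is about dependency structure rather than raw speed, and a toy sketch makes that concrete. Assumptions of mine: scalar tokens, a made-up one-weight RNN (`rnn_states`), and an unprojected attention read-out (`attend_one`). None of these names come from a library.

```python
import math


def rnn_states(xs, w=0.5, u=1.0):
    """Toy RNN: the state at step t needs the state from step t - 1."""
    h, states = 0.0, []
    for x in xs:  # this loop cannot be reordered
        h = math.tanh(w * h + u * x)
        states.append(h)
    return states


def attend_one(i, xs):
    """Attention output at position i; reads inputs only, never other outputs."""
    scores = [xs[i] * xj for xj in xs]  # d = 1, so the 1/sqrt(d) factor is 1
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return sum(e / z * xj for e, xj in zip(exps, xs))


xs = [0.2, -1.0, 0.7, 1.5]

# Positions can be computed in any order, hence all at once.
forward = [attend_one(i, xs) for i in range(len(xs))]
backward = [attend_one(i, xs) for i in reversed(range(len(xs)))][::-1]
```

`forward` and `backward` are identical because no attention output feeds another one within the layer. The RNN has no such freedom, since changing the first input changes every later state.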

What I keep coming back to

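Attention is softmax(QKᵀ / √d) V, where d is the per-head query/key dimension. For roughly unit-variance components, the variance of a query-key dot product grows linearly with d, so without the √d the softmax saturates and its gradients become tiny. Easy to forget; load-bearing.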
Positional information is added, not implicit. The architecture is permutation-equivariant otherwise.
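The FFN block (linear → GELU → linear) is where most of each layer's parameters live: with the common 4× hidden expansion it holds about two-thirds of a block's weights (8d² against 4d² for the attention projections). Attention does the routing; the FFN does the thinking.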
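
The three points above can each be checked in a few lines. This is a sketch under my own assumptions: Gaussian random vectors with d = 512 for the scaling check, unprojected single-head attention for the permutation check, and d_model = 768 with 4× expansion (biases, layer norms and embeddings ignored) for the parameter count. All names are illustrative.

```python
import math
import random


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention(x):
    """Unprojected single-head attention (assumption: Q = K = V = x)."""
    d = len(x[0])
    out = []
    for q in x:
        w = softmax([dot(q, k) / math.sqrt(d) for k in x])
        out.append([sum(wj * v[c] for wj, v in zip(w, x)) for c in range(d)])
    return out


# 1. The sqrt(d) scaling: compare the largest attention weight with and without it.
rng = random.Random(0)
d = 512
q = [rng.gauss(0, 1) for _ in range(d)]
keys = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(8)]
raw = [dot(q, k) for k in keys]
peak_unscaled = max(softmax(raw))
peak_scaled = max(softmax([s / math.sqrt(d) for s in raw]))

# 2. Permutation equivariance: shuffling inputs shuffles outputs the same way.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
perm = [2, 0, 1]
permuted_then_attended = attention([x[i] for i in perm])
attended = attention(x)
attended_then_permuted = [attended[i] for i in perm]

# 3. Weight count per block: four d x d attention projections (Q, K, V, output)
#    against two d x 4d FFN matrices.
d_model, expansion = 768, 4
attn_params = 4 * d_model * d_model
ffn_params = 2 * expansion * d_model * d_model
ffn_share = ffn_params / (attn_params + ffn_params)
```

The unscaled peak is always at least as large as the scaled one, and at d = 512 it is typically close to 1, which is the saturation the √d prevents. The two permutation results agree up to floating-point rounding, which is why position has to be injected separately. `ffn_share` comes out at two-thirds under these assumptions.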

Open questions I haven't resolved

Why does attention generalise so well at scale when it shouldn't, by classical statistical learning theory?
What is the right way to think about depth vs. width for transformers?