Understanding Transformers from First Principles

The transformer architecture, introduced in "Attention Is All You Need" (2017), has revolutionized machine learning. Let's understand it from first principles.

Self-Attention Mechanism

The core innovation is self-attention, which allows each position in a sequence to attend to all other positions:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V

Where:

  • Q (queries), K (keys), V (values) are linear projections of the input
  • d_k is the dimension of the keys; dividing by √d_k keeps the dot products from growing too large as d_k increases
  • The softmax turns the scaled scores into attention weights that sum to 1
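The formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the function names and the shapes chosen for Q, K, and V are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the row max for numerical stability before exponentiating
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_q, seq_k) similarity scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted average of the values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))  # 4 query positions, d_k = 8
K = rng.normal(size=(6, 8))  # 6 key positions
V = rng.normal(size=(6, 8))  # one value vector per key
out = attention(Q, K, V)     # one output vector per query: shape (4, 8)
```

Each output row is a convex combination of the value vectors, with the mixing weights determined by query-key similarity.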

Why Transformers Work

  1. Parallelization: Unlike RNNs, all positions can be computed simultaneously
  2. Long-range dependencies: Direct connections between any two positions
  3. Scalability: Performance improves predictably as model size, data, and compute grow
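The parallelization point can be made concrete. In the sketch below (shapes and the toy RNN update are illustrative assumptions), the RNN must loop position by position because each state depends on the previous one, while attention scores for all pairs of positions fall out of a single matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)
seq, d = 6, 4
x = rng.normal(size=(seq, d))

# RNN-style: each hidden state depends on the previous one,
# so the loop over positions cannot be parallelized
W = rng.normal(size=(d, d)) * 0.1
h = np.zeros(d)
rnn_states = []
for t in range(seq):
    h = np.tanh(x[t] + h @ W)
    rnn_states.append(h)

# Attention-style: similarity scores between *all* pairs of positions
# come from one matrix product, so every position is computed at once
scores = x @ x.T / np.sqrt(d)   # (seq, seq) in a single matmul
```

The same contrast explains the long-range dependency point: position 0 and position seq-1 interact directly through one entry of `scores`, instead of through seq intermediate RNN steps.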

Multi-Head Attention

Instead of a single attention function, we run several attention "heads" in parallel, each with its own learned projections:

def multi_head_attention(x, num_heads=8):
    # each head attends with its own learned Q/K/V projections
    heads = [attention(x, head_dim) for _ in range(num_heads)]
    # concatenate the heads and mix them with the output projection W_o
    return concat(heads) @ W_o

This allows the model to attend to information from different representation subspaces.
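The pseudocode can be fleshed out into a runnable NumPy sketch. The weight shapes and the split of d_model into num_heads equal slices are assumptions for illustration; real implementations fold the per-head loop into batched tensor operations.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads=8):
    # x: (seq, d_model); each weight matrix: (d_model, d_model)
    seq, d_model = x.shape
    head_dim = d_model // num_heads
    # project once, then split the feature dimension across heads
    Q = (x @ W_q).reshape(seq, num_heads, head_dim)
    K = (x @ W_k).reshape(seq, num_heads, head_dim)
    V = (x @ W_v).reshape(seq, num_heads, head_dim)
    heads = []
    for h in range(num_heads):
        # scaled dot-product attention within head h's subspace
        scores = Q[:, h] @ K[:, h].T / np.sqrt(head_dim)
        heads.append(softmax(scores) @ V[:, h])
    # concatenate the heads and mix them with the output projection
    return np.concatenate(heads, axis=-1) @ W_o

rng = np.random.default_rng(0)
d_model, seq = 32, 5
x = rng.normal(size=(seq, d_model))
W_q, W_k, W_v, W_o = [rng.normal(size=(d_model, d_model)) * 0.1
                      for _ in range(4)]
out = multi_head_attention(x, W_q, W_k, W_v, W_o)  # shape (5, 32)
```

Because each head works in its own head_dim-sized slice of the projected features, different heads are free to specialize in different relationships between positions.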