The transformer block is the heart of GPT. This chapter dissects MicroGPT's implementation of multi-head causal self-attention, the position-wise feed-forward network, and the layer normalization that stabilizes training. Each of these components has a specific, well-motivated role, and understanding why they exist together is as important as understanding what each one computes individually.
**Self-attention** allows every token in a sequence to gather information from every other token. For each token, the mechanism produces three vectors — a **query** (Q), a **key** (K), and a **value** (V) — by passing the token's embedding through three separate learned linear projections. The attention score between two tokens is the dot product of one token's query with another token's key, divided by the square root of the head dimension to prevent the softmax from saturating in high-dimensional space.
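As a concrete illustration, here is a minimal single-head version in NumPy. The names (`attention`, `Wq`, etc.) are illustrative, not MicroGPT's actual API, and the causal mask is deliberately omitted until the next section:

```python
import numpy as np

def attention(x, Wq, Wk, Wv):
    """Single-head self-attention (no causal mask). x: (T, d_model)."""
    # Project each token's embedding into query, key, and value vectors.
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    d_k = Q.shape[-1]
    # Scaled dot-product scores: a (T, T) matrix of query-key similarities,
    # divided by sqrt(d_k) so the softmax does not saturate.
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax turns each token's scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each token's output is a weighted sum of all value vectors.
    return weights @ V, weights

rng = np.random.default_rng(0)
T, d_model, d_head = 4, 8, 8
x = rng.normal(size=(T, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
out, weights = attention(x, Wq, Wk, Wv)
```

Each row of `weights` sums to 1: every token distributes a fixed budget of attention over the whole sequence.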
The word **causal** is critical for language modeling. During training, the model should predict the next token using only past tokens — it must not 'cheat' by looking at future tokens. MicroGPT enforces this with a **causal mask**: a strictly upper-triangular matrix of negative-infinity values that is added to the attention scores before the softmax. After softmax, the masked positions carry zero weight, so no token attends to the tokens that come after it; information flows only from past positions to the present one. This masking is what transforms generic self-attention into the autoregressive attention needed for text generation.
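The mask's effect is easy to see in a small NumPy sketch. All-zero scores are used here purely so the surviving weights are easy to read; in the real model the scores come from query-key dot products:

```python
import numpy as np

T = 4
# Strictly upper-triangular mask: -inf above the diagonal (future positions),
# 0 on and below it (current and past positions).
mask = np.triu(np.full((T, T), -np.inf), k=1)
scores = np.zeros((T, T)) + mask  # add mask to (here, all-zero) scores
# Softmax: exp(-inf) = 0, so masked positions receive zero weight.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
# Row i now attends uniformly over positions 0..i; future positions get 0.
```

Row 1, for example, becomes `[0.5, 0.5, 0, 0]`: token 1 splits its attention between tokens 0 and 1 and ignores everything later.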
**Multi-head** attention runs several attention operations in parallel with different learned projections, then concatenates their outputs. Each head can specialize in a different type of relationship — one head might track subject-verb agreement, another might follow coreference chains. The number of heads is a hyperparameter; MicroGPT keeps it small but retains this multi-head structure because it is fundamental to the GPT architecture.
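The split-into-heads bookkeeping can be sketched as follows, again in NumPy with illustrative names (real implementations batch this and add the causal mask, but the reshapes are the same idea):

```python
import numpy as np

def multi_head(x, Wq, Wk, Wv, Wo, n_head):
    """Multi-head self-attention sketch. x: (T, d_model); d_model % n_head == 0."""
    T, d_model = x.shape
    d_head = d_model // n_head
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    # Split the projected vectors into heads: (T, d_model) -> (n_head, T, d_head).
    split = lambda M: M.reshape(T, n_head, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    # Each head computes its own (T, T) attention pattern independently.
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    heads = w @ Vh                                        # (n_head, T, d_head)
    # Concatenate the heads back into (T, d_model), then apply an output projection.
    concat = heads.transpose(1, 0, 2).reshape(T, d_model)
    return concat @ Wo

rng = np.random.default_rng(2)
T, d_model, n_head = 4, 8, 2
x = rng.normal(size=(T, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))
y = multi_head(x, Wq, Wk, Wv, Wo, n_head)
```

Note that the heads share no parameters: each works in its own `d_head`-dimensional slice of the projections, which is what lets them specialize.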
After multi-head attention allows tokens to communicate with each other, a **feed-forward network** (FFN) is applied independently to each token's representation. This is a simple two-layer MLP: a linear projection that expands the embedding dimension by a factor of 4, a nonlinear activation (typically ReLU or GELU), and a second linear projection that contracts back to the original embedding dimension.
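The FFN itself is only a few lines. This NumPy sketch uses ReLU and the conventional 4x expansion; the names are illustrative rather than MicroGPT's actual API:

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward network: expand, activate, contract."""
    h = np.maximum(0, x @ W1 + b1)  # expand to (T, 4*d_model), ReLU activation
    return h @ W2 + b2              # contract back to (T, d_model)

rng = np.random.default_rng(1)
T, d_model = 4, 8
x = rng.normal(size=(T, d_model))
W1 = rng.normal(size=(d_model, 4 * d_model))
b1 = np.zeros(4 * d_model)
W2 = rng.normal(size=(4 * d_model, d_model))
b2 = np.zeros(d_model)
out = ffn(x, W1, b1, W2, b2)
```

Because the same weights are applied to every row of `x`, running the FFN on a single token in isolation gives exactly the same result as running it on the whole sequence — the position-independence discussed below.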
Why expand and contract? The expansion creates a larger intermediate space where the model can perform more complex per-token computations. The factor of 4 is not theoretically derived — it is an empirical finding from the original 'Attention Is All You Need' paper that has been widely replicated. Think of the attention layer as the 'communication' step (tokens share information) and the FFN as the 'computation' step (each token thinks about what it learned).
Because the FFN is applied to each position independently and identically (the same weights for every position), it adds no additional positional bias. All position-sensitive computation happens in the attention layer. This separation of concerns — attention for routing information, FFN for processing it — is one of the elegant design properties of the transformer.
Two structural elements hold the transformer block together: **residual connections** and **layer normalization**. A residual connection adds the block's input directly to its output (`x = x + sublayer(x)`). This creates a gradient highway during backpropagation — gradients can flow directly from the loss to early layers without passing through all the multiplicative operations in the attention and FFN layers. Without residuals, deep transformers are extremely difficult to train.
**Layer normalization** normalizes the activations across the embedding dimension (not the batch dimension, unlike batch normalization). It is applied before each sublayer in MicroGPT's implementation — this 'pre-norm' arrangement, rather than the 'post-norm' arrangement used in the original transformer paper, has been found to produce more stable training in practice and is the convention followed by GPT-2 and later models.
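A minimal layer-norm sketch makes the "normalize across the embedding dimension" point concrete. The learnable per-feature scale `gamma` and shift `beta` are the standard parameters; the names here are illustrative:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each token's vector across the embedding dimension."""
    # Statistics are computed per token (axis=-1), not per batch.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    # eps guards against division by zero; gamma/beta let the model
    # learn to undo the normalization where that helps.
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(3)
d_model = 8
x = rng.normal(size=(4, d_model))
y = layer_norm(x, np.ones(d_model), np.zeros(d_model))
```

With `gamma = 1` and `beta = 0`, every row of the output has mean 0 and standard deviation (approximately) 1, regardless of the scale of the input.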
Together, residuals and layer norm make it possible to stack many transformer blocks without training instability. MicroGPT keeps the number of layers small, but the architecture scales to dozens or hundreds of layers by the same mechanism. When you see `Block` — a class that wraps one attention layer and one FFN layer with these stabilization structures — recognize it as the fundamental repeating unit of every GPT-class model.
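Putting the pieces together, the pre-norm wiring of one block can be sketched in a few lines. Here `attn`, `ffn`, `ln1`, and `ln2` are placeholders for the sublayers described above, not MicroGPT's actual `Block` class:

```python
import numpy as np

def block(x, attn, ffn, ln1, ln2):
    """Pre-norm transformer block sketch: normalize, apply sublayer, add residual."""
    x = x + attn(ln1(x))  # communicate: tokens exchange information
    x = x + ffn(ln2(x))   # compute: each token processes what it gathered
    return x

# The residual path is an identity highway: if both sublayers output
# zero, the block passes its input through unchanged.
x = np.arange(8.0).reshape(2, 4)
identity = lambda v: v
zero = lambda v: np.zeros_like(v)
y = block(x, zero, zero, identity, identity)
```

Stacking the model is then just `for blk in blocks: x = blk(x)` — the same two-line pattern repeated, which is why the block is the fundamental repeating unit.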