Table of Contents

1. Orientation: One File, One Model, One Pipeline
2. Tokenization: From Text to Integers and Back
3. Embeddings and Positional Encoding: Giving Tokens Meaning and Position
4. The Transformer Block: Attention, Feed-Forward, and Layer Normalization
5. The GPT Model Class: Assembling the Full Forward Pass
6. Training Loop: Optimization and the Learning Process
7. Text Generation: Autoregressive Inference and Sampling

The Transformer Block: Attention, Feed-Forward, and Layer Normalization

The transformer block is the heart of GPT. This chapter dissects MicroGPT's implementation of multi-head causal self-attention, the position-wise feed-forward network, and the layer normalization that stabilizes training. Each of these components has a specific, well-motivated role, and understanding why they exist together is as important as understanding what each one computes individually.

Causal Self-Attention: How Tokens Look at Each Other

**Self-attention** allows every token in a sequence to gather information from every other token. For each token, the mechanism produces three vectors — a **query** (Q), a **key** (K), and a **value** (V) — by passing the token's embedding through three separate learned linear projections. The attention score between two tokens is the dot product of one token's query with another token's key, scaled by the square root of the head dimension to prevent the softmax from saturating in high-dimensional space.
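The mechanics fit in a few lines of plain Python. The sketch below is illustrative, not MicroGPT's actual code: it computes scaled dot-product attention for a single query vector over plain float lists, in the same list-based spirit as microgpt.py.

```python
import math

def softmax(logits):
    # subtract the max for numerical stability, then normalize to sum to 1
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, ks, vs):
    # scaled dot-product attention for one query vector q against
    # lists of key and value vectors (one per visible token)
    d = len(q)
    scores = [sum(qj * kj for qj, kj in zip(q, k)) / math.sqrt(d) for k in ks]
    weights = softmax(scores)
    # output: attention-weighted average of the value vectors
    return [sum(w * v[j] for w, v in zip(weights, vs)) for j in range(len(vs[0]))]

# toy example: two tokens, head dimension 2; the query matches key 0 more strongly
q = [1.0, 0.0]
ks = [[1.0, 0.0], [0.0, 1.0]]
vs = [[1.0, 1.0], [0.0, 0.0]]
out = attend(q, ks, vs)  # leans toward vs[0] because q . ks[0] is larger
```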

The word **causal** is critical for language modeling. During training, the model should predict the next token using only past tokens; it must not 'cheat' by looking at future tokens. Batched transformer implementations enforce this with a **causal mask**: an upper-triangular matrix of negative-infinity values added to the attention scores before the softmax. After the softmax, the masked positions become zero, meaning each token assigns no attention weight to any position that comes after it. MicroGPT obtains the same guarantee without an explicit mask: it feeds tokens through the model one at a time and appends each key and value to a per-layer cache, so a query can only ever attend over the current and earlier positions. Either way, this causality constraint is what transforms generic self-attention into the autoregressive attention needed for text generation.
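A toy sketch of explicit masking (a hypothetical standalone example, not code from microgpt.py): scores above the diagonal are set to negative infinity, so the softmax drives their weights to exactly zero.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]  # math.exp(-inf) == 0.0
    s = sum(exps)
    return [e / s for e in exps]

T = 4
# toy attention scores for a sequence of T tokens (all equal here)
scores = [[0.0] * T for _ in range(T)]

# causal mask: position t may only look at positions u <= t,
# so every entry above the diagonal becomes -inf before the softmax
for t in range(T):
    for u in range(t + 1, T):
        scores[t][u] = float('-inf')

weights = [softmax(row) for row in scores]
# row t is uniform over positions 0..t and exactly zero afterwards
```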

**Multi-head** attention runs several attention operations in parallel with different learned projections, then concatenates their outputs. Each head can specialize in a different type of relationship — one head might track subject-verb agreement, another might follow coreference chains. The number of heads is a hyperparameter; MicroGPT keeps it small but retains this multi-head structure because it is fundamental to the GPT architecture.
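The bookkeeping for splitting one wide projection into per-head slices and concatenating the head outputs back together is plain index arithmetic, the same `hs = h * head_dim` pattern MicroGPT uses. A minimal standalone sketch with toy sizes (the pass-through "heads" stand in for real per-head attention):

```python
n_head, head_dim = 2, 3
n_embd = n_head * head_dim

# a single token's projected query vector of size n_embd
q = [float(i) for i in range(n_embd)]  # [0.0, 1.0, ..., 5.0]

# split into per-head slices: head h owns q[h*head_dim : (h+1)*head_dim]
heads = [q[h * head_dim:(h + 1) * head_dim] for h in range(n_head)]

# each head would run attention independently on its slice;
# here the slices pass through unchanged to show the concatenation
out = []
for head_out in heads:
    out.extend(head_out)  # concatenation restores an n_embd-sized vector
```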

microgpt.py — softmax, rmsnorm, and the gpt forward pass (multi-head attention and MLP blocks):

```python
def softmax(logits):
    # numerically stable softmax over autograd Value objects
    # (the excerpt began mid-function; these opening lines are reconstructed)
    max_logit = max(v.data for v in logits)
    exps = [(v - max_logit).exp() for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rmsnorm(x):
    ms = sum(xi * xi for xi in x) / len(x)
    scale = (ms + 1e-5) ** -0.5
    return [xi * scale for xi in x]

def gpt(token_id, pos_id, keys, values):
    tok_emb = state_dict['wte'][token_id]  # token embedding
    pos_emb = state_dict['wpe'][pos_id]  # position embedding
    x = [t + p for t, p in zip(tok_emb, pos_emb)]  # joint token and position embedding
    x = rmsnorm(x)  # note: not redundant due to backward pass via the residual connection

    for li in range(n_layer):
        # 1) Multi-head Attention block
        x_residual = x
        x = rmsnorm(x)
        q = linear(x, state_dict[f'layer{li}.attn_wq'])
        k = linear(x, state_dict[f'layer{li}.attn_wk'])
        v = linear(x, state_dict[f'layer{li}.attn_wv'])
        keys[li].append(k)
        values[li].append(v)
        x_attn = []
        for h in range(n_head):
            hs = h * head_dim
            q_h = q[hs:hs+head_dim]
            k_h = [ki[hs:hs+head_dim] for ki in keys[li]]
            v_h = [vi[hs:hs+head_dim] for vi in values[li]]
            attn_logits = [sum(q_h[j] * k_h[t][j] for j in range(head_dim)) / head_dim**0.5 for t in range(len(k_h))]
            attn_weights = softmax(attn_logits)
            head_out = [sum(attn_weights[t] * v_h[t][j] for t in range(len(v_h))) for j in range(head_dim)]
            x_attn.extend(head_out)
        x = linear(x_attn, state_dict[f'layer{li}.attn_wo'])
        x = [a + b for a, b in zip(x, x_residual)]
        # 2) MLP block
        x_residual = x
        x = rmsnorm(x)
        x = linear(x, state_dict[f'layer{li}.mlp_fc1'])
        x = [xi.relu() for xi in x]
        x = linear(x, state_dict[f'layer{li}.mlp_fc2'])
        x = [a + b for a, b in zip(x, x_residual)]

    logits = linear(x, state_dict['lm_head'])
    return logits

# Let there be Adam, the blessed optimizer and its buffers
learning_rate, beta1, beta2, eps_adam = 0.01, 0.85, 0.99, 1e-8
m = [0.0] * len(params)  # first moment buffer
v = [0.0] * len(params)  # second moment buffer

# Repeat in sequence
num_steps = 1000  # number of training steps
for step in range(num_steps):

    # Take single document, tokenize it, surround it with BOS special token on both sides
    doc = docs[step % len(docs)]
    tokens = [BOS] + [uchars.index(ch) for ch in doc] + [BOS]
    n = min(block_size, len(tokens) - 1)

    # Forward the token sequence through the model, building up the computation graph all the way to the loss
    keys, values = [[] for _ in range(n_layer)], [[] for _ in range(n_layer)]
    losses = []
    for pos_id in range(n):
        token_id, target_id = tokens[pos_id], tokens[pos_id + 1]
        logits = gpt(token_id, pos_id, keys, values)
        probs = softmax(logits)
        loss_t = -probs[target_id].log()
        losses.append(loss_t)
    loss = (1 / n) * sum(losses)  # final average loss over the document sequence. May yours be low.
```

The Feed-Forward Network: Per-Token Nonlinear Transformation

After multi-head attention allows tokens to communicate with each other, a **feed-forward network** (FFN) is applied independently to each token's representation. This is a simple two-layer MLP: a linear projection that expands the embedding dimension by a factor of 4, a nonlinear activation (MicroGPT uses ReLU; GELU is common in larger models), and a second linear projection that contracts back to the original embedding dimension.

Why expand and contract? The expansion creates a larger intermediate space where the model can perform more complex per-token computations. The factor of 4 is not theoretically derived — it is an empirical finding from the original 'Attention Is All You Need' paper that has been widely replicated. Think of the attention layer as the 'communication' step (tokens share information) and the FFN as the 'computation' step (each token thinks about what it learned).

Because the FFN is applied to each position independently and identically (the same weights for every position), it adds no additional positional bias. All position-sensitive computation happens in the attention layer. This separation of concerns — attention for routing information, FFN for processing it — is one of the elegant design properties of the transformer.
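The whole FFN fits in a few lines. The sketch below is illustrative, with toy weights rather than MicroGPT's actual parameters; the `linear` helper treats a weight matrix as a list of rows, one dot product per output element.

```python
def linear(x, w):
    # w is a list of rows: output[i] = dot(w[i], x)
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def ffn(x, w1, w2):
    # expand to 4x the embedding dimension, apply ReLU, contract back
    h = linear(x, w1)                # n_embd -> 4 * n_embd
    h = [max(0.0, hi) for hi in h]   # ReLU nonlinearity
    return linear(h, w2)             # 4 * n_embd -> n_embd

# toy weights: n_embd = 2, hidden dimension = 8, every weight 0.1
n_embd = 2
w1 = [[0.1] * n_embd for _ in range(4 * n_embd)]
w2 = [[0.1] * (4 * n_embd) for _ in range(n_embd)]

y = ffn([1.0, 0.5], w1, w2)  # output has the original embedding dimension
```

Note the shape contract: whatever happens in the expanded hidden space, the output dimension always matches the input dimension, which is what lets the residual connection add the FFN output back onto its input.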

microgpt.py — backward pass, Adam update, and inference sampling (continuing the training loop):

```python
    # Backward the loss, calculating the gradients with respect to all model parameters
    loss.backward()

    # Adam optimizer update: update the model parameters based on the corresponding gradients
    lr_t = learning_rate * (1 - step / num_steps)  # linear learning rate decay
    for i, p in enumerate(params):
        m[i] = beta1 * m[i] + (1 - beta1) * p.grad
        v[i] = beta2 * v[i] + (1 - beta2) * p.grad ** 2
        m_hat = m[i] / (1 - beta1 ** (step + 1))
        v_hat = v[i] / (1 - beta2 ** (step + 1))
        p.data -= lr_t * m_hat / (v_hat ** 0.5 + eps_adam)
        p.grad = 0

    print(f"step {step+1:4d} / {num_steps:4d} | loss {loss.data:.4f}", end='\r')

# Inference: may the model babble back to us
temperature = 0.5  # in (0, 1], control the "creativity" of generated text, low to high
print("\n--- inference (new, hallucinated names) ---")
for sample_idx in range(20):
    keys, values = [[] for _ in range(n_layer)], [[] for _ in range(n_layer)]
    token_id = BOS
    sample = []
    for pos_id in range(block_size):
        logits = gpt(token_id, pos_id, keys, values)
        probs = softmax([l / temperature for l in logits])
        token_id = random.choices(range(vocab_size), weights=[p.data for p in probs])[0]
        if token_id == BOS:
            break
        sample.append(uchars[token_id])
    print(f"sample {sample_idx+1:2d}: {''.join(sample)}")
```

Layer Normalization and Residual Connections: Stability at Scale

Two structural elements hold the transformer block together: **residual connections** and **layer normalization**. A residual connection adds the block's input directly to its output (`x = x + sublayer(x)`). This creates a gradient highway during backpropagation — gradients can flow directly from the loss to early layers without passing through all the multiplicative operations in the attention and FFN layers. Without residuals, deep transformers are extremely difficult to train.

**Layer normalization** normalizes the activations across the embedding dimension (not the batch dimension, unlike batch normalization). MicroGPT actually uses **RMSNorm**, a simplified variant that rescales each vector by its root mean square and omits the mean subtraction and bias terms of full layer normalization; its stabilizing role is the same. The normalization is applied before each sublayer, and this 'pre-norm' arrangement, rather than the 'post-norm' arrangement used in the original transformer paper, has been found to produce more stable training in practice and is the convention followed by GPT-2 and later models.
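The pre-norm pattern can be sketched in a few lines, reusing the same `rmsnorm` that microgpt.py defines. The `block` helper and the identity sublayers below are illustrative, not code from the file: they show the shape of "normalize, apply sublayer, add back onto the residual stream."

```python
def rmsnorm(x, eps=1e-5):
    # RMSNorm: scale by the root mean square; no mean subtraction, no bias
    ms = sum(xi * xi for xi in x) / len(x)
    scale = (ms + eps) ** -0.5
    return [xi * scale for xi in x]

def block(x, attn, ffn):
    # pre-norm transformer block: normalize *before* each sublayer,
    # then add the sublayer output back onto the untouched residual stream
    x = [a + b for a, b in zip(x, attn(rmsnorm(x)))]
    x = [a + b for a, b in zip(x, ffn(rmsnorm(x)))]
    return x

# with identity sublayers, the residual stream simply accumulates
# the normalized input; the input itself passes through unchanged
identity = lambda v: list(v)
y = block([3.0, 4.0], identity, identity)
```

The key property is that the residual path from input to output is an unbroken sum: even if the sublayers contributed nothing useful, the input would survive intact, which is exactly the gradient highway described above.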

Together, residuals and normalization make it possible to stack many transformer blocks without training instability. MicroGPT keeps the number of layers small, but the same mechanism scales to dozens or hundreds of layers. MicroGPT defines no separate `Block` class; the body of the `for li in range(n_layer)` loop inside `gpt`, one attention sublayer and one MLP sublayer wrapped in these stabilization structures, is the fundamental repeating unit of every GPT-class model.
