A model architecture is only half the picture — the training loop is what tunes the model's parameters so that it produces useful outputs. This chapter covers MicroGPT's data loading strategy, the optimization step, and the practical decisions (batch size, learning rate, evaluation intervals) that make training tractable. Understanding this loop demystifies how the model goes from random weights to coherent text generation.
MicroGPT's data loader is intentionally simple. The entire training corpus is loaded into memory as a single flat tensor of token IDs. To create a training batch, the loader randomly samples `batch_size` start positions, then extracts a window of `block_size` consecutive tokens starting at each position. The input sequence `x` is tokens at positions `[i : i + block_size]` and the target sequence `y` is tokens at positions `[i+1 : i + block_size + 1]` — shifted by one, because predicting the next token is exactly the language modeling objective.
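The sampling scheme can be sketched in a few lines. The function name `get_batch` and its parameters are illustrative, not MicroGPT's exact API:

```python
import torch

def get_batch(data: torch.Tensor, batch_size: int, block_size: int):
    """Sample a batch of (input, target) windows from a flat tensor of token IDs."""
    # Random start positions, leaving room for the one-token-shifted target.
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i : i + block_size] for i in ix])
    y = torch.stack([data[i + 1 : i + block_size + 1] for i in ix])
    return x, y
```

Note the upper bound on `torch.randint`: start positions are drawn from `[0, len(data) - block_size)` so that the target window `[i+1 : i + block_size + 1]` never runs off the end of the corpus.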
This random window sampling has an important property: every training step sees a different subset of the corpus in a different arrangement, which acts as a strong form of data augmentation and prevents the model from memorizing the exact order of examples. It also means the training loop is stateless with respect to data — there is no epoch counter or dataset iterator to manage.
The simplicity here is deliberate. Production training pipelines add shuffling, multi-process data loading, and streaming for large corpora. MicroGPT omits all of that to keep the training loop readable in a single sitting.
Each training step follows the standard PyTorch pattern: compute logits via the forward pass, compute cross-entropy loss against targets, call `loss.backward()` to compute gradients through the entire computation graph, call `optimizer.step()` to update all parameters in the direction that reduces loss, and call `optimizer.zero_grad()` to clear gradients before the next step.
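A minimal sketch of that step, assuming a model that maps token IDs to logits of shape `(batch, time, vocab)`; the name `train_step` and the explicit cross-entropy call are illustrative, not MicroGPT's exact code:

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, x, y):
    logits = model(x)                       # (B, T, vocab_size)
    # Cross-entropy expects (N, C) logits and (N,) targets, so flatten
    # the batch and time dimensions together.
    loss = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        y.view(-1),
    )
    loss.backward()        # compute gradients through the whole graph
    optimizer.step()       # update parameters to reduce the loss
    optimizer.zero_grad()  # clear gradients before the next step
    return loss.item()
```

Forgetting `zero_grad()` is a classic bug: PyTorch accumulates gradients across `backward()` calls, so stale gradients from earlier steps would silently contaminate every update.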
MicroGPT uses the **AdamW optimizer**, which is Adam with decoupled weight decay. Adam adapts the learning rate for each parameter individually based on the history of its gradients, which makes it far more robust to hyperparameter choices than plain SGD. Weight decay shrinks every parameter slightly toward zero at each step, which regularizes the model and prevents weights from growing without bound. The 'W' in AdamW signals that this decay is applied directly to the weights, decoupled from the adaptive moment estimates — a subtle but important fix over the original Adam formulation, where the decay term was folded into the gradient and rescaled by the per-parameter adaptive learning rate.
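The decoupling is easy to see in isolation. In the sketch below (hyperparameter values are illustrative), the parameter has a zero gradient, so Adam's adaptive update contributes nothing — yet the decoupled decay still shrinks the weight by a factor of `(1 - lr * weight_decay)` on each step:

```python
import torch

# A parameter with no gradient signal at all.
p = torch.nn.Parameter(torch.ones(3))
opt = torch.optim.AdamW([p], lr=0.1, weight_decay=0.5)

p.grad = torch.zeros(3)  # zero gradient: Adam's moment-based step is zero
opt.step()

# Only the decoupled decay acts: p is multiplied by (1 - 0.1 * 0.5) = 0.95.
```

With the original Adam formulation, the decay would have been added to the (zero) gradient and then rescaled by the adaptive denominator, entangling regularization strength with gradient history.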
Periodically during training (controlled by an `eval_interval` hyperparameter), the loop pauses to estimate the loss on both the training split and a held-out validation split. This is purely diagnostic — it tells you whether the model is learning and whether it is overfitting. MicroGPT averages the loss over several batches for a stable estimate rather than relying on the noisy single-batch loss from the last training step.
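The evaluation pass can be sketched as below. The names `estimate_loss` and `get_batch`, and the dict-of-splits interface, are assumptions for illustration; the `no_grad` decorator and the `eval()`/`train()` mode toggles are the parts that matter:

```python
import torch
import torch.nn.functional as F

def get_batch(data, batch_size, block_size):
    # Same random-window sampler as in training.
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i : i + block_size] for i in ix])
    y = torch.stack([data[i + 1 : i + block_size + 1] for i in ix])
    return x, y

@torch.no_grad()  # no gradients needed for a diagnostic pass
def estimate_loss(model, splits, eval_iters, batch_size, block_size):
    model.eval()  # disable dropout etc. for a clean estimate
    out = {}
    for name, data in splits.items():
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            x, y = get_batch(data, batch_size, block_size)
            logits = model(x)
            losses[k] = F.cross_entropy(
                logits.view(-1, logits.size(-1)), y.view(-1)
            )
        out[name] = losses.mean().item()  # average over batches for stability
    model.train()  # restore training mode before resuming the loop
    return out
```

A widening gap between the `train` and `val` numbers returned here is the overfitting signal the text describes.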