With embeddings and transformer blocks defined, this chapter examines the top-level GPT class that assembles these components into a complete model. You will trace the full forward pass from a batch of token IDs to a batch of logits (or a loss value), and understand the final language modeling head that projects transformer outputs back into vocabulary space. This is where all the pieces come together.
The `GPT` class (or equivalent top-level model class in MicroGPT) holds three categories of parameters: the token embedding table, the positional embedding table, and a sequence of `Block` instances (the transformer layers). The number of blocks, the embedding dimension, the number of attention heads, and the context length are all passed as hyperparameters at construction time — nothing is hardcoded into the class body.
This parameterization is important: by changing a handful of integers, you can make MicroGPT as small as a toy (useful for understanding) or as large as your hardware permits (useful for experimentation). The architecture scales smoothly because the `Block` class is fully self-contained and the GPT class simply stacks N of them in a `nn.Sequential` or equivalent container.
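To make the parameterization concrete, here is a minimal sketch of such a constructor. The field names (`token_emb`, `pos_emb`, `ln_f`, `lm_head`) are illustrative, not necessarily MicroGPT's actual names, and a trivial stand-in is used in place of the real `Block` so the sketch runs on its own:

```python
import torch.nn as nn

class GPT(nn.Module):
    """Sketch of a top-level GPT class; names and layout are illustrative."""
    def __init__(self, vocab_size, context_len, embed_dim, num_heads, num_blocks):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, embed_dim)  # token embedding table
        self.pos_emb = nn.Embedding(context_len, embed_dim)   # positional embedding table
        # Stand-in for the real Block stack; num_heads would be forwarded
        # to each Block (attention + MLP) in the actual model.
        self.blocks = nn.Sequential(
            *[nn.Sequential(nn.LayerNorm(embed_dim)) for _ in range(num_blocks)]
        )
        self.ln_f = nn.LayerNorm(embed_dim)                   # final layer norm
        self.lm_head = nn.Linear(embed_dim, vocab_size)       # language modeling head

# Scaling is just a matter of changing integers at construction time:
tiny = GPT(vocab_size=256, context_len=32,  embed_dim=16,  num_heads=2, num_blocks=2)
big  = GPT(vocab_size=256, context_len=256, embed_dim=128, num_heads=8, num_blocks=6)
count = lambda m: sum(p.numel() for p in m.parameters())
```

Because nothing is hardcoded, `tiny` and `big` are instances of the same class differing only in these hyperparameters.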
A final layer normalization is applied after the last block, before the language modeling head. This completes the pre-norm convention used inside the blocks: because each block normalizes its inputs rather than its outputs, the representations leaving the last block would otherwise reach the head unnormalized, so one last normalization is applied before projecting to vocabulary logits.
The `forward` method of the GPT class orchestrates the complete computation. It accepts a tensor of token IDs of shape `[batch, sequence_length]` and optionally a tensor of target IDs for computing training loss. The flow is: look up token embeddings → look up positional embeddings → add them → pass through each transformer block in sequence → apply final layer norm → project to logits via the language modeling head.
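That flow can be sketched as a standalone function (in the real model it would be a `forward` method). The attribute names and the `SimpleNamespace` stand-in model are assumptions for illustration; shape comments track each step:

```python
import torch
import torch.nn as nn
from types import SimpleNamespace

def gpt_forward(model, idx, targets=None):
    """Sketch of the GPT forward flow; `model` is assumed to expose the fields below."""
    B, T = idx.shape
    tok = model.token_emb(idx)            # [B, T, embed_dim] token embeddings
    pos = model.pos_emb(torch.arange(T))  # [T, embed_dim] positional embeddings
    x = tok + pos                         # broadcast add -> [B, T, embed_dim]
    x = model.blocks(x)                   # through every transformer block in sequence
    x = model.ln_f(x)                     # final layer norm
    logits = model.lm_head(x)             # [B, T, vocab_size]
    if targets is None:
        return logits
    loss = nn.functional.cross_entropy(
        logits.view(B * T, -1), targets.view(B * T)
    )
    return logits, loss

# Minimal stand-in model (Identity in place of the real Block stack):
demo = SimpleNamespace(
    token_emb=nn.Embedding(10, 8),
    pos_emb=nn.Embedding(16, 8),
    blocks=nn.Identity(),
    ln_f=nn.LayerNorm(8),
    lm_head=nn.Linear(8, 10),
)
idx = torch.randint(0, 10, (2, 5))  # [batch=2, sequence_length=5]
logits = gpt_forward(demo, idx)     # [2, 5, 10]
```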
The **language modeling head** is a single linear layer that projects from `embedding_dim` to `vocab_size`, producing one logit per vocabulary entry per sequence position. These logits represent the model's unnormalized confidence that each vocabulary token is the correct next token. During training, a cross-entropy loss compares these logits against the actual next tokens (the targets). During inference, a softmax converts logits to probabilities, from which the next token is sampled.
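The inference side of that last step is small enough to show in full. With a hypothetical logit vector for a five-token vocabulary, softmax produces a probability distribution and `torch.multinomial` samples the next token from it:

```python
import torch

# Hypothetical logits for one position over a 5-token vocabulary.
logits = torch.tensor([2.0, 0.5, -1.0, 0.0, 1.0])
probs = torch.softmax(logits, dim=-1)                 # normalize to probabilities
next_token = torch.multinomial(probs, num_samples=1)  # sample one token ID
```

Sampling from `probs` (rather than always taking the argmax) is what gives generation its variety; higher-logit tokens are simply more likely to be drawn.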
Notice that a single forward pass produces logits for every position in the sequence simultaneously — this is what makes transformer training efficient. The causal mask ensures each position attends only to prior positions, so no prediction can peek at its own answer; as a result, rather than making N sequential predictions for a sequence of length N, the model learns from all N next-token prediction problems in one parallel forward pass.
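In training code this parallelism shows up as a single cross-entropy call over all positions at once: the `[batch, T, vocab]` logits and `[batch, T]` targets are flattened so every position contributes to one loss. A sketch with illustrative sizes and random tensors standing in for real model outputs:

```python
import torch
import torch.nn.functional as F

B, T, V = 2, 8, 50  # batch, sequence length, vocab size (illustrative)

logits = torch.randn(B, T, V)          # one logit vector per position, all at once
targets = torch.randint(0, V, (B, T))  # the actual next token at every position

# All B*T next-token prediction problems collapse into one scalar loss:
loss = F.cross_entropy(logits.view(B * T, V), targets.view(B * T))
```

With random logits the loss sits near `ln(V)`, the value expected from a uniform guess; training drives it below that baseline.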