The final chapter covers inference — how a trained MicroGPT model generates new text. You will learn the autoregressive generation loop, the role of the temperature hyperparameter in controlling output randomness, and how top-k sampling provides a practical balance between diversity and coherence. This chapter closes the loop from the tokenizer (Chapter 2) through the model (Chapters 3–5) back to human-readable text.
Text generation with a GPT model is **autoregressive**: the model generates one token at a time, and each newly generated token is appended to the input context before generating the next one. The `generate` method implements this loop. It accepts an initial context (a tensor of token IDs representing a prompt), runs it through the model's forward pass, extracts the logits for the *last* position only (since that position predicts the next token), samples a token from those logits, appends it to the context, and repeats.
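The loop can be sketched as follows. This is a minimal illustration, not MicroGPT's exact code: it assumes a `model` callable that maps a `(B, T)` tensor of token IDs to `(B, T, vocab_size)` logits, and the argument names (`idx`, `max_new_tokens`, `block_size`) are illustrative.

```python
import torch

@torch.no_grad()
def generate(model, idx, max_new_tokens, block_size, temperature=1.0, top_k=None):
    """Autoregressively sample max_new_tokens token IDs, extending idx."""
    for _ in range(max_new_tokens):
        # crop the context to the last block_size tokens
        idx_cond = idx if idx.size(1) <= block_size else idx[:, -block_size:]
        logits = model(idx_cond)                 # (B, T, vocab_size)
        logits = logits[:, -1, :] / temperature  # last position predicts the next token
        if top_k is not None:
            # set everything below the k-th largest logit to -inf
            v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
            logits[logits < v[:, [-1]]] = -float('inf')
        probs = torch.softmax(logits, dim=-1)
        idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1)
        idx = torch.cat((idx, idx_next), dim=1)  # append and repeat
    return idx
```

Note that the returned tensor contains the original prompt followed by the newly sampled tokens; each iteration runs one full forward pass.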
This token-by-token generation is inherently sequential and cannot be parallelized across the time dimension — generating a sequence of 100 tokens requires 100 separate forward passes. This is the fundamental inference bottleneck of all autoregressive language models, and it is why inference optimization (KV-caching, speculative decoding, etc.) is such an active research area. MicroGPT omits these optimizations to keep the generation code maximally readable.
Notice the context cropping step inside the loop: if the accumulated context grows longer than `block_size`, it is cropped to the last `block_size` tokens before each forward pass. This is necessary because the positional embedding table only has entries for positions 0 through `block_size - 1`. Cropping ensures the model always sees a valid-length input, at the cost of 'forgetting' tokens that fall outside its context window.
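The cropping itself is a single slice. A small sketch with made-up numbers:

```python
import torch

block_size = 8
idx = torch.arange(12).unsqueeze(0)  # a context of 12 token IDs, longer than block_size
idx_cond = idx[:, -block_size:]      # keep only the most recent block_size tokens
# every position in idx_cond now maps to a valid positional embedding (0..block_size-1)
print(idx_cond.tolist())
```

The four oldest tokens are dropped; the model has no memory of them on subsequent steps.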
The logits produced by the model are scaled by a **temperature** parameter before the softmax converts them to probabilities. A temperature of 1.0 leaves the distribution unchanged. A temperature below 1.0 sharpens the distribution — high-probability tokens become even more likely, and the model produces more predictable, repetitive output. A temperature above 1.0 flattens the distribution — the model takes more risks and produces more surprising (but potentially less coherent) output. Temperature is the primary dial for controlling the creativity-vs-coherence tradeoff.
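The effect is easy to see on a toy distribution. A sketch with three hypothetical logits:

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.0])  # raw scores for a 3-token vocabulary

def probs_at(temperature):
    # scale the logits, then softmax them into a probability distribution
    return torch.softmax(logits / temperature, dim=-1)

# lower temperature concentrates mass on the top token;
# higher temperature spreads it out toward uniform
for t in (0.5, 1.0, 2.0):
    print(f"T={t}: {probs_at(t).tolist()}")
```

Dividing by a temperature below 1.0 widens the gaps between logits before the softmax, which is exactly why the distribution sharpens; dividing by a value above 1.0 shrinks the gaps and flattens it.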
**Top-k sampling** further constrains the distribution by zeroing the probability of every token outside the k most likely before sampling. This prevents the model from ever generating very unlikely tokens that would read as random noise, regardless of temperature. After the low-probability tokens are zeroed out, the remaining probabilities are renormalized and a single token is sampled from the pruned distribution using `torch.multinomial`.
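In practice the zeroing is done in logit space: anything below the k-th largest logit is set to negative infinity, and the subsequent softmax both zeroes those entries and renormalizes the rest in one step. A standalone sketch (the helper name `top_k_filter` is illustrative):

```python
import torch

def top_k_filter(logits, k):
    # keep only the k largest logits; set the rest to -inf so their
    # softmax probability becomes exactly zero
    v, _ = torch.topk(logits, k)
    out = logits.clone()
    out[out < v[..., [-1]]] = -float('inf')
    return out

logits = torch.tensor([[1.0, 5.0, 3.0, 2.0]])
probs = torch.softmax(top_k_filter(logits, k=2), dim=-1)
# only the two most likely tokens (IDs 1 and 2) can now be sampled
next_id = torch.multinomial(probs, num_samples=1)
```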
Together, temperature and top-k sampling are the standard inference-time controls used by GPT-style models. They are post-processing steps applied to the raw model logits — the trained model weights are unchanged by these choices. This separation means you can experiment with different sampling strategies without retraining, which is a useful property for anyone using MicroGPT as an experimentation platform. After sampling, the selected token ID is passed through the `decode` function from Chapter 2 to recover the corresponding character, completing the full pipeline from model output back to human-readable text.
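The final decoding step needs no model at all — it is just the inverse vocabulary lookup. A sketch using a hypothetical character vocabulary (the `decode` name mirrors Chapter 2; the vocabulary here is invented for illustration):

```python
# hypothetical char-level vocabulary built the way Chapter 2 builds one
chars = sorted(set("hello world"))
itos = {i: ch for i, ch in enumerate(chars)}  # int -> character

def decode(ids):
    # map each sampled token ID back to its character and join them
    return ''.join(itos[i] for i in ids)

sampled_ids = [3, 2, 4, 4, 5]  # IDs as produced by torch.multinomial, one per step
print(decode(sampled_ids))     # prints "hello"
```

With this, the loop is closed: characters in, token IDs through the model, sampled IDs back out as characters.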