Before diving into code, this chapter orients you to the philosophy and structure of MicroGPT. You will learn why this codebase is intentionally minimalist, how it is organized top-to-bottom as a linear narrative, and what mental model to carry into the subsequent chapters. Understanding the design intent upfront prevents confusion about what is 'missing' — nothing is missing; everything is a deliberate choice for clarity.
MicroGPT makes an unusual architectural decision for a machine learning project: it places every component — tokenizer, model layers, training loop, and inference — in a single Python file. This is not laziness or poor engineering; it is a pedagogical statement. When code lives in one file, a reader can trace data from raw text through every transformation to predicted tokens without ever switching files or hunting down imports.
In larger frameworks like Hugging Face Transformers or even Karpathy's nanoGPT, the same concepts are spread across dozens of files organized by concern. That structure is excellent for production use but creates a 'which file do I read first?' problem for learners. MicroGPT eliminates that problem entirely — you start at line 1 and read forward.
As you read, resist the urge to jump around. The file is ordered deliberately: data handling utilities appear before the model that consumes them, and model components appear before the training loop that orchestrates them. This linear dependency order means every concept you encounter has been prepared for by what came before it.
A GPT-style model transforms a sequence of text tokens into a probability distribution over the next possible token, then samples from that distribution. Training teaches the model to make accurate predictions; inference uses those predictions to generate new text autoregressively — one token at a time, feeding each prediction back as the next input.
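The autoregressive loop described above can be sketched in a few lines. This is a toy illustration, not MicroGPT's actual generation routine: `model_probs` stands in for a trained model and simply returns a fixed distribution over a three-token vocabulary, but the feed-the-prediction-back structure is the same.

```python
import random

def generate(model_probs, prompt, n_new_tokens, seed=0):
    """Autoregressively extend `prompt` (a list of token ids).

    `model_probs(tokens)` returns a probability distribution over the
    next token; each sampled token is appended and fed back as input.
    """
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(n_new_tokens):
        probs = model_probs(tokens)  # P(next token | context so far)
        # sample one token id from the predicted distribution
        next_id = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        tokens.append(next_id)       # the prediction becomes the next input
    return tokens

# A stand-in "model" over a 3-token vocabulary that always favors token 2.
toy_probs = lambda toks: [0.1, 0.1, 0.8]
print(generate(toy_probs, [0], 4))  # prompt [0] plus 4 sampled tokens
```

Note that sampling, rather than always taking the most likely token, is what makes generated text varied; the real model replaces `toy_probs` with a full forward pass through the network.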
MicroGPT implements this pipeline in five logical stages that map directly to sections of the file: (1) **tokenization** converts raw text strings into integer sequences and back; (2) **positional and token embeddings** lift those integers into continuous vector space; (3) **transformer blocks** — stacked layers of self-attention and feed-forward networks — refine those vectors into context-aware representations; (4) the **GPT model class** assembles these blocks into a complete forward pass; and (5) the **training loop and generation routine** drive learning and produce new text.
Keep this five-stage mental model in mind as you read. When you encounter a function or class, ask yourself: which stage does this belong to? That question will anchor every detail in a larger purpose.