Table of Contents

1. Orientation: One File, One Model, One Pipeline
2. Tokenization: From Text to Integers and Back
3. Embeddings and Positional Encoding: Giving Tokens Meaning and Position
4. The Transformer Block: Attention, Feed-Forward, and Layer Normalization
5. The GPT Model Class: Assembling the Full Forward Pass
6. Training Loop: Optimization and the Learning Process
7. Text Generation: Autoregressive Inference and Sampling

Orientation: One File, One Model, One Pipeline

Before diving into code, this chapter orients you to the philosophy and structure of MicroGPT. You will learn why this codebase is intentionally minimalist, how it is organized top-to-bottom as a linear narrative, and what mental model to carry into the subsequent chapters. Understanding the design intent upfront prevents confusion about what is 'missing' — nothing is missing; everything is a deliberate choice for clarity.

Why a Single File?

MicroGPT makes an unusual architectural decision for a machine learning project: it places every component — tokenizer, model layers, training loop, and inference — in a single Python file. This is not laziness or poor engineering; it is a pedagogical statement. When code lives in one file, a reader can trace data from raw text through every transformation to predicted tokens without ever switching files or hunting down imports.

In larger frameworks like Hugging Face Transformers or even Karpathy's nanoGPT, the same concepts are spread across dozens of files organized by concern. That structure is excellent for production use but creates a 'which file do I read first?' problem for learners. MicroGPT eliminates that problem entirely — you start at line 1 and read forward.

As you read, resist the urge to jump around. The file is ordered deliberately: data handling utilities appear before the model that consumes them, and model components appear before the training loop that orchestrates them. This linear dependency order means every concept you encounter has been prepared for by what came before it.

microgpt.py — module header and imports

```python
"""
The most atomic way to train and run inference for a GPT in pure, dependency-free Python.
This file is the complete algorithm.
Everything else is just efficiency.

@karpathy
"""

import os     # os.path.exists
import math   # math.log, math.exp
import random # random.seed, random.choices, random.gauss, random.shuffle
random.seed(42) # Let there be order among chaos

# Let there be a Dataset `docs`: list[str] of documents (e.g. a list of names)
if not os.path.exists('input.txt'):
    import urllib.request
    names_url = 'https://raw.githubusercontent.com/karpathy/makemore/988aa59/names.txt'
    urllib.request.urlretrieve(names_url, 'input.txt')
docs = [line.strip() for line in open('input.txt') if line.strip()]
random.shuffle(docs)
print(f"num docs: {len(docs)}")

# Let there be a Tokenizer to translate strings to sequences of integers ("tokens") and back
uchars = sorted(set(''.join(docs))) # unique characters in the dataset become token ids 0..n-1
BOS = len(uchars)                   # token id for a special Beginning of Sequence (BOS) token
vocab_size = len(uchars) + 1        # total number of unique tokens, +1 is for BOS
print(f"vocab size: {vocab_size}")

# Let there be Autograd to recursively apply the chain rule through a computation graph
class Value:
```
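The tokenizer built in those few lines is nothing more than a bidirectional character-to-integer mapping. The following sketch shows the idea on a toy dataset; the `stoi`, `itos`, `encode`, and `decode` names are illustrative helpers, not necessarily the ones the file itself uses:

```python
# Illustrative sketch of the character-level tokenizer idea above.
# `stoi`, `itos`, `encode`, and `decode` are hypothetical names.
docs = ["emma", "olivia"]            # toy stand-in for the names dataset
uchars = sorted(set(''.join(docs)))  # unique characters become ids 0..n-1
BOS = len(uchars)                    # one extra id reserved for Beginning-of-Sequence
vocab_size = len(uchars) + 1

stoi = {ch: i for i, ch in enumerate(uchars)}  # character -> token id
itos = {i: ch for ch, i in stoi.items()}       # token id -> character

def encode(doc):
    return [stoi[ch] for ch in doc]

def decode(tokens):
    return ''.join(itos[t] for t in tokens)

tokens = encode("emma")
assert decode(tokens) == "emma"  # the mapping round-trips losslessly
```

Because every character in the dataset gets its own id, encoding and decoding are exact inverses; the only "special" token is BOS, which never appears in raw text.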

The Full Pipeline at a Glance

A GPT-style model transforms a sequence of text tokens into a probability distribution over the next possible token, then samples from that distribution. Training teaches the model to make accurate predictions; inference uses those predictions to generate new text autoregressively — one token at a time, feeding each prediction back as the next input.
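That feedback loop can be sketched in a few lines. Everything here is illustrative: `model` stands in for any function that maps a context to next-token probabilities (a uniform toy distribution below, not a trained network), and the tiny vocabulary is made up:

```python
import random
random.seed(0)

vocab = ['a', 'b', '<eos>']  # toy vocabulary with an end-of-sequence marker

def model(context):
    # Stand-in for a trained GPT: given the tokens so far, return a
    # probability distribution over the next token (uniform here).
    return [1 / len(vocab)] * len(vocab)

def generate(max_new_tokens=10):
    context = []
    for _ in range(max_new_tokens):
        probs = model(context)                                  # 1. predict a distribution
        token = random.choices(range(len(vocab)), weights=probs)[0]  # 2. sample from it
        if vocab[token] == '<eos>':
            break                                               # stop when the model says so
        context.append(token)                                   # 3. feed the sample back in
    return ''.join(vocab[t] for t in context)

print(generate())
```

The three numbered steps (predict, sample, append) are the whole of autoregressive generation; training only changes how good the distribution in step 1 is.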

MicroGPT implements this pipeline in five logical stages that map directly to sections of the file: (1) **tokenization** converts raw text strings into integer sequences and back; (2) **positional and token embeddings** lift those integers into continuous vector space; (3) **transformer blocks** — stacked layers of self-attention and feed-forward networks — refine those vectors into context-aware representations; (4) the **GPT model class** assembles these blocks into a complete forward pass; and (5) the **training loop and generation routine** drive learning and produce new text.

Keep this five-stage mental model in mind as you read. When you encounter a function or class, ask yourself: which stage does this belong to? That question will anchor every detail in a larger purpose.

microgpt.py — overall file structure

```python
"""
The most atomic way to train and run inference for a GPT in pure, dependency-free Python.
This file is the complete algorithm.
Everything else is just efficiency.

@karpathy
"""

import os     # os.path.exists
import math   # math.log, math.exp
import random # random.seed, random.choices, random.gauss, random.shuffle
random.seed(42) # Let there be order among chaos

# Let there be a Dataset `docs`: list[str] of documents (e.g. a list of names)
if not os.path.exists('input.txt'):
    import urllib.request
    names_url = 'https://raw.githubusercontent.com/karpathy/makemore/988aa59/names.txt'
    urllib.request.urlretrieve(names_url, 'input.txt')
docs = [line.strip() for line in open('input.txt') if line.strip()]
random.shuffle(docs)
print(f"num docs: {len(docs)}")

# Let there be a Tokenizer to translate strings to sequences of integers ("tokens") and back
uchars = sorted(set(''.join(docs))) # unique characters in the dataset become token ids 0..n-1
BOS = len(uchars)                   # token id for a special Beginning of Sequence (BOS) token
vocab_size = len(uchars) + 1        # total number of unique tokens, +1 is for BOS
print(f"vocab size: {vocab_size}")

# Let there be Autograd to recursively apply the chain rule through a computation graph
class Value:
    __slots__ = ('data', 'grad', '_children', '_local_grads') # Python optimization for memory usage

    def __init__(self, data, children=(), local_grads=()):
        self.data = data                # scalar value of this node calculated during forward pass
        self.grad = 0                   # derivative of the loss w.r.t. this node, calculated in backward pass
        self._children = children       # children of this node in the computation graph
        self._local_grads = local_grads # local derivative of this node w.r.t. its children

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1, 1))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def __pow__(self, other): return Value(self.data**other, (self,), (other * self.data**(other-1),))
    def log(self): return Value(math.log(self.data), (self,), (1/self.data,))
    def exp(self): return Value(math.exp(self.data), (self,), (math.exp(self.data),))
    def relu(self): return Value(max(0, self.data), (self,), (float(self.data > 0),))
```
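Notice what each `Value` operation records: its operands (`_children`) and the local derivative of the result with respect to each operand (`_local_grads`). The backward pass that consumes this bookkeeping appears later in the file, but you can already inspect the forward graph. The excerpt below reproduces just enough of the class to be self-contained:

```python
# Minimal excerpt of the Value node from microgpt.py, reproduced so
# this example runs on its own; only __add__ and __mul__ are included.
class Value:
    __slots__ = ('data', 'grad', '_children', '_local_grads')

    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0
        self._children = children
        self._local_grads = local_grads

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1, 1))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

a, b = Value(2.0), Value(3.0)
c = a * b
print(c.data)          # the forward result, 2.0 * 3.0 = 6.0
print(c._local_grads)  # (3.0, 2.0): dc/da = b.data, dc/db = a.data
```

The stored tuple `(b.data, a.data)` is exactly the product rule; backpropagation later multiplies these local derivatives along paths through the graph to apply the chain rule.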