Table of Contents

1. Orientation: One File, One Model, One Pipeline
2. Tokenization: From Text to Integers and Back
3. Embeddings and Positional Encoding: Giving Tokens Meaning and Position
4. The Transformer Block: Attention, Feed-Forward, and Layer Normalization
5. The GPT Model Class: Assembling the Full Forward Pass
6. Training Loop: Optimization and the Learning Process
7. Text Generation: Autoregressive Inference and Sampling
Chapter 2

Tokenization: From Text to Integers and Back

Every language model begins with a mapping between human-readable text and machine-readable integers. This chapter covers MicroGPT's tokenization layer — how a vocabulary is constructed from raw text, how strings are encoded into token ID sequences, and how those sequences are decoded back into text. Because MicroGPT uses a character-level tokenizer, this layer is simple enough to understand completely in minutes, which makes it an ideal starting point before encountering the more complex model machinery.

Character-Level Tokenization: The Simplest Possible Vocabulary

MicroGPT uses **character-level tokenization**, meaning each unique character in the training corpus becomes one vocabulary entry. This is the simplest tokenization scheme possible — simpler than byte-pair encoding (BPE) used by GPT-2/3/4, or WordPiece used by BERT. The vocabulary size equals the number of distinct characters in your dataset, typically 65–100 for English text.

The trade-off is explicit: character-level tokenization produces longer sequences (every word like 'hello' becomes five tokens instead of one), which means the model must learn to compose characters into words and words into meaning entirely through its attention mechanism. This makes the learning problem harder but keeps the tokenizer code trivially simple — no external library, no pre-trained vocabulary file, no byte-fallback logic.
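To make the length trade-off concrete, here is a small illustrative comparison (not code from microgpt.py); the word-level split is a naive whitespace tokenizer, included only for contrast:

```python
text = "hello world hello"

# Character-level: one token per character -> tiny vocabulary, long sequences.
char_vocab = sorted(set(text))
char_tokens = [char_vocab.index(ch) for ch in text]

# Naive word-level: one token per whitespace-separated word -> shorter sequences,
# but the vocabulary grows with every distinct word in the corpus.
word_vocab = sorted(set(text.split()))
word_tokens = [word_vocab.index(w) for w in text.split()]

print(len(char_vocab), len(char_tokens))  # 8 17 -- small vocab, long sequence
print(len(word_vocab), len(word_tokens))  # 2 3  -- larger units, short sequence
```

The same string costs 17 character-level tokens but only 3 word-level tokens; the model pays for the small vocabulary with longer sequences that attention must compose back into words.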

For an educational implementation, this is exactly the right call. You can fully understand the tokenizer in two minutes, leaving your full attention for the model architecture that matters. In a production system you would swap this for a BPE tokenizer, but the rest of the model code would not change.

microgpt.py — vocabulary construction

```python
"""
The most atomic way to train and run inference for a GPT in pure, dependency-free Python.
This file is the complete algorithm.
Everything else is just efficiency.

@karpathy
"""

import os # os.path.exists
import math # math.log, math.exp
import random # random.seed, random.choices, random.gauss, random.shuffle
random.seed(42) # Let there be order among chaos

# Let there be a Dataset `docs`: list[str] of documents (e.g. a list of names)
if not os.path.exists('input.txt'):
    import urllib.request
    names_url = 'https://raw.githubusercontent.com/karpathy/makemore/988aa59/names.txt'
    urllib.request.urlretrieve(names_url, 'input.txt')
docs = [line.strip() for line in open('input.txt') if line.strip()]
random.shuffle(docs)
print(f"num docs: {len(docs)}")

# Let there be a Tokenizer to translate strings to sequences of integers ("tokens") and back
uchars = sorted(set(''.join(docs))) # unique characters in the dataset become token ids 0..n-1
BOS = len(uchars) # token id for a special Beginning of Sequence (BOS) token
vocab_size = len(uchars) + 1 # total number of unique tokens, +1 is for BOS
print(f"vocab size: {vocab_size}")

# Let there be Autograd to recursively apply the chain rule through a computation graph
class Value:
    __slots__ = ('data', 'grad', '_children', '_local_grads') # Python optimization for memory usage

    def __init__(self, data, children=(), local_grads=()):
        self.data = data # scalar value of this node calculated during forward pass
        self.grad = 0 # derivative of the loss w.r.t. this node, calculated in backward pass
        self._children = children # children of this node in the computation graph
        self._local_grads = local_grads # local derivative of this node w.r.t. its children

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
```

Encode and Decode: The Two Essential Functions

The tokenizer exposes two core operations. **Encoding** maps a string to a list of integers using a character-to-index dictionary (`stoi` — string to integer). **Decoding** performs the reverse using an index-to-character dictionary (`itos` — integer to string). Both dictionaries are constructed once from the training corpus and reused throughout training and inference.

Notice that these are plain Python dictionaries, not learned parameters. The mapping is fixed before any model training begins. This is a fundamental property of all tokenizers — they are deterministic lookup tables, not neural networks. The neural network learns what to *do* with token IDs; the tokenizer only decides *which* ID represents each input unit.

When you later see the training loop feed integer tensors into the model, remember that those integers originated here. And when you see the generation routine convert model outputs back to text, it calls the decode function defined in this section. Trace those call sites as you read forward — they ground abstract tensor operations back in human-readable text.
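The two lookup tables described above can be sketched in a few lines. This is an illustrative reconstruction, not a verbatim excerpt from microgpt.py: the names `stoi`, `itos`, `encode`, and `decode` follow the prose, and the three-name `docs` list is a placeholder for the loaded dataset:

```python
docs = ["emma", "olivia", "ava"]  # placeholder for the real dataset

# Both tables are built once from the corpus and never change afterwards.
uchars = sorted(set(''.join(docs)))            # unique characters -> token ids 0..n-1
stoi = {ch: i for i, ch in enumerate(uchars)}  # string to integer
itos = {i: ch for i, ch in enumerate(uchars)}  # integer to string

def encode(s):
    """Map a string to a list of token ids via a plain dict lookup."""
    return [stoi[ch] for ch in s]

def decode(ids):
    """Map a list of token ids back to a string."""
    return ''.join(itos[i] for i in ids)

assert decode(encode("emma")) == "emma"  # encode and decode are exact inverses
```

Nothing here is learned: the tables are fixed, deterministic functions of the corpus, which is why a given string always encodes to the same token IDs.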

microgpt.py — encode and decode functions

```python
class Value:
    __slots__ = ('data', 'grad', '_children', '_local_grads') # Python optimization for memory usage

    def __init__(self, data, children=(), local_grads=()):
        self.data = data # scalar value of this node calculated during forward pass
        self.grad = 0 # derivative of the loss w.r.t. this node, calculated in backward pass
        self._children = children # children of this node in the computation graph
        self._local_grads = local_grads # local derivative of this node w.r.t. its children

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1, 1))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def __pow__(self, other): return Value(self.data**other, (self,), (other * self.data**(other-1),))
    def log(self): return Value(math.log(self.data), (self,), (1/self.data,))
    def exp(self): return Value(math.exp(self.data), (self,), (math.exp(self.data),))
    def relu(self): return Value(max(0, self.data), (self,), (float(self.data > 0),))
    def __neg__(self): return self * -1
    def __radd__(self, other): return self + other
    def __sub__(self, other): return self + (-other)
    def __rsub__(self, other): return other + (-self)
    def __rmul__(self, other): return self * other
    def __truediv__(self, other): return self * other**-1
    def __rtruediv__(self, other): return other * self**-1

    def backward(self):
        topo = []
```