Every language model begins with a mapping between human-readable text and machine-readable integers. This chapter covers MicroGPT's tokenization layer — how a vocabulary is constructed from raw text, how strings are encoded into token ID sequences, and how those sequences are decoded back into text. Because MicroGPT uses a character-level tokenizer, this layer is simple enough to understand completely in minutes, which makes it an ideal starting point before encountering the more complex model machinery.
MicroGPT uses **character-level tokenization**, meaning each unique character in the training corpus becomes one vocabulary entry. This is the simplest tokenization scheme possible — simpler than the byte-pair encoding (BPE) used by GPT-2/3/4 or the WordPiece scheme used by BERT. The vocabulary size equals the number of distinct characters in your dataset, typically 65–100 for English text.
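Building such a vocabulary takes only a couple of lines. A minimal sketch, using a toy stand-in string for the training corpus (the real corpus would be the full training text):

```python
# Sketch of character-level vocabulary construction.
corpus = "hello world"       # toy stand-in for the full training corpus
chars = sorted(set(corpus))  # distinct characters, sorted for a deterministic order
vocab_size = len(chars)

print(chars)       # [' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w']
print(vocab_size)  # 8
```

Sorting the character set is what makes the vocabulary reproducible: the same corpus always yields the same character-to-ID assignment.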
The trade-off is explicit: character-level tokenization produces longer sequences (every word like 'hello' becomes five tokens instead of one), which means the model must learn to compose characters into words and words into meaning entirely through its attention mechanism. This makes the learning problem harder but keeps the tokenizer code trivially simple — no external library, no pre-trained vocabulary file, no byte-fallback logic.
For an educational implementation, this is exactly the right call. You can fully understand the tokenizer in two minutes, leaving your full attention for the model architecture that matters. In a production system you would swap this for a BPE tokenizer, but the rest of the model code would not change.
The tokenizer exposes two core operations. **Encoding** maps a string to a list of integers using a character-to-index dictionary (`stoi` — string to integer). **Decoding** performs the reverse using an index-to-character dictionary (`itos` — integer to string). Both dictionaries are constructed once from the training corpus and reused throughout training and inference.
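The two operations can be sketched as follows. The `stoi` and `itos` names come from the text above; the toy corpus string is an assumption for illustration:

```python
corpus = "hello world"  # toy stand-in for the training corpus
chars = sorted(set(corpus))
stoi = {ch: i for i, ch in enumerate(chars)}  # string to integer
itos = {i: ch for i, ch in enumerate(chars)}  # integer to string

def encode(s: str) -> list[int]:
    """Map a string to a list of token IDs, one per character."""
    return [stoi[c] for c in s]

def decode(ids: list[int]) -> str:
    """Map a list of token IDs back to a string."""
    return "".join(itos[i] for i in ids)

ids = encode("hello")
print(ids)          # [3, 2, 4, 4, 5] with this toy vocabulary
print(decode(ids))  # 'hello'
```

Note that `encode("hello")` produces five IDs — one per character — which is exactly the sequence-length cost described earlier.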
Notice that these are plain Python dictionaries, not learned parameters. The mapping is fixed before any model training begins. This is a fundamental property of all tokenizers — they are deterministic lookup tables, not neural networks. The neural network learns what to *do* with token IDs; the tokenizer only decides *which* ID represents each input unit.
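Both properties — determinism and exact invertibility — can be checked directly. A self-contained sketch (toy corpus assumed):

```python
# The tokenizer is a fixed lookup table: the same input always yields the
# same IDs, and decode exactly inverts encode for in-vocabulary text.
corpus = "abc"  # toy stand-in corpus
stoi = {ch: i for i, ch in enumerate(sorted(set(corpus)))}
itos = {i: ch for ch, i in stoi.items()}
encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: "".join(itos[i] for i in ids)

assert decode(encode("cab")) == "cab"  # lossless round trip
assert encode("cab") == encode("cab")  # deterministic: no randomness, no learning
```

One caveat of the character-level scheme: `encode` raises a `KeyError` on any character absent from the training corpus, since there is no fallback vocabulary entry for unseen input.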
When you later see the training loop feed integer tensors into the model, remember that those integers originated here. And when you see the generation routine convert model outputs back to text, it calls the decode function defined in this section. Trace those call sites as you read forward — they ground abstract tensor operations in human-readable text.