
This repository contains a single-file, self-contained implementation of a GPT-style transformer language model called MicroGPT. It covers the full pipeline, from tokenization through model architecture, training loop, and text-generation inference, in one compact Python file. The guiding design decision is minimalism for educational clarity: beyond core tensor operations, the implementation deliberately avoids deep learning framework abstractions, so every component is visible and understandable in one place. The project suits developers, students, and researchers who want to understand how GPT-style models work at a fundamental level without navigating a large codebase, or who need a lightweight starting point for experimenting with transformer architectures.


Reading Guide

1. Orientation: One File, One Model, One Pipeline
Before diving into code, this chapter orients you to the philosophy and structure of MicroGPT. You will learn why this codebase is intentionally minimalist, how it is organized top-to-bottom as a linear narrative, and what mental model to carry into the subsequent chapters. Understanding the design intent upfront prevents confusion about what is 'missing' — nothing is missing; everything is a deliberate choice for clarity.
2. Tokenization: From Text to Integers and Back
Every language model begins with a mapping between human-readable text and machine-readable integers. This chapter covers MicroGPT's tokenization layer — how a vocabulary is constructed from raw text, how strings are encoded into token ID sequences, and how those sequences are decoded back into text. Because MicroGPT uses a character-level tokenizer, this layer is simple enough to understand completely in minutes, which makes it an ideal starting point before encountering the more complex model machinery.
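The round trip described above can be sketched in a few lines. This is an illustrative character-level tokenizer, not MicroGPT's actual code; the names `stoi`, `itos`, `encode`, and `decode` are assumptions chosen for clarity.

```python
text = "hello world"                    # stand-in for the training corpus
chars = sorted(set(text))               # vocabulary: the unique characters
stoi = {ch: i for i, ch in enumerate(chars)}   # string -> integer lookup
itos = {i: ch for ch, i in stoi.items()}       # integer -> string lookup

def encode(s):
    """Map a string to a list of token IDs."""
    return [stoi[ch] for ch in s]

def decode(ids):
    """Map a list of token IDs back to a string."""
    return "".join(itos[i] for i in ids)

assert decode(encode("hello")) == "hello"   # encoding and decoding are inverses
```

Because the vocabulary is just the set of characters seen in the corpus, its size stays tiny (here, eight entries), which is what makes this layer understandable in minutes.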
3. Embeddings and Positional Encoding: Giving Tokens Meaning and Position
Raw token IDs are just integers — they carry no geometric meaning that a neural network can exploit. This chapter explains how MicroGPT lifts those integers into continuous vector space through learned token embeddings, and how it injects positional information so the model can distinguish 'cat sat' from 'sat cat'. These two embedding tables are the model's entry point and deserve careful attention before moving to the more complex attention mechanism.
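A minimal sketch of the two lookups described above, using NumPy as a stand-in for whatever tensor operations MicroGPT actually uses. The table names and dimensions are illustrative assumptions, not the repository's API.

```python
import numpy as np

vocab_size, block_size, n_embd = 65, 8, 16   # illustrative hyperparameters
rng = np.random.default_rng(0)

# In a real model these two tables are learned parameters; here they are random.
tok_emb_table = rng.normal(size=(vocab_size, n_embd))   # one vector per token ID
pos_emb_table = rng.normal(size=(block_size, n_embd))   # one vector per position

token_ids = np.array([3, 17, 5])                        # a sequence of length T=3
tok_emb = tok_emb_table[token_ids]                      # (T, n_embd): what each token is
pos_emb = pos_emb_table[np.arange(len(token_ids))]      # (T, n_embd): where each token sits
x = tok_emb + pos_emb                                   # the input to the first block
```

The sum is what lets the model tell 'cat sat' from 'sat cat': the same token vector lands in a different region of embedding space depending on its position.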
4. The Transformer Block: Attention, Feed-Forward, and Layer Normalization
The transformer block is the heart of GPT. This chapter dissects MicroGPT's implementation of multi-head causal self-attention, the position-wise feed-forward network, and the layer normalization that stabilizes training. Each of these components has a specific, well-motivated role, and understanding why they exist together is as important as understanding what each one computes individually.
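To make the attention component concrete, here is a single-head causal self-attention sketch in NumPy, under the assumption of one sequence of shape (T, n_embd). It is a simplified stand-in for MicroGPT's multi-head implementation, not the repository's code.

```python
import numpy as np

def causal_self_attention(x, Wq, Wk, Wv):
    """Single-head causal self-attention over a (T, n_embd) sequence."""
    T = x.shape[0]
    q, k, v = x @ Wq, x @ Wk, x @ Wv            # project into query/key/value space
    scores = q @ k.T / np.sqrt(q.shape[-1])     # scaled dot-product similarities
    mask = np.triu(np.ones((T, T), dtype=bool), 1)
    scores[mask] = -np.inf                      # causal mask: no attending to the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v                          # each output is a mix of past values
```

The masking step is why this is *causal* attention: position t can only mix information from positions 0..t, which is what makes autoregressive training and generation consistent.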
5. The GPT Model Class: Assembling the Full Forward Pass
With embeddings and transformer blocks defined, this chapter examines the top-level GPT class that assembles these components into a complete model. You will trace the full forward pass from a batch of token IDs to a batch of logits (or a loss value), and understand the final language modeling head that projects transformer outputs back into vocabulary space. This is where all the pieces come together.
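The forward pass described above can be summarized as a shape trace. The dimensions below are illustrative assumptions, and NumPy stands in for the actual tensor operations; only the flow of shapes is the point.

```python
import numpy as np

# Illustrative dimensions: batch, sequence length, embedding width, vocabulary size.
B, T, n_embd, vocab_size = 2, 8, 16, 65
rng = np.random.default_rng(0)

x = rng.normal(size=(B, T, n_embd))      # output of the embedding layer (Ch. 3)
# ...each transformer block (Ch. 4) maps (B, T, n_embd) -> (B, T, n_embd) here...
W_lm = rng.normal(size=(n_embd, vocab_size))
logits = x @ W_lm                        # language modeling head: (B, T, vocab_size)
```

Every token position gets its own row of vocabulary-sized logits; during training these are compared against the next-token targets, and during inference the last row drives sampling.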
6. Training Loop: Optimization and the Learning Process
A model architecture is only half the picture — the training loop is what teaches the model's parameters to produce useful outputs. This chapter covers MicroGPT's data loading strategy, the optimization step, and the practical decisions (batch size, learning rate, evaluation intervals) that make training tractable. Understanding this loop demystifies how the model goes from random weights to coherent text generation.
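The data-loading strategy mentioned above is typically random-window batching, sketched here in plain Python. The function name and the toy `data` list are illustrative assumptions, not MicroGPT's actual identifiers.

```python
import random

data = list(range(1000))          # stand-in for the full tokenized corpus
block_size, batch_size = 8, 4     # illustrative hyperparameters

def get_batch():
    """Sample batch_size random windows; targets are inputs shifted one step right."""
    ix = [random.randrange(len(data) - block_size) for _ in range(batch_size)]
    xb = [data[i : i + block_size] for i in ix]           # model inputs
    yb = [data[i + 1 : i + block_size + 1] for i in ix]   # next-token targets
    return xb, yb
```

The one-position shift between `xb` and `yb` is the entire supervision signal: at every position the model is trained to predict the token that actually came next.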
7. Text Generation: Autoregressive Inference and Sampling
The final chapter covers inference — how a trained MicroGPT model generates new text. You will learn the autoregressive generation loop, the role of the temperature hyperparameter in controlling output randomness, and how top-k sampling provides a practical balance between diversity and coherence. This chapter closes the loop from the tokenizer (Chapter 2) through the model (Chapters 3–5) back to human-readable text.
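Temperature and top-k sampling as described above can be sketched in standard-library Python. The function `sample_next` and its signature are illustrative assumptions, not MicroGPT's actual API.

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=None):
    """Pick the next token ID from a list of logits (illustrative sketch)."""
    if top_k is not None:
        kth = sorted(logits, reverse=True)[top_k - 1]
        logits = [l if l >= kth else float("-inf") for l in logits]  # drop the rest
    scaled = [l / temperature for l in logits]   # low temp sharpens, high temp flattens
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]  # unnormalized softmax probabilities
    return random.choices(range(len(weights)), weights=weights)[0]
```

In the autoregressive loop, this is called once per step: the sampled ID is appended to the context, fed back through the model, and the process repeats until the desired length is reached.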

Architecture

Root
A minimal GPT implementation providing a self-contained micro-scale language model with training and inference capabilities.

Entry Points

microgpt.py