Table of Contents

1. Orientation: One File, One Model, One Pipeline
2. Tokenization: From Text to Integers and Back
3. Embeddings and Positional Encoding: Giving Tokens Meaning and Position
4. The Transformer Block: Attention, Feed-Forward, and Layer Normalization
5. The GPT Model Class: Assembling the Full Forward Pass
6. Training Loop: Optimization and the Learning Process
7. Text Generation: Autoregressive Inference and Sampling

Embeddings and Positional Encoding: Giving Tokens Meaning and Position

Raw token IDs are just integers — they carry no geometric meaning that a neural network can exploit. This chapter explains how MicroGPT lifts those integers into continuous vector space through learned token embeddings, and how it injects positional information so the model can distinguish 'cat sat' from 'sat cat'. These two embedding tables are the model's entry point and deserve careful attention before moving to the more complex attention mechanism.

Token Embeddings: From Discrete IDs to Continuous Vectors

A **token embedding table** is simply a matrix of shape `[vocab_size, embedding_dim]`. Each row corresponds to one vocabulary entry and contains a learned vector of real numbers. When a token ID is fed into the model, the embedding layer performs a lookup — it retrieves the corresponding row. In PyTorch this is `nn.Embedding`, which is implemented as a direct row lookup; mathematically it is equivalent to multiplying a one-hot vector by the embedding matrix, just without the wasted arithmetic.
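The lookup and its one-hot equivalence can be made concrete in a few lines of plain Python, in the spirit of MicroGPT's list-of-lists tensors. The table below is a toy stand-in; `vocab_size` and `n_embd` here are illustrative, not the values the chapter uses later:

```python
import random

# Toy embedding table in MicroGPT's list-of-lists style.
vocab_size, n_embd = 5, 3
random.seed(0)
wte = [[random.gauss(0, 0.08) for _ in range(n_embd)] for _ in range(vocab_size)]

token_id = 2
tok_emb = wte[token_id]  # the "embedding lookup" is plain row indexing

# Equivalent (but wasteful) formulation: one-hot vector times the matrix.
one_hot = [1.0 if i == token_id else 0.0 for i in range(vocab_size)]
via_matmul = [sum(one_hot[i] * wte[i][j] for i in range(vocab_size)) for j in range(n_embd)]
assert via_matmul == tok_emb
```

Because the one-hot product and the row lookup are identical, frameworks implement `nn.Embedding` as indexing and still backpropagate through it like any other matrix.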

Why learn these vectors rather than use a fixed encoding like one-hot? Because the learning process pushes semantically related tokens into nearby regions of the embedding space. After training, the vectors for 'k' and 'K' will lie closer to each other than either does to '7', because the model sees them used in similar contexts. This geometric structure in the embedding space is what allows the attention mechanism to reason about relationships between tokens.

The embedding dimension (often called `n_embd` or `d_model`) is a hyperparameter that controls the capacity of the model's internal representations. MicroGPT keeps this small by design — large enough to demonstrate the mechanism, small enough to train on a CPU or modest GPU in minutes.

microgpt.py — token embedding table definition
```python
topo = []
visited = set()
def build_topo(v):
    if v not in visited:
        visited.add(v)
        for child in v._children:
            build_topo(child)
        topo.append(v)
build_topo(self)
self.grad = 1
for v in reversed(topo):
    for child, local_grad in zip(v._children, v._local_grads):
        child.grad += local_grad * v.grad

# Initialize the parameters, to store the knowledge of the model
n_layer = 1 # depth of the transformer neural network (number of layers)
n_embd = 16 # width of the network (embedding dimension)
block_size = 16 # maximum context length of the attention window (note: the longest name is 15 characters)
n_head = 4 # number of attention heads
head_dim = n_embd // n_head # derived dimension of each head
matrix = lambda nout, nin, std=0.08: [[Value(random.gauss(0, std)) for _ in range(nin)] for _ in range(nout)]
state_dict = {'wte': matrix(vocab_size, n_embd), 'wpe': matrix(block_size, n_embd), 'lm_head': matrix(vocab_size, n_embd)}
for i in range(n_layer):
    state_dict[f'layer{i}.attn_wq'] = matrix(n_embd, n_embd)
    state_dict[f'layer{i}.attn_wk'] = matrix(n_embd, n_embd)
    state_dict[f'layer{i}.attn_wv'] = matrix(n_embd, n_embd)
    state_dict[f'layer{i}.attn_wo'] = matrix(n_embd, n_embd)
    state_dict[f'layer{i}.mlp_fc1'] = matrix(4 * n_embd, n_embd)
    state_dict[f'layer{i}.mlp_fc2'] = matrix(n_embd, 4 * n_embd)
params = [p for mat in state_dict.values() for row in mat for p in row] # flatten params into a single list[Value]
print(f"num params: {len(params)}")
```
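As a sanity check, the printed parameter count can be reproduced by tallying the matrix shapes by hand. The `vocab_size` of 27 below is an illustrative stand-in (the real value comes from the tokenizer chapter), so the total is for demonstration only:

```python
n_layer, n_embd, block_size = 1, 16, 16
vocab_size = 27  # illustrative stand-in; the real value comes from the tokenizer

# Three embedding-shaped tables: wte [vocab_size, n_embd],
# wpe [block_size, n_embd], lm_head [vocab_size, n_embd].
embedding_params = (vocab_size + block_size + vocab_size) * n_embd

# Per layer: four [n_embd, n_embd] attention matrices, plus two MLP matrices
# of shape [4*n_embd, n_embd] and [n_embd, 4*n_embd].
per_layer = 4 * n_embd * n_embd + 2 * (4 * n_embd * n_embd)

total = embedding_params + n_layer * per_layer
print(f"num params: {total}")  # → num params: 4192 with these assumed sizes
```

Note how much of a tiny model's budget the embedding tables consume: here `wte`, `wpe`, and `lm_head` together account for over a quarter of all parameters.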

Positional Embeddings: Why Transformers Need Explicit Position Information

Self-attention, the core operation of a transformer, is **permutation-equivariant** by default: if you shuffle the input tokens, the attention outputs shuffle in the same way, but the relationship between any pair of tokens is unchanged. This means a pure attention model has no concept of word order — 'the dog bit the man' and 'the man bit the dog' would produce identical representations.
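Permutation equivariance is easy to verify numerically. The sketch below is a deliberate simplification of the real block — a single unmasked attention head with identity projections (q = k = v = x) — but it shows the key property: swapping two input tokens swaps the corresponding outputs and changes nothing else.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(xs):
    # Unmasked single-head attention with identity projections: q = k = v = x.
    out = []
    for q in xs:
        logits = [sum(qj * kj for qj, kj in zip(q, k)) for k in xs]
        weights = softmax(logits)
        out.append([sum(w * v[j] for w, v in zip(weights, xs)) for j in range(len(q))])
    return out

seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
swapped = [seq[1], seq[0], seq[2]]  # permute the first two tokens

a, b = attention(seq), attention(swapped)
assert a[0] == b[1] and a[1] == b[0] and a[2] == b[2]  # outputs permute identically
```

Without positional information, the model literally cannot tell which arrangement it was given — hence the need for the positional table described next.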

To fix this, transformers add a **positional embedding** to each token embedding. MicroGPT uses a second learned embedding table of shape `[block_size, embedding_dim]`, where `block_size` is the maximum sequence length. Position 0 gets one learned vector, position 1 gets another, and so on. These are added element-wise to the token embeddings before the first transformer block sees them.

This approach — learned positional embeddings — is the same one used in the original GPT paper. An alternative is sinusoidal positional encodings (used in 'Attention Is All You Need'), which are fixed mathematical functions of position rather than learned. Both work well; MicroGPT chooses learned embeddings for simplicity and consistency, since both embedding tables are initialized and updated the same way during training.
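For contrast, here is a minimal sketch of the fixed sinusoidal alternative, interleaving sines and cosines at geometrically spaced frequencies with the standard base of 10000. The helper name is made up for illustration; MicroGPT itself does not use this scheme:

```python
import math

def sinusoidal_pos(pos, dim):
    # Interleave sin/cos at geometrically spaced frequencies (base 10000).
    # Nothing here is learned, so it extends to any position.
    enc = []
    for i in range(0, dim, 2):
        freq = 1.0 / (10000 ** (i / dim))
        enc.append(math.sin(pos * freq))
        enc.append(math.cos(pos * freq))
    return enc[:dim]

# Position 0 encodes as alternating 0s and 1s; later positions rotate smoothly.
print([round(v, 3) for v in sinusoidal_pos(0, 8)])  # → [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```

One practical trade-off: fixed encodings generalize past the training length, while a learned `wpe` table is capped at `block_size` rows — an acceptable limit here, since MicroGPT's context never exceeds 16 tokens.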

After token and positional embeddings are added, each position holds a vector of length `embedding_dim` that encodes both *what* a token is and *where* it appears. In a batched framework implementation this is a tensor of shape `[batch_size, sequence_length, embedding_dim]`; MicroGPT, which processes one token per step, works with a single such vector. Either way, this representation flows into the transformer blocks described in the next chapter.

microgpt.py — positional embedding table and embedding summation
```python
# Define the model architecture: a function mapping tokens and parameters to logits over what comes next
# Follow GPT-2, blessed among the GPTs, with minor differences: layernorm -> rmsnorm, no biases, GeLU -> ReLU
def linear(x, w):
    return [sum(wi * xi for wi, xi in zip(wo, x)) for wo in w]

def softmax(logits):
    max_val = max(val.data for val in logits)
    exps = [(val - max_val).exp() for val in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rmsnorm(x):
    ms = sum(xi * xi for xi in x) / len(x)
    scale = (ms + 1e-5) ** -0.5
    return [xi * scale for xi in x]

def gpt(token_id, pos_id, keys, values):
    tok_emb = state_dict['wte'][token_id] # token embedding
    pos_emb = state_dict['wpe'][pos_id] # position embedding
    x = [t + p for t, p in zip(tok_emb, pos_emb)] # joint token and position embedding
    x = rmsnorm(x) # note: not redundant due to backward pass via the residual connection

    for li in range(n_layer):
        # 1) Multi-head Attention block
        x_residual = x
        x = rmsnorm(x)
        q = linear(x, state_dict[f'layer{li}.attn_wq'])
        k = linear(x, state_dict[f'layer{li}.attn_wk'])
        v = linear(x, state_dict[f'layer{li}.attn_wv'])
        keys[li].append(k)
        values[li].append(v)
        x_attn = []
        for h in range(n_head):
            hs = h * head_dim
            q_h = q[hs:hs+head_dim]
            k_h = [ki[hs:hs+head_dim] for ki in keys[li]]
            v_h = [vi[hs:hs+head_dim] for vi in values[li]]
            attn_logits = [sum(q_h[j] * k_h[t][j] for j in range(head_dim)) / head_dim**0.5 for t in range(len(k_h))]
            attn_weights = softmax(attn_logits)
```