Table of Contents

1. Orientation: One File, One Model, One Pipeline
2. Tokenization: From Text to Integers and Back
3. Embeddings and Positional Encoding: Giving Tokens Meaning and Position
4. The Transformer Block: Attention, Feed-Forward, and Layer Normalization
5. The GPT Model Class: Assembling the Full Forward Pass
6. Training Loop: Optimization and the Learning Process
7. Text Generation: Autoregressive Inference and Sampling

Embeddings and Positional Encoding: Giving Tokens Meaning and Position

Raw token IDs are just integers — they carry no geometric meaning that a neural network can exploit. This chapter explains how MicroGPT lifts those integers into continuous vector space through learned token embeddings, and how it injects positional information so the model can distinguish 'cat sat' from 'sat cat'. These two embedding tables are the model's entry point and deserve careful attention before moving to the more complex attention mechanism.

Token Embeddings: From Discrete IDs to Continuous Vectors

A **token embedding table** is simply a matrix of shape `[vocab_size, embedding_dim]`. Each row corresponds to one vocabulary entry and contains a learned vector of real numbers. When a token ID is fed into the model, the embedding layer performs a lookup — it retrieves the corresponding row. In PyTorch this is `nn.Embedding`, which is implemented as a direct row lookup; mathematically it is equivalent to multiplying a one-hot vector by the embedding matrix, just without the wasted arithmetic.
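The lookup and its one-hot equivalence can be made concrete in a few lines of plain Python, in the spirit of MicroGPT's list-of-lists tensors. The table below is a toy stand-in; `vocab_size` and `n_embd` here are illustrative, not the values the chapter uses later:

```python
import random

# Toy embedding table in MicroGPT's list-of-lists style.
vocab_size, n_embd = 5, 3
random.seed(0)
wte = [[random.gauss(0, 0.08) for _ in range(n_embd)] for _ in range(vocab_size)]

token_id = 2
tok_emb = wte[token_id]  # the "embedding lookup" is plain row indexing

# Equivalent (but wasteful) formulation: one-hot vector times the matrix.
one_hot = [1.0 if i == token_id else 0.0 for i in range(vocab_size)]
via_matmul = [sum(one_hot[i] * wte[i][j] for i in range(vocab_size)) for j in range(n_embd)]
assert via_matmul == tok_emb
```

Because the one-hot product and the row lookup are identical, frameworks implement `nn.Embedding` as indexing and still backpropagate through it like any other matrix.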

Why learn these vectors rather than use a fixed encoding like one-hot? Because the learning process pushes semantically related tokens into nearby regions of the embedding space. After training, the vectors for 'k' and 'K' will lie closer to each other than either does to '7', because the model sees them used in similar contexts. This geometric structure in the embedding space is what allows the attention mechanism to reason about relationships between tokens.

The embedding dimension (often called `n_embd` or `d_model`) is a hyperparameter that controls the capacity of the model's internal representations. MicroGPT keeps this small by design — large enough to demonstrate the mechanism, small enough to train on a CPU or modest GPU in minutes.

microgpt.py — token embedding table definition
```python
topo = []
visited = set()
def build_topo(v):
    if v not in visited:
        visited.add(v)
        for child in v._children:
            build_topo(child)
        topo.append(v)
build_topo(self)
self.grad = 1
for v in reversed(topo):
    for child, local_grad in zip(v._children, v._local_grads):
        child.grad += local_grad * v.grad

# Initialize the parameters, to store the knowledge of the model
n_layer = 1 # depth of the transformer neural network (number of layers)
n_embd = 16 # width of the network (embedding dimension)
block_size = 16 # maximum context length of the attention window (note: the longest name is 15 characters)
n_head = 4 # number of attention heads
head_dim = n_embd // n_head # derived dimension of each head
matrix = lambda nout, nin, std=0.08: [[Value(random.gauss(0, std)) for _ in range(nin)] for _ in range(nout)]
state_dict = {'wte': matrix(vocab_size, n_embd), 'wpe': matrix(block_size, n_embd), 'lm_head': matrix(vocab_size, n_embd)}
for i in range(n_layer):
    state_dict[f'layer{i}.attn_wq'] = matrix(n_embd, n_embd)
    state_dict[f'layer{i}.attn_wk'] = matrix(n_embd, n_embd)
    state_dict[f'layer{i}.attn_wv'] = matrix(n_embd, n_embd)
    state_dict[f'layer{i}.attn_wo'] = matrix(n_embd, n_embd)
    state_dict[f'layer{i}.mlp_fc1'] = matrix(4 * n_embd, n_embd)
    state_dict[f'layer{i}.mlp_fc2'] = matrix(n_embd, 4 * n_embd)
params = [p for mat in state_dict.values() for row in mat for p in row] # flatten params into a single list[Value]
print(f"num params: {len(params)}")
```
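As a sanity check, the printed parameter count can be reproduced by tallying the matrix shapes by hand. The `vocab_size` of 27 below is an illustrative stand-in (the real value comes from the tokenizer chapter), so the total is for demonstration only:

```python
n_layer, n_embd, block_size = 1, 16, 16
vocab_size = 27  # illustrative stand-in; the real value comes from the tokenizer

# Three embedding-shaped tables: wte [vocab_size, n_embd],
# wpe [block_size, n_embd], lm_head [vocab_size, n_embd].
embedding_params = (vocab_size + block_size + vocab_size) * n_embd

# Per layer: four [n_embd, n_embd] attention matrices, plus two MLP matrices
# of shape [4*n_embd, n_embd] and [n_embd, 4*n_embd].
per_layer = 4 * n_embd * n_embd + 2 * (4 * n_embd * n_embd)

total = embedding_params + n_layer * per_layer
print(f"num params: {total}")  # → num params: 4192 with these assumed sizes
```

Note how much of a tiny model's budget the embedding tables consume: here `wte`, `wpe`, and `lm_head` together account for over a quarter of all parameters.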

Positional Embeddings: Why Transformers Need Explicit Position Information

Self-attention, the core operation of a transformer, is **permutation-equivariant** by default: if you shuffle the input tokens, the attention outputs shuffle in the same way, but the relationship between any pair of tokens is unchanged. This means a pure attention model has no concept of word order — 'the dog bit the man' and 'the man bit the dog' would produce identical representations.
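Permutation equivariance is easy to verify numerically. The sketch below is a deliberate simplification of the real block — a single unmasked attention head with identity projections (q = k = v = x) — but it shows the key property: swapping two input tokens swaps the corresponding outputs and changes nothing else.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(xs):
    # Unmasked single-head attention with identity projections: q = k = v = x.
    out = []
    for q in xs:
        logits = [sum(qj * kj for qj, kj in zip(q, k)) for k in xs]
        weights = softmax(logits)
        out.append([sum(w * v[j] for w, v in zip(weights, xs)) for j in range(len(q))])
    return out

seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
swapped = [seq[1], seq[0], seq[2]]  # permute the first two tokens

a, b = attention(seq), attention(swapped)
assert a[0] == b[1] and a[1] == b[0] and a[2] == b[2]  # outputs permute identically
```

Without positional information, the model literally cannot tell which arrangement it was given — hence the need for the positional table described next.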

To fix this, transformers add a **positional embedding** to each token embedding. MicroGPT uses a second learned embedding table of shape `[block_size, embedding_dim]`, where `block_size` is the maximum sequence length. Position 0 gets one learned vector, position 1 gets another, and so on. These are added element-wise to the token embeddings before the first transformer block sees them.

This approach — learned positional embeddings — is the same one used in the original GPT paper. An alternative is sinusoidal positional encodings (used in 'Attention Is All You Need'), which are fixed mathematical functions of position rather than learned. Both work well; MicroGPT chooses learned embeddings for simplicity and consistency, since both embedding tables are initialized and updated the same way during training.
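For contrast, here is a minimal sketch of the fixed sinusoidal alternative, interleaving sines and cosines at geometrically spaced frequencies with the standard base of 10000. The helper name is made up for illustration; MicroGPT itself does not use this scheme:

```python
import math

def sinusoidal_pos(pos, dim):
    # Interleave sin/cos at geometrically spaced frequencies (base 10000).
    # Nothing here is learned, so it extends to any position.
    enc = []
    for i in range(0, dim, 2):
        freq = 1.0 / (10000 ** (i / dim))
        enc.append(math.sin(pos * freq))
        enc.append(math.cos(pos * freq))
    return enc[:dim]

# Position 0 encodes as alternating 0s and 1s; later positions rotate smoothly.
print([round(v, 3) for v in sinusoidal_pos(0, 8)])  # → [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```

One practical trade-off: fixed encodings generalize past the training length, while a learned `wpe` table is capped at `block_size` rows — an acceptable limit here, since MicroGPT's context never exceeds 16 tokens.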

After token and positional embeddings are added, each position holds a vector of length `embedding_dim` that encodes both *what* a token is and *where* it appears. In a batched framework implementation this is a tensor of shape `[batch_size, sequence_length, embedding_dim]`; MicroGPT, which processes one token per step, works with a single such vector. Either way, this representation flows into the transformer blocks described in the next chapter.

microgpt.py — positional embedding table and embedding summation
```python
# Define the model architecture: a function mapping tokens and parameters to logits over what comes next
# Follow GPT-2, blessed among the GPTs, with minor differences: layernorm -> rmsnorm, no biases, GeLU -> ReLU
def linear(x, w):
    return [sum(wi * xi for wi, xi in zip(wo, x)) for wo in w]

def softmax(logits):
    max_val = max(val.data for val in logits)
    exps = [(val - max_val).exp() for val in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rmsnorm(x):
    ms = sum(xi * xi for xi in x) / len(x)
    scale = (ms + 1e-5) ** -0.5
    return [xi * scale for xi in x]

def gpt(token_id, pos_id, keys, values):
    tok_emb = state_dict['wte'][token_id] # token embedding
    pos_emb = state_dict['wpe'][pos_id] # position embedding
    x = [t + p for t, p in zip(tok_emb, pos_emb)] # joint token and position embedding
    x = rmsnorm(x) # note: not redundant due to backward pass via the residual connection

    for li in range(n_layer):
        # 1) Multi-head Attention block
        x_residual = x
        x = rmsnorm(x)
        q = linear(x, state_dict[f'layer{li}.attn_wq'])
        k = linear(x, state_dict[f'layer{li}.attn_wk'])
        v = linear(x, state_dict[f'layer{li}.attn_wv'])
        keys[li].append(k)
        values[li].append(v)
        x_attn = []
        for h in range(n_head):
            hs = h * head_dim
            q_h = q[hs:hs+head_dim]
            k_h = [ki[hs:hs+head_dim] for ki in keys[li]]
            v_h = [vi[hs:hs+head_dim] for vi in values[li]]
            attn_logits = [sum(q_h[j] * k_h[t][j] for j in range(head_dim)) / head_dim**0.5 for t in range(len(k_h))]
            attn_weights = softmax(attn_logits)
```