With the computation graph built during the forward pass, this chapter explains how gradients are computed in reverse order via the .backward() method. You will learn why a topological sort is necessary, how the gradient of the final output is seeded, and how the chain rule is applied automatically across the entire graph.
Backpropagation requires that we compute each node's gradient **after** all nodes that depend on it have already received their gradients. In other words, we must process the graph in reverse topological order, from outputs back to inputs. The `backward` method builds this ordering with a standard depth-first search (DFS) that recursively visits each node's `_prev` parents and appends the current node to a list only after every node it depends on has been appended.
The `visited` set prevents revisiting nodes that appear multiple times in the graph (e.g., when a value is used in two different operations). Without this guard, a shared node would be traversed once per path that reaches it, appearing in the list repeatedly and making the traversal exponentially expensive on diamond-shaped graphs. Once the full topological order is collected, it is reversed with `[::-1]` so that the output node comes first and leaf nodes come last: exactly the order in which the `_backward` closures should be called.
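The sort can be sketched in isolation with a bare-bones node class (a minimal sketch assuming only a `_prev` attribute, not the full `Value` class). A small diamond-shaped graph shows the shared node being appended exactly once:

```python
class Value:
    """Minimal node: holds data and the parents that produced it."""
    def __init__(self, data, _prev=()):
        self.data = data
        self._prev = set(_prev)

def topo_order(root):
    # DFS: a node is appended only after every node it depends on
    # (its _prev parents) has already been appended.
    topo, visited = [], set()
    def build(v):
        if v not in visited:
            visited.add(v)      # guard: each node is expanded once
            for p in v._prev:
                build(p)
            topo.append(v)
    build(root)
    return topo[::-1]           # reversed: output first, leaves last

# Diamond graph: a feeds both b and c, which both feed d.
a = Value(2.0)
b = Value(3.0, (a,))
c = Value(4.0, (a,))
d = Value(5.0, (b, c))

order = topo_order(d)
# order starts at d and ends at the shared leaf a, which appears once.
```

Without the `visited` check, `a` would be reached twice here (once via `b`, once via `c`), and on a chain of such diamonds the number of visits would double at each level.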
This sort is performed **inside** `.backward()` on every call, rather than being precomputed and cached. This is a deliberate simplicity trade-off: for the educational scale micrograd targets, recomputing the sort is negligible, and it avoids the complexity of invalidating a cached order when the graph changes.
Before traversing the graph, `backward` sets `self.grad = 1.0`. This seeds the backpropagation with the fundamental identity: the derivative of a scalar output with respect to itself is 1. Every other gradient in the graph will be expressed relative to this root gradient through the chain rule.
The main loop is beautifully simple: iterate through the topologically-sorted list of nodes and call each node's `_backward()` closure. Each closure reads `out.grad` (already set by a previous iteration) and **accumulates** — using `+=` — into the gradients of its parent nodes. Accumulation with `+=` rather than assignment `=` is critical: if the same `Value` is used as input to multiple operations (a shared node), its gradient must be the **sum** of all incoming gradient contributions. This is the correct behavior by the multivariate chain rule, and `+=` achieves it without any special-casing.
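The seed and the main loop can be sketched end to end with a stripped-down `Value` supporting only `+` and `*` (a condensed sketch, not the full micrograd class). Squaring a value routes two gradient contributions into the same node, so the `+=` accumulation is exercised directly:

```python
class Value:
    def __init__(self, data, _prev=()):
        self.data = data
        self.grad = 0.0
        self._prev = set(_prev)
        self._backward = lambda: None   # leaves: nothing to propagate

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad       # += accumulates, never overwrites
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for p in v._prev:
                    build(p)
                topo.append(v)
        build(self)
        self.grad = 1.0                 # seed: d(out)/d(out) = 1
        for v in topo[::-1]:            # output first, leaves last
            v._backward()

# x is a shared node: y = x * x, so dy/dx = 2x = 6.0,
# reached as 3.0 + 3.0 via the two += accumulations.
x = Value(3.0)
y = x * x
y.backward()
```

With assignment `=` instead of `+=`, the second contribution in `__mul__`'s closure would overwrite the first and `x.grad` would come out as 3.0 instead of the correct 6.0.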
Leaf nodes (inputs and parameters) end up with their `.grad` populated after the loop finishes. These are the gradients you use to update model parameters during training.
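A typical update over such leaves is plain gradient descent (the `params` list and its gradient values here are made up for illustration; assume `backward()` has just populated each `.grad`):

```python
class Value:
    """Leaf parameter: just data and an accumulated gradient."""
    def __init__(self, data):
        self.data = data
        self.grad = 0.0

# Pretend a backward pass left these gradients behind.
params = [Value(1.0), Value(-2.0)]
params[0].grad, params[1].grad = 0.5, -1.0

lr = 0.1
for p in params:
    p.data -= lr * p.grad   # step against the gradient
    p.grad = 0.0            # zero before the next backward pass,
                            # since _backward closures accumulate with +=
```

Zeroing the gradients between iterations matters precisely because of the `+=` accumulation: stale gradients from the previous pass would otherwise be added into the new ones.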
The lower half of `engine.py` defines a series of convenience methods that complete Python's operator protocol. Methods like `__neg__`, `__radd__`, `__rsub__`, `__rmul__`, and `__truediv__` ensure that expressions like `2.0 * value` or `1 / value` work correctly even when the `Value` is on the **right-hand side** of the operator. Python calls `__radd__` on the right operand when the left operand does not know how to handle the operation.
Crucially, these are all implemented by **reducing to the primitive operations** already defined. For example, `__neg__` is `-1 * self`, `__sub__` is `self + (-other)`, and `__truediv__` is `self * other**-1`. This means no new backward closures need to be written — the gradient logic is inherited from `__mul__`, `__add__`, and `__pow__`. This compositional approach is elegant: a small set of primitives with correct gradients generates the full arithmetic DSL.
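The reduction pattern can be sketched as follows: the primitives `__add__`, `__mul__`, and `__pow__` carry all the gradient logic, and the convenience methods are pure forwarding (a condensed, forward-only sketch in the spirit of `engine.py`, not a verbatim copy):

```python
class Value:
    def __init__(self, data, _prev=()):
        self.data = data
        self.grad = 0.0
        self._prev = set(_prev)
        self._backward = lambda: None

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def __pow__(self, k):
        out = Value(self.data ** k, (self,))
        def _backward():
            self.grad += k * self.data ** (k - 1) * out.grad
        out._backward = _backward
        return out

    # Everything below reduces to the primitives above,
    # so the gradient logic is inherited for free.
    def __neg__(self): return self * -1
    def __sub__(self, other): return self + (-other)
    def __radd__(self, other): return self + other          # 2 + v
    def __rmul__(self, other): return self * other          # 2.0 * v
    def __rsub__(self, other): return (-self) + other       # 2 - v
    def __truediv__(self, other): return self * other ** -1
    def __rtruediv__(self, other): return self ** -1 * other  # 1 / v

v = Value(4.0)
r = 1 / v      # int.__truediv__ fails, so Python calls v.__rtruediv__
s = 2.0 * v    # float.__mul__ fails, so Python calls v.__rmul__
```

Because `1 / v` becomes `v ** -1 * 1`, its backward pass flows through the closures of `__pow__` and `__mul__`; no division-specific gradient code exists anywhere.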