This chapter introduces the foundational data structure of the entire library: the `Value` class in `engine.py`. You will learn how a simple Python class wraps a scalar number, records its history of operations, and lays the groundwork for automatic differentiation. This is the single most important file in the repository; everything else depends on understanding it.
At its core, `Value` is a thin wrapper around a Python float. But the crucial addition is that every `Value` remembers **how it was created**: which other `Value` objects were its inputs (`_prev`) and what operation produced it (`_op`). This metadata is what makes automatic differentiation possible — it is the computation graph stored implicitly in the objects themselves.
The `__init__` method initializes three important attributes alongside the scalar `data`: `grad` starts at `0.0` (no gradient has been computed yet), `_backward` is a no-op lambda by default (leaf nodes have no children to propagate to), and `_prev` is stored as a `frozenset` of parent nodes. Using a `frozenset` instead of a list is a deliberate choice: the set of parents is unordered and fixed after construction, and immutability prevents accidental mutation of the graph structure.
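A minimal sketch of a constructor matching this description. The exact parameter names (`_children`, `_op`) are assumptions about the signature, not a quote from the repository:

```python
class Value:
    """Scalar wrapper that records how it was produced."""

    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data                   # the wrapped scalar
        self.grad = 0.0                    # no gradient computed yet
        self._backward = lambda: None      # no-op: leaf nodes propagate nothing
        self._prev = frozenset(_children)  # immutable set of parent nodes
        self._op = _op                     # operation that produced this node
        self.label = label                 # debugging/visualization only
```

A freshly constructed leaf like `Value(2.0, label='x')` has an empty `_prev` and a `_backward` that does nothing when called.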
The `label` attribute is purely for debugging and visualization — it has no role in computation. This separation of concerns (computation vs. display) is a good design habit even in a minimal codebase.
Every arithmetic operation on a `Value` creates a new `Value` that represents the result, and simultaneously defines a **closure** that knows how to backpropagate gradients through that operation. This is the heart of the dynamic computation graph approach: the graph is built lazily, operation by operation, during the ordinary forward computation.
Consider `__add__`: it creates an output `Value` whose `data` is `self.data + other.data`. Then it defines a nested function `_backward` that, when called, adds `out.grad` (the gradient flowing in from downstream) to both `self.grad` and `other.grad`. This is a direct application of the **chain rule** — the local gradient of addition with respect to either input is 1, so the upstream gradient passes through unchanged. The closure **captures** `self`, `other`, and `out` by reference, which means it always has access to the current gradient state of those nodes.
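The addition case can be sketched as follows. A minimal constructor is repeated so the example is self-contained; the coercion of plain numbers into `Value` is an assumption beyond what the text states:

```python
class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = frozenset(_children)
        self._op = _op

    def __add__(self, other):
        # coerce a raw int/float so `Value(2.0) + 1` also works (an assumption)
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')

        def _backward():
            # local derivative of addition is 1 for both inputs, so the
            # upstream gradient out.grad passes through unchanged; += lets
            # gradients accumulate when a node is used more than once
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out
```

Calling `c = a + b` builds one graph node; nothing flows backward until `c._backward()` is invoked with `c.grad` already set by a downstream node (or seeded to `1.0` at the output).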
Multiplication follows the same pattern but with the product rule: the gradient for `self` is `other.data * out.grad`, and vice versa. Notice that both closures read `.data` at backprop time, not at graph-construction time — this is correct because `data` values are fixed after the forward pass. The `__pow__` operation handles the case where the exponent is a raw Python `int` or `float` (not a `Value`), which is an intentional simplification — raising a `Value` to a `Value` power would require a more complex gradient rule and is not needed for basic neural networks.
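Multiplication and the restricted power operation might look like this. Again a minimal constructor is included for self-containment, and the `assert` message wording is an assumption:

```python
class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = frozenset(_children)
        self._op = _op

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')

        def _backward():
            # product rule: each input's gradient is the *other* input's
            # data, scaled by the upstream gradient; .data is read here,
            # at backprop time, after the forward pass has fixed it
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def __pow__(self, other):
        assert isinstance(other, (int, float)), "only int/float powers supported"
        out = Value(self.data ** other, (self,), f'**{other}')

        def _backward():
            # power rule: d(x^n)/dx = n * x^(n-1)
            self.grad += (other * self.data ** (other - 1)) * out.grad
        out._backward = _backward
        return out
```

For example, with `c = a * b` and `c.grad = 1.0`, calling `c._backward()` leaves `a.grad` equal to `b.data` and vice versa.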
The `relu` method implements the Rectified Linear Unit activation function: it returns `max(0, self.data)` wrapped in a new `Value`. The backward closure applies the subgradient rule: if the output was positive, the gradient flows through unchanged; if the output was zero or negative, the gradient is zeroed out (the neuron is 'dead' for this input).
This is implemented with `(out.data > 0) * out.grad`, a neat Python idiom where a boolean is multiplied by a float, yielding either `out.grad` or `0.0`. The `relu` method is defined directly on `Value` rather than being composed from primitives like `max`, because it is a fundamental building block for neural networks and its gradient is simple enough to write directly as a single rule. More complex activation functions (like `tanh` or `sigmoid`) could be added by the same pattern: define the forward computation, then write the corresponding backward closure.
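Putting the pieces together, `relu` follows the same forward-plus-closure pattern; the minimal constructor is repeated so the sketch runs on its own:

```python
class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = frozenset(_children)
        self._op = _op

    def relu(self):
        # forward: max(0, data), recorded as a 'ReLU' node with one parent
        out = Value(0.0 if self.data < 0 else self.data, (self,), 'ReLU')

        def _backward():
            # gradient flows only where the output was positive; the
            # boolean-times-float idiom yields out.grad or 0.0
            self.grad += (out.data > 0) * out.grad
        out._backward = _backward
        return out
```

A negative input produces a zero output and blocks the gradient entirely, which is exactly the "dead neuron" behavior described above.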