Table of Contents

1. The Value Class: Wrapping Scalars in a Computation Graph
2. Backpropagation: Topological Sort and Gradient Flow
3. Neural Network Abstractions: Neuron, Layer, and MLP
4. Testing: Verifying Correctness Against PyTorch
5. Package Structure and the Public API

Neural Network Abstractions: Neuron, Layer, and MLP

With the autograd engine in place, `nn.py` builds three composable neural network abstractions on top of it. This chapter walks through how weights and biases are represented as `Value` objects, how the forward pass of a neuron is computed, and how these primitives compose into full multi-layer perceptrons.

The Module Base Class: Parameters and Gradient Zeroing

The `Module` class is a minimal abstract base that all neural network components inherit from. It defines two methods: `zero_grad()` and `parameters()`. The `parameters()` method is abstract in spirit — each subclass overrides it to return a flat list of all `Value` objects that are trainable weights.

`zero_grad()` iterates over all parameters and resets their `.grad` to `0.0`. This must be called **before** each call to `.backward()` because gradients accumulate with `+=`. Without zeroing, gradients from previous training steps would pollute the current step's gradients. This is the same pattern as `optimizer.zero_grad()` in PyTorch, and seeing it here makes the reason for that call obvious: the accumulation behavior in `.backward()` is intentional and useful (for shared nodes within one forward pass), but cross-step accumulation is not.

micrograd/nn.py — Module (lines 4-11)
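The base class described above can be sketched roughly as follows; this is a minimal reconstruction from the chapter's description, and the actual source may differ in small details.

```python
class Module:
    def zero_grad(self):
        # Gradients accumulate via += during backward(), so they must
        # be reset to 0.0 before each new backward pass.
        for p in self.parameters():
            p.grad = 0.0

    def parameters(self):
        # Subclasses override this to return a flat list of their
        # trainable Value objects.
        return []
```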

Neuron: Weights, Bias, and a Single Forward Pass

A `Neuron` represents a single artificial neuron: it holds a list of weight `Value`s (one per input feature) and a single bias `Value`. Weights are initialized with `random.uniform(-1, 1)` and the bias is initialized to zero. Each is wrapped in a `Value`, making them part of the computation graph and giving them a `.grad` that will be populated by backpropagation.

The `__call__` method computes the dot product of the input vector `x` with the weights, adds the bias, and optionally applies ReLU. The dot product is computed with `sum((wi * xi for wi, xi in zip(self.w, x)), start=Value(0.0))`. Python's built-in `sum` starts from the integer `0` by default, so the first addition would be `0 + Value(...)`: `int.__add__` returns `NotImplemented` for a `Value` operand, and Python then falls back to `Value.__radd__`. Passing `start=Value(0.0)` sidesteps that round-trip — every addition dispatches directly to `Value.__add__` — and guarantees the result is a `Value` even when the input vector is empty.
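The dispatch behavior can be demonstrated with a stripped-down stand-in for `Value` (just enough arithmetic to show the mechanics, not the real engine class):

```python
class Value:
    def __init__(self, data):
        self.data = data
    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data)
    def __radd__(self, other):
        # Called when the left operand (e.g. the int 0) returns
        # NotImplemented from its own __add__.
        return self + other
    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data)

w = [Value(2.0), Value(3.0)]
x = [Value(1.0), Value(4.0)]

# With start=Value(0.0), every addition goes through Value.__add__,
# and the result is a Value even for empty inputs.
act = sum((wi * xi for wi, xi in zip(w, x)), start=Value(0.0))
print(act.data)  # 2*1 + 3*4 = 14.0
```

Note that `0 + Value(5.0)` also works here because `__radd__` is defined; the explicit `start` simply avoids relying on that fallback.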

The `nonlin` flag controls whether ReLU is applied. The last layer of a regression network typically omits the nonlinearity (raw linear output), while hidden layers use ReLU. This flag is passed down from `Layer` and ultimately from the `MLP` constructor.

micrograd/nn.py — Neuron (lines 13-28)
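A sketch of the `Neuron` described above; the `Value` here is again a stripped-down stand-in (data plus operators and `relu()`, no autograd) so the example runs on its own, while the real class lives in the engine chapter.

```python
import random

class Value:
    def __init__(self, data):
        self.data = data
        self.grad = 0.0
    def _wrap(self, other):
        return other if isinstance(other, Value) else Value(other)
    def __add__(self, other):
        return Value(self.data + self._wrap(other).data)
    __radd__ = __add__
    def __mul__(self, other):
        return Value(self.data * self._wrap(other).data)
    def relu(self):
        return Value(max(0.0, self.data))

class Neuron:
    def __init__(self, nin, nonlin=True):
        # One weight per input feature, drawn from uniform(-1, 1);
        # the bias starts at zero.
        self.w = [Value(random.uniform(-1, 1)) for _ in range(nin)]
        self.b = Value(0.0)
        self.nonlin = nonlin

    def __call__(self, x):
        # Dot product plus bias; start=Value(0.0) keeps the sum a Value.
        act = sum((wi * xi for wi, xi in zip(self.w, x)),
                  start=Value(0.0)) + self.b
        return act.relu() if self.nonlin else act

    def parameters(self):
        return self.w + [self.b]
```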

Layer and MLP: Composing Neurons into a Network

A `Layer` is simply a list of `Neuron`s that all receive the same input vector. Its `__call__` method runs each neuron's forward pass and returns either a list of outputs (for multi-neuron layers) or a single scalar `Value` (for single-neuron output layers). The scalar unwrapping — `return out[0] if len(out) == 1 else out` — is a quality-of-life feature: loss functions and training loops are simpler when the final prediction is a bare `Value` rather than a one-element list.

The `MLP` (Multi-Layer Perceptron) class composes multiple `Layer`s by interpreting a list of sizes `[n_in, h1, h2, ..., n_out]` as a sequence of layer dimensions. It zips adjacent pairs from this list to construct each layer. The `nonlin` flag is passed as `True` for all layers except the last, which is a common default for regression tasks. The overall `parameters()` method on `MLP` returns the flat concatenation of all layers' parameters, which in turn concatenates all neurons' parameters — demonstrating the recursive parameter collection pattern that mirrors PyTorch's `model.parameters()`.

micrograd/nn.py — Layer, MLP (lines 30-60)
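The composition described above can be sketched as follows. The `Value` and `Neuron` here are stripped-down stand-ins so the example runs on its own, and the `MLP` constructor taking a single sizes list follows this chapter's description — the real source may split the arguments differently.

```python
import random

class Value:
    def __init__(self, data):
        self.data = data
    def _wrap(self, o):
        return o if isinstance(o, Value) else Value(o)
    def __add__(self, o):
        return Value(self.data + self._wrap(o).data)
    __radd__ = __add__
    def __mul__(self, o):
        return Value(self.data * self._wrap(o).data)
    def relu(self):
        return Value(max(0.0, self.data))

class Neuron:
    def __init__(self, nin, nonlin=True):
        self.w = [Value(random.uniform(-1, 1)) for _ in range(nin)]
        self.b = Value(0.0)
        self.nonlin = nonlin
    def __call__(self, x):
        act = sum((wi * xi for wi, xi in zip(self.w, x)),
                  start=Value(0.0)) + self.b
        return act.relu() if self.nonlin else act
    def parameters(self):
        return self.w + [self.b]

class Layer:
    def __init__(self, nin, nout, **kwargs):
        self.neurons = [Neuron(nin, **kwargs) for _ in range(nout)]
    def __call__(self, x):
        out = [n(x) for n in self.neurons]
        # Unwrap single-output layers to a bare Value for convenience.
        return out[0] if len(out) == 1 else out
    def parameters(self):
        return [p for n in self.neurons for p in n.parameters()]

class MLP:
    def __init__(self, sizes):
        # Zip adjacent pairs of [n_in, h1, ..., n_out] into layer
        # dimensions; ReLU everywhere except the final (linear) layer.
        pairs = list(zip(sizes, sizes[1:]))
        self.layers = [Layer(nin, nout, nonlin=(i != len(pairs) - 1))
                       for i, (nin, nout) in enumerate(pairs)]
    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x
    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]
```

For example, `MLP([3, 4, 4, 1])` builds two ReLU hidden layers and a linear scalar output, and `parameters()` flattens 41 trainable `Value`s (16 + 20 + 5).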