In the Perceptron & MLP guide, we built multi-layer perceptrons and watched them learn. But we treated the weight update as a black box: how does the network know which weight to adjust, and by how much? This guide answers that question and is fully self-contained, so you can start here without reading the earlier posts. The answer is backpropagation: an algorithm that computes the gradient of the loss with respect to every weight in the network using the chain rule from calculus.

In this guide, you will:

  • Visualize the chain rule on a computational graph and trace gradients through each node
  • See how weights evolve as a network trains, and how the loss function shapes that evolution
  • Train a real network on 2D classification tasks in a configurable playground

1. The Chain Rule

Every neural network computation can be written as a computational graph: a directed graph where each node performs a simple operation, such as addition, multiplication, or an activation function. Backpropagation is just the chain rule applied backward through this graph.

Consider a simple expression:

\[f(x, y, z) = (x + y) \cdot z\]

Let

\[q = x + y\]

so that

\[f = q \cdot z\]

Now we can compute the gradients step by step:

\[\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \cdot \frac{\partial q}{\partial x} = z \cdot 1 = z\]

\[\frac{\partial f}{\partial y} = \frac{\partial f}{\partial q} \cdot \frac{\partial q}{\partial y} = z \cdot 1 = z\]

\[\frac{\partial f}{\partial z} = q = x + y\]

The important idea is that each node only needs to know its local derivative. By multiplying these local derivatives together, we get the gradient of the final output with respect to every input. The demo below shows this visually. Click any node to see its local derivative and how it contributes to the final gradient.

Click a node to see the chain rule derivation at that point.
Setup: f(x,y,z) = (x+y)·z. Drag the sliders, click any node to see its local gradient.
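
If you prefer code to clicking, here is a minimal sketch of the same computation: a forward pass that builds the intermediate value q, followed by a backward pass that multiplies local derivatives to recover the gradients derived above. The variable names (q, df_dx, and so on) are purely illustrative.

```python
# Forward pass: compute f(x, y, z) = (x + y) * z step by step.
x, y, z = -2.0, 5.0, -4.0
q = x + y          # intermediate node: q = x + y
f = q * z          # output node:       f = q * z

# Backward pass: multiply local derivatives along each path (chain rule).
df_df = 1.0                 # gradient of f with respect to itself
df_dq = z * df_df           # local derivative of q*z w.r.t. q is z
df_dz = q * df_df           # local derivative of q*z w.r.t. z is q
df_dx = 1.0 * df_dq         # local derivative of x+y w.r.t. x is 1
df_dy = 1.0 * df_dq         # local derivative of x+y w.r.t. y is 1

print(df_dx, df_dy, df_dz)  # -4.0 -4.0 3.0
```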

2. Backpropagation

One full training step has four phases: a forward pass that produces a prediction, a loss computation that compares the prediction to the target, a backward pass that propagates gradients to every weight using the chain rule, and a weight update that adjusts each weight according to \(w \leftarrow w - \eta \cdot \frac{\partial L}{\partial w}\).

In practice, this cycle is repeated many times until the weights converge to values that solve the task. The demo below trains a 2-4-1 network on the XOR problem with inputs [0,0], [0,1], [1,0], [1,1] and targets 0, 1, 1, 0, using binary cross-entropy loss \(L = -\bigl[y \log \hat{y} + (1-y)\log(1-\hat{y})\bigr]\) and He initialization, where each weight is drawn from a distribution scaled by \(\sqrt{2/n_{\text{in}}}\) to keep signal variance stable across layers.
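
One gradient is worth writing out before looking at the demo, because it makes the sigmoid-plus-BCE pairing so convenient (this is a standard result, not specific to the demo). With output \(\hat{y} = \sigma(z)\), the gradient at the output pre-activation collapses to a simple difference:

\[\frac{\partial L}{\partial z} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} = \frac{\hat{y} - y}{\hat{y}(1-\hat{y})} \cdot \hat{y}(1-\hat{y}) = \hat{y} - y\]

This is the value the backward pass starts from at the output node; every upstream gradient is obtained by multiplying it through the chain rule exactly as in Section 1.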

Click Train to step, Continuous to run.
Settings: 2-4-1 network on XOR, sigmoid, BCE loss, full-batch SGD.
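
Below is a minimal NumPy sketch of the same setup, written to mirror the four phases: forward pass, loss, backward pass, weight update. It is a sketch of what the demo does rather than its actual source; the fixed seed, step count, and learning rate are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR dataset: 4 examples, 2 inputs each, one binary target per example.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # shape (4, 2)
y = np.array([[0], [1], [1], [0]], dtype=float)              # shape (4, 1)

# He initialization: weights scaled by sqrt(2 / n_in) to keep variance stable.
W1 = rng.normal(0, np.sqrt(2 / 2), size=(2, 4)); b1 = np.zeros((1, 4))
W2 = rng.normal(0, np.sqrt(2 / 4), size=(4, 1)); b2 = np.zeros((1, 1))

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

eta = 1.0  # learning rate (assumed; the demo lets you choose)
for step in range(5000):
    # 1. Forward pass: weighted sums and activations, layer by layer.
    z1 = X @ W1 + b1
    a1 = sigmoid(z1)
    z2 = a1 @ W2 + b2
    y_hat = sigmoid(z2)

    # 2. Loss: mean binary cross-entropy over the full batch.
    loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

    # 3. Backward pass: chain rule, output layer back to input layer.
    dz2 = (y_hat - y) / len(X)          # sigmoid + BCE gradient (see above)
    dW2 = a1.T @ dz2
    db2 = dz2.sum(axis=0, keepdims=True)
    da1 = dz2 @ W2.T
    dz1 = da1 * a1 * (1 - a1)           # sigmoid'(z1) = a1 * (1 - a1)
    dW1 = X.T @ dz1
    db1 = dz1.sum(axis=0, keepdims=True)

    # 4. Weight update: step each parameter opposite to its gradient.
    W1 -= eta * dW1; b1 -= eta * db1
    W2 -= eta * dW2; b2 -= eta * db2

print(loss, y_hat.round(2).ravel())  # loss shrinks toward 0, outputs toward [0, 1, 1, 0]
```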

The choice of activation matters here too. See the Activation Functions guide for an in-depth look at how different activations affect gradient flow and learning dynamics.


3. Playground

Time to see the whole training loop in action. Choose a dataset, select an activation function, configure the network architecture, and press Train. The decision boundary on the left updates as the network learns, while the loss curve on the right shows how the loss changes over epochs. Try more difficult datasets such as Spiral with deeper or wider networks, or add noise to test how robust your model is.

Left: decision boundary. Right: training loss curve. Updates every 5 epochs.
Settings: configurable MLP, He init, BCE loss, full-batch SGD. Try Circle at LR 0.5 to 1.5, XOR at 1.0 to 2.0, Spiral at 1.5 to 2.5 with ReLU and 3 hidden layers.

4. Summary

Concept | Key Idea
Chain Rule | Gradients propagate backward by multiplying local derivatives at each node.
Forward Pass | Data flows input to output, computing weighted sums and activations.
Backward Pass | Gradients flow output to input, applying the chain rule at every connection.
Weight Update | Each weight is nudged opposite to its gradient: \(w \leftarrow w - \eta \nabla_w L\).
Computational Graph | Any expression can be decomposed into a graph for automatic differentiation.

Backpropagation is not just an algorithm; it is a way of thinking about computation. Every modern deep learning framework (PyTorch, TensorFlow, JAX) is built around the idea of recording a computational graph during the forward pass and then traversing it backward to compute gradients automatically. This process is called automatic differentiation, and backpropagation is its most important application in neural networks. The time complexity of backpropagation is linear in the number of operations in the forward pass, O(n), since each operation is visited once during the forward pass and once during the backward pass. This efficiency is what makes training networks with millions of parameters practical.
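
To make "recording a graph and traversing it backward" concrete, here is a minimal sketch of reverse-mode automatic differentiation for scalars, in the spirit of what these frameworks do internally (it is not the API of any of them). Each operation records its inputs and its local derivatives; calling backward() replays the graph in reverse.

```python
class Value:
    """A scalar that records how it was produced so gradients can flow back."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # nodes this value was computed from
        self._local_grads = local_grads  # d(self)/d(parent) for each parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self, upstream=1.0):
        # Chain rule: accumulate upstream gradient times each local derivative.
        self.grad += upstream
        for parent, local in zip(self._parents, self._local_grads):
            parent.backward(upstream * local)


x, y, z = Value(-2.0), Value(5.0), Value(-4.0)
f = (x + y) * z      # forward pass builds the graph
f.backward()         # backward pass applies the chain rule
print(x.grad, y.grad, z.grad)  # -4.0 -4.0 3.0
```

The recursive backward() is enough for the tree above; for graphs where a node is reused, a real implementation would visit nodes in reverse topological order so each gradient is fully accumulated before it propagates further.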

Continue the ML Series

This post is part of a bigger Interactive Machine Learning series. If you would like to learn more, check out the other posts in this series. Next up is Activation Functions, where we will explore different activation functions and understand how they affect gradient flow and network expressivity.