Perceptrons and Neural Networks
A perceptron is one of the simplest models for binary classification: it computes a weighted sum of inputs, adds a bias, and applies an activation function. If you want background on the closely related linear classifier, see Logistic Regression. This chapter is fully self-contained, so you can continue directly from here.
The perceptron at its core is the same computation as logistic regression: a weighted sum of inputs passed through an activation function. The key difference is how this output is interpreted and how multiple neurons can be combined. When we stack neurons into layers, this simple computation becomes the foundation of neural networks.
1. The Single Neuron
A single neuron takes inputs, multiplies each by a weight, adds a bias, and passes the result through an activation function:
\[z = w_1 x_1 + w_2 x_2 + b = \mathbf{w} \cdot \mathbf{x} + b\] \[a = \sigma(z)\]where \(\sigma\) is an activation function (we will use the sigmoid \(\sigma(z) = \frac{1}{1+e^{-z}}\) for now). If \(a \geq 0.5\) we predict class 1, otherwise class 0.
With this choice of activation function, a single neuron is mathematically identical to logistic regression. The neural network perspective begins when we stack many such neurons into layers, allowing the model to learn more complex functions.
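The computation above can be sketched in a few lines of plain Python. The weights and inputs here are arbitrary illustrative values, not learned ones:

```python
import math

def sigmoid(z):
    """Sigmoid activation: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, b):
    """A single neuron: weighted sum of inputs plus bias, passed through sigmoid."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

# Arbitrary example values: z = 0.5*1.0 + (-0.25)*2.0 + 0.1 = 0.1
a = neuron(x=[1.0, 2.0], w=[0.5, -0.25], b=0.1)
prediction = 1 if a >= 0.5 else 0
```

Since sigmoid(0.1) is just above 0.5, this neuron predicts class 1 for that input.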
2. Learning Logic Gates
Boolean logic gates are among the simplest classification problems. Each gate defines a dataset with four points: two binary inputs and one binary output. For the AND gate, the output is 1 only when both inputs are 1. For the OR gate, the output is 1 when at least one input is 1. Both problems are linearly separable, meaning a single straight line can separate the class-1 points from the class-0 points. Because of this, a single perceptron can learn both gates easily.
| x₁ | x₂ | AND | OR |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 0 | 1 | 0 | 1 |
| 1 | 0 | 0 | 1 |
| 1 | 1 | 1 | 1 |
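We can verify this with the classic perceptron learning rule, which nudges the weights toward every misclassified point. For linearly separable data like AND and OR, this procedure is guaranteed to converge. A minimal sketch, with a step activation and arbitrary learning-rate and epoch settings:

```python
def step(z):
    return 1 if z >= 0 else 0

def train_perceptron(data, lr=0.1, epochs=25):
    """Classic perceptron learning rule: for each error, nudge the weights
    toward the correct side of the boundary."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, target in data:
            y = step(w[0] * x[0] + w[1] * x[1] + b)
            err = target - y  # 0 if correct, +1 or -1 if wrong
            w[0] += lr * err * x[0]
            w[1] += lr * err * x[1]
            b += lr * err
    return w, b

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
OR  = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

for data in (AND, OR):
    w, b = train_perceptron(data)
    assert all(step(w[0] * x[0] + w[1] * x[1] + b) == t for x, t in data)
```

Both gates are learned within a handful of epochs, exactly as the linear-separability argument predicts.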
3. The XOR Problem: Where Single Neurons Fail
Now consider the XOR gate: the output is 1 when the two inputs are different.
| x₁ | x₂ | XOR |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
If we plot these four points on a 2D plane, the class-1 points lie at (0,1) and (1,0), which are diagonally opposite corners. The class-0 points lie at (0,0) and (1,1). Unlike the AND and OR gates, no single straight line can separate these two classes. This is not a matter of choosing better weights. It is mathematically impossible. A single neuron can only produce a linear decision boundary, but the XOR problem requires a nonlinear one. This limitation motivates the need for multiple neurons arranged in layers.
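The impossibility can be seen algebraically: a solution would require b < 0, w₁ + b ≥ 0, w₂ + b ≥ 0, and w₁ + w₂ + b < 0, but adding the middle two inequalities gives w₁ + w₂ + b ≥ −b > 0, a contradiction. A coarse grid search over weights (not a proof, just a sanity check) illustrates the same point:

```python
def solves_xor(w1, w2, b):
    """Check whether step(w1*x1 + w2*x2 + b) matches XOR on all four points."""
    step = lambda z: 1 if z >= 0 else 0
    xor = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
    return all(step(w1 * x1 + w2 * x2 + b) == t for (x1, x2), t in xor.items())

# Scan a coarse grid of weights from -5.0 to 5.0 in steps of 0.5.
grid = [i / 2 for i in range(-10, 11)]
found = any(solves_xor(w1, w2, b) for w1 in grid for w2 in grid for b in grid)
assert found is False  # no linear threshold unit gets all four points right
```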
4. The Solution: Hidden Layers
The key insight is simple: if one neuron can draw one line, two neurons can draw two lines, and another neuron can combine them. By stacking neurons into layers, we can carve the input space into increasingly complex decision regions.
A Multi-Layer Perceptron (MLP) adds one or more hidden layers between the input and the output:
Layer 1 (hidden):
\[\mathbf{h} = \sigma(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1)\]Layer 2 (output):
\[\hat{y} = \sigma(\mathbf{W}_2 \mathbf{h} + \mathbf{b}_2)\]Each hidden neuron creates its own linear boundary in the input space. The output neuron then combines these intermediate features into a nonlinear decision rule.
For XOR, we need only two hidden neurons. One hidden neuron can learn one diagonal separation, the second learns the other, and the output neuron combines them to correctly classify all four points. We will refer to this architecture as 2-2-1. The three numbers are the size of each layer in order, so two inputs, two hidden units, one output.
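One concrete 2-2-1 solution can be written down by hand. The weights below are hand-picked for illustration, not learned: the first hidden neuron approximates OR, the second approximates NAND, and the output neuron ANDs them together, since OR(x₁, x₂) AND NAND(x₁, x₂) equals XOR(x₁, x₂):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mlp_xor(x1, x2):
    """A 2-2-1 network with hand-picked weights (illustrative, not learned)."""
    h1 = sigmoid(10 * x1 + 10 * x2 - 5)    # ~OR: fires unless both inputs are 0
    h2 = sigmoid(-10 * x1 - 10 * x2 + 15)  # ~NAND: fires unless both inputs are 1
    y = sigmoid(10 * h1 + 10 * h2 - 15)    # ~AND of the two hidden features
    return 1 if y >= 0.5 else 0

assert [mlp_xor(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]] == [0, 1, 1, 0]
```

The large weight magnitudes (±10) push the sigmoids close to 0 or 1, so each neuron behaves almost like a hard logic gate.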
5. Network Architecture Playground
Now let us explore how network architecture affects what a model can learn. Choose a dataset, adjust the number of hidden layers and neurons, and watch how the decision boundary changes as the network trains. This makes it easier to see how deeper or wider networks can represent more complex patterns.
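Outside the interactive playground, the same experiment can be run in a few lines of numpy. This is a minimal sketch of training a 2-2-1 network on XOR with gradient descent; the gradient expressions are the chain-rule steps derived in the backpropagation section below, and the seed, learning rate, and iteration count are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# 2-2-1 architecture with small random weights (scale chosen arbitrarily)
W1, b1 = rng.normal(0, 1, (2, 2)), np.zeros((1, 2))
W2, b2 = rng.normal(0, 1, (2, 1)), np.zeros((1, 1))

losses, lr = [], 1.0
for _ in range(2000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)
    y_hat = sigmoid(h @ W2 + b2)
    losses.append(float(np.mean((y_hat - y) ** 2)))
    # Backward pass: chain rule written out by hand for squared-error loss
    d_out = (y_hat - y) * y_hat * (1 - y_hat) / len(X)
    d_hid = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_hid;  b1 -= lr * d_hid.sum(axis=0)

assert losses[-1] < losses[0]  # the loss decreased over training
```

Rerunning with different widths and depths is the code analogue of dragging the playground's sliders.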
6. Activation Functions Compared
The activation function determines the type of nonlinearity each neuron introduces. Without nonlinear activation functions, stacking layers would still produce only a linear model. The three most common choices are sigmoid, tanh, and ReLU.
Sigmoid
\[\sigma(z) = \frac{1}{1+e^{-z}}\]Maps outputs to the range (0, 1). It is smooth and interpretable as a probability, but its gradients become very small for large positive or negative inputs, which can slow learning.
Tanh
\[\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}\]Maps outputs to the range (-1, 1). Because it is zero-centered, optimization is often more stable than with sigmoid.
ReLU
\[f(z) = \max(0, z)\]Simple and computationally efficient. It avoids saturation for positive inputs and is the standard choice in modern deep learning. However, neurons can become inactive if they receive only negative inputs during training.
7. How Backpropagation Works (Intuition)
Training a neural network means finding weights that minimize the loss. We use gradient descent for this, but the challenge is computing the gradient of the loss with respect to a weight deep inside the network. The key idea is the chain rule. The loss depends on the output, the output depends on intermediate activations, and those activations depend on earlier weights. By applying the chain rule, we can trace how a small change in a weight affects the final loss:
\[\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w}\]Backpropagation computes these gradients layer by layer, starting from the output and moving backward through the network. Each layer receives a gradient from the next layer and passes it backward after scaling it by its local derivative. In this way, every weight in the network learns how it contributed to the final error and how it should change to reduce it.
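The chain-rule product above can be verified numerically. For a single neuron with a squared-error loss (arbitrary example values for the input, weight, and target), the analytic gradient should agree with a finite-difference estimate:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# One neuron, one training example, squared-error loss (values are arbitrary).
x, target = 2.0, 1.0
w, b = 0.3, -0.1

def loss(w):
    a = sigmoid(w * x + b)
    return (a - target) ** 2

# Analytic gradient via the chain rule: dL/dw = dL/da * da/dz * dz/dw
z = w * x + b
a = sigmoid(z)
dL_da = 2 * (a - target)
da_dz = a * (1 - a)   # derivative of the sigmoid
dz_dw = x
analytic = dL_da * da_dz * dz_dw

# Numerical gradient via central finite differences should agree closely
eps = 1e-6
numerical = (loss(w + eps) - loss(w - eps)) / (2 * eps)
assert abs(analytic - numerical) < 1e-6
```

Backpropagation is this same bookkeeping applied to every weight at once, reusing the shared factors layer by layer instead of recomputing them.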
Here, z is the pre-activation (weighted sum + bias), a is the activation (sigmoid of z), ŷ is the network's prediction, L is the loss, δ is the gradient of the loss with respect to z at that neuron, and η is the learning rate. We derive backprop fully in the Backpropagation Visualized guide.
8. Universal Approximation
One of the most important theoretical results in neural network research is the Universal Approximation Theorem: a neural network with a single hidden layer containing enough neurons can approximate any continuous function on a bounded domain to arbitrary accuracy. In other words, even a shallow network is expressive enough to represent very complex functions. The practical question, however, is not whether a network can represent a function, but how many neurons are needed and whether such a network can be trained efficiently in practice.
9. Summary
| Concept | Key Idea |
|---|---|
| Perceptron | A single neuron: weighted sum + activation. With sigmoid activation, it is equivalent to logistic regression. |
| Linear separability | A single perceptron can learn linearly separable patterns (AND, OR) but not XOR. |
| Multi-Layer Perceptron | Adding hidden layers enables nonlinear decision boundaries. |
| Backpropagation | Applies the chain rule layer by layer to compute gradients for all weights. |
| Activation functions | Sigmoid, tanh, and ReLU introduce nonlinearity with different training behavior. |
What’s next: In Backpropagation Visualized, we will explore how backpropagation works step by step and why deep networks can be difficult to train.
Continue the ML Series
This post is part of a bigger Interactive Machine Learning series. If you would like to learn more, check out the other posts in this series.