Activation functions are one of the most important design choices in neural networks. They determine what kinds of patterns a model can represent and how gradients flow during training. This chapter explains activation functions from the ground up, with interactive visualizations designed to build strong intuition.
In this guide, you will:
- See why activation functions are necessary for deep learning
- Explore the most widely used activation functions (Sigmoid, Tanh, ReLU, Leaky ReLU, ELU, Swish, GELU) together with their derivatives
- Understand saturation, vanishing gradients, and how modern activation functions help address these problems
1. Why Activation Functions?
Without activation functions, every layer in a neural network performs only a linear transformation: multiply by weights, then add a bias. The composition of linear functions is still linear:
\[f(\mathbf{x}) = W_2(W_1 \mathbf{x} + b_1) + b_2 = (W_2 W_1)\mathbf{x} + (W_2 b_1 + b_2) = W'\mathbf{x} + b'\]

No matter how many layers you stack, the entire network collapses to a single linear transformation. Adding a nonlinear activation function between layers prevents this collapse and gives depth its expressive power.
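A quick numerical check makes the collapse concrete. The sketch below uses arbitrary random weights and shapes (chosen only for illustration) to show that two stacked linear layers are reproduced exactly by a single weight matrix and bias:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)   # layer 1: 2 -> 3
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=2)   # layer 2: 3 -> 2
x = rng.normal(size=2)

two_layers = W2 @ (W1 @ x + b1) + b2                   # stacked linear layers
W_prime, b_prime = W2 @ W1, W2 @ b1 + b2               # the single equivalent layer
print(np.allclose(two_layers, W_prime @ x + b_prime))  # True: depth added nothing
```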
Each layer in a neural network computes a transformation of its input:

\[\mathbf{h} = W\mathbf{x} + \mathbf{b} = \begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} b_1 \\ b_2 \end{bmatrix}\]

This is a linear transformation: a combination of rotation, scaling, shearing, and translation. Composing multiple such layers still produces another linear transformation, equivalent to multiplying the matrices together. But when we insert ReLU between layers, it makes the transformation nonlinear by clipping all negative values to zero:
\[\text{ReLU}(\mathbf{h}) = \begin{bmatrix} \max(0,\, h_1) \\ \max(0,\, h_2) \end{bmatrix}\]

This clipping introduces nonlinearity. It folds parts of the space onto coordinate axes and creates bends that a single matrix transformation cannot reproduce. The visualization below shows this effect. A 2D grid of points is passed through several neural network layers. On the left, no activation function is applied, so each layer remains linear. On the right, ReLU is applied after every layer, progressively bending the grid and creating more complex structure.
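If you want to recreate the grid experiment outside the interactive widget, a minimal NumPy sketch looks like this; the depth, layer width, and random weights are arbitrary choices. Plotting `linear_grid` and `bent_grid` side by side reproduces the left/right comparison described above.

```python
import numpy as np

rng = np.random.default_rng(1)

# A regular 2-D grid of points, flattened to an (N, 2) array.
xs, ys = np.meshgrid(np.linspace(-1, 1, 11), np.linspace(-1, 1, 11))
grid = np.stack([xs.ravel(), ys.ravel()], axis=1)

# Three random layers (depth and weights are illustration choices).
layers = [(rng.normal(size=(2, 2)), rng.normal(size=2)) for _ in range(3)]

def forward(points, use_relu):
    h = points
    for W, b in layers:
        h = h @ W.T + b                  # linear layer
        if use_relu:
            h = np.maximum(h, 0.0)       # ReLU folds the plane onto the axes
    return h

linear_grid = forward(grid, use_relu=False)   # still a (sheared, shifted) lattice
bent_grid = forward(grid, use_relu=True)      # progressively folded and bent
```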
2. Activation Function Explorer
The choice of activation function has a major impact on both training dynamics and final model performance. This explorer lets you visualize several of the most widely used activation functions together with their derivatives. The derivative is especially important because it determines how strongly weights are updated during backpropagation. If the derivative becomes very small (vanishing gradients) or exactly zero, learning can slow down or even stop.
\[\text{Sigmoid: } \sigma(x) = \frac{1}{1+e^{-x}} \qquad \sigma'(x) = \sigma(x)(1 - \sigma(x))\]

\[\text{Tanh: } \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} \qquad \tanh'(x) = 1 - \tanh^2(x)\]

\[\text{ReLU: } f(x) = \max(0, x) \qquad f'(x) = \begin{cases} 1 & x > 0 \\ 0 & x \leq 0 \end{cases}\]

\[\text{Leaky ReLU: } f(x) = \max(\alpha x, x) \qquad f'(x) = \begin{cases} 1 & x > 0 \\ \alpha & x \leq 0 \end{cases}\]

\[\text{ELU: } f(x) = \begin{cases} x & x > 0 \\ \alpha(e^x - 1) & x \leq 0 \end{cases}\]

\[\text{Swish: } f(x) = x \cdot \sigma(x)\]

\[\text{GELU: } f(x) = x \cdot \Phi(x) \approx 0.5x\left(1 + \tanh\left(\sqrt{\frac{2}{\pi}}(x + 0.044715x^3)\right)\right)\]
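These formulas translate directly into NumPy if you want to recreate the explorer's curves offline. A minimal sketch follows; the \(\alpha\) defaults (0.01 for Leaky ReLU, 1.0 for ELU) and the use of the tanh approximation for GELU are assumptions, and derivative helpers are included only where a closed form is given above.

```python
import numpy as np

def sigmoid(x):   return 1.0 / (1.0 + np.exp(-x))
def d_sigmoid(x): s = sigmoid(x); return s * (1.0 - s)

def tanh(x):      return np.tanh(x)
def d_tanh(x):    return 1.0 - np.tanh(x) ** 2

def relu(x):      return np.maximum(x, 0.0)
def d_relu(x):    return (x > 0).astype(float)

def leaky_relu(x, alpha=0.01):   return np.where(x > 0, x, alpha * x)
def d_leaky_relu(x, alpha=0.01): return np.where(x > 0, 1.0, alpha)

def elu(x, alpha=1.0): return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def swish(x):     return x * sigmoid(x)

def gelu(x):      # tanh approximation from the formula above
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))
```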
3. Sigmoid & Tanh Deep Dive
Sigmoid and tanh were widely used in early neural networks. They are smooth and differentiable everywhere, but they share an important limitation: saturation. When the input is very large in either the positive or negative direction, the output flattens and the gradient approaches zero (Glorot & Bengio, 2010). This leads to the vanishing gradient problem: during backpropagation, gradients are multiplied across layers, so if each layer contributes a small factor, the overall gradient shrinks exponentially. As a result, early layers receive almost no learning signal and update very slowly.
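You can see the saturation numerically by evaluating the sigmoid derivative \(\sigma'(x) = \sigma(x)(1-\sigma(x))\) at a few probe points (the points themselves are arbitrary):

```python
import numpy as np

def d_sigmoid(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"sigmoid'({x:4.1f}) = {d_sigmoid(x):.6f}")
# sigmoid'( 0.0) = 0.250000
# sigmoid'( 2.0) = 0.104994
# sigmoid'( 5.0) = 0.006648
# sigmoid'(10.0) = 0.000045
```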
4. ReLU Family
ReLU (Rectified Linear Unit) addresses the vanishing gradient problem with a simple rule: output zero for negative inputs and pass positive inputs through unchanged (Nair & Hinton, 2010). This matters because during backpropagation gradients are multiplied at each layer by the local activation derivative. For sigmoid, if the input is very positive, the output is close to 1 and the derivative is close to 0; if the input is very negative, the output is close to 0 and the derivative is again close to 0. The maximum derivative of sigmoid is 0.25 at x = 0, so stacking five sigmoid layers multiplies the gradient by at most \(0.25^5 \approx 0.001\), effectively stalling learning in early layers (Glorot & Bengio, 2010):
\[\frac{\partial L}{\partial w_1} = \underbrace{\sigma'(z_5) \cdot \sigma'(z_4) \cdot \sigma'(z_3) \cdot \sigma'(z_2) \cdot \sigma'(z_1)}_{\text{each} \leq 0.25 \implies \text{product} \leq 0.001} \cdot \ldots\]

In contrast, the derivative of ReLU is either 0 or 1, so gradients along active paths can pass through many layers without shrinking. The bar chart below shows the average gradient magnitude per layer in a deep MLP. Increase the number of layers with sigmoid selected and the early-layer gradients quickly disappear; switch to ReLU and they remain much larger.
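The bar-chart experiment can be approximated in a few lines of NumPy. This is only a sketch: the depth, width, initialization scale, and the unit upstream gradient are assumptions, but the qualitative gap between sigmoid and ReLU shows up clearly.

```python
import numpy as np

def layer_gradients(act, d_act, depth=10, width=32, seed=0):
    rng = np.random.default_rng(seed)
    Ws = [rng.normal(scale=1.0 / np.sqrt(width), size=(width, width))
          for _ in range(depth)]

    # Forward pass: keep every pre-activation z for the backward pass.
    h, zs = rng.normal(size=width), []
    for W in Ws:
        z = W @ h
        zs.append(z)
        h = act(z)

    # Backward pass: start from a unit gradient at the output and walk back,
    # multiplying by the local activation derivative at every layer.
    grad, mags = np.ones(width), []
    for W, z in zip(reversed(Ws), reversed(zs)):
        grad = (grad * d_act(z)) @ W
        mags.append(np.abs(grad).mean())
    return mags[::-1]                      # mags[0] = earliest layer

sig = lambda x: 1.0 / (1.0 + np.exp(-x))
sigmoid_mags = layer_gradients(sig, lambda x: sig(x) * (1 - sig(x)))
relu_mags = layer_gradients(lambda x: np.maximum(x, 0.0),
                            lambda x: (x > 0).astype(float))
# Early-layer gradient: vanishingly small with sigmoid, much larger with ReLU.
print(sigmoid_mags[0], relu_mags[0])
```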
But ReLU has its own limitation: dead neurons. If a neuron’s input becomes consistently negative (for example due to initialization or a large gradient update), its output remains 0 and its gradient also remains 0, so the neuron stops learning and may never recover. Several variants in the ReLU family address this problem by allowing a small, nonzero gradient even when the input is negative: Leaky ReLU keeps a small slope \(\alpha\) on the negative side, and ELU replaces the hard zero with a smooth exponential curve that saturates at \(-\alpha\).
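A tiny numeric example of the failure mode, with made-up numbers: once the pre-activation is negative for every input the neuron sees, ReLU's local derivative is zero everywhere, while Leaky ReLU still passes a small signal back.

```python
import numpy as np

x = np.array([0.5, 1.2, 2.0, 0.8])        # all-positive toy inputs
w, b = 1.0, -5.0                          # a bad update pushed the bias very negative
z = w * x + b                             # pre-activations: all negative

relu_grad = (z > 0).astype(float)         # local derivative of ReLU at z
leaky_grad = np.where(z > 0, 1.0, 0.01)   # local derivative of Leaky ReLU (alpha = 0.01)

print(relu_grad)    # [0. 0. 0. 0.]        -> no weight update, the neuron is dead
print(leaky_grad)   # [0.01 0.01 0.01 0.01] -> small but nonzero learning signal
```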
5. Modern Activations: Swish & GELU
Modern architectures such as EfficientNet, BERT, and GPT often use smoother activation functions that are not strictly monotonic and allow small negative values to pass through.
Swish is defined as \(f(x) = x \cdot \sigma(x)\) (Ramachandran et al., 2017). It is smooth, non-monotonic, and self-gated, meaning the input modulates its own output through the sigmoid term.
GELU (Gaussian Error Linear Unit), \(f(x) = x \cdot \Phi(x)\), gates the input by \(\Phi(x)\), the cumulative distribution function of the standard normal distribution (Hendrycks & Gimpel, 2016). It is the default activation in Transformer models such as BERT and GPT.
Both functions behave similarly to ReLU for large positive inputs but transition smoothly near zero and allow small negative outputs. This helps optimization by preserving nonzero gradients for mildly negative inputs, unlike ReLU, which outputs exactly zero in that region.
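The difference is easy to see by probing a few inputs. The probe points below are arbitrary, GELU uses the tanh approximation given above, and the printed values are rounded.

```python
import numpy as np

def relu(x):  return np.maximum(x, 0.0)
def swish(x): return x / (1.0 + np.exp(-x))
def gelu(x):  return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

x = np.array([-2.0, -1.0, -0.5, 0.5, 2.0])
print(relu(x))    # [ 0.      0.      0.      0.5     2.    ]  hard zero for x < 0
print(swish(x))   # [-0.238  -0.269  -0.189   0.311   1.762 ]  small negative outputs
print(gelu(x))    # [-0.045  -0.159  -0.154   0.346   1.955 ]  small negative outputs
```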
6. Summary
Activation Function Cheat Sheet
| Function | Range | Pros | Cons | Use When |
|---|---|---|---|---|
| Sigmoid | (0, 1) | Smooth, probabilistic | Vanishing gradients, not zero-centered | Output layer for binary classification |
| Tanh | (-1, 1) | Zero-centered, stronger gradients than sigmoid | Still saturates | RNNs, hidden layers (legacy) |
| ReLU | [0, ∞) | Fast, no saturation for positive inputs | Dead neurons | Default for most hidden layers |
| Leaky ReLU | (-∞, ∞) | No dead neurons | Extra hyperparameter | When dead neurons are a problem |
| ELU | (-α, ∞) | Smooth, zero-centered outputs | Exp computation is slower | When mean activation near zero matters |
| Swish | ~(-0.28, ∞) | Smooth, self-gated | Slightly more compute | EfficientNet, deep CNNs |
| GELU | ~(-0.17, ∞) | Smooth, stochastic regularization effect | Slightly more compute | Transformers (BERT, GPT) |
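If you work in PyTorch, every entry in the table is available as an `nn` module (Swish is exposed as `nn.SiLU`). Here is a small sketch of picking an activation by name when assembling a model; the dictionary and the 128-unit layer are just illustrative choices.

```python
import torch.nn as nn

ACTIVATIONS = {
    "sigmoid":    nn.Sigmoid(),
    "tanh":       nn.Tanh(),
    "relu":       nn.ReLU(),
    "leaky_relu": nn.LeakyReLU(negative_slope=0.01),
    "elu":        nn.ELU(alpha=1.0),
    "swish":      nn.SiLU(),   # PyTorch's name for Swish with beta = 1
    "gelu":       nn.GELU(),
}

# Example: one hidden block using the activation chosen from the cheat sheet.
hidden = nn.Sequential(nn.Linear(128, 128), ACTIVATIONS["gelu"])
```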
Continue the ML Series
This post is part of a bigger Interactive Machine Learning series. If you would like to learn more, check out the other posts in this series.
References
- Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. AISTATS. http://proceedings.mlr.press/v9/glorot10a.html
- Nair, V., & Hinton, G. E. (2010). Rectified Linear Units Improve Restricted Boltzmann Machines. ICML. https://www.cs.toronto.edu/~hinton/absps/reluICML.pdf
- Ramachandran, P., Zoph, B., & Le, Q. V. (2017). Searching for Activation Functions. arXiv:1710.05941. https://arxiv.org/abs/1710.05941
- Hendrycks, D., & Gimpel, K. (2016). Gaussian Error Linear Units (GELUs). arXiv:1606.08415. https://arxiv.org/abs/1606.08415