Activation functions are one of the most important design choices in neural networks. They determine what kinds of patterns a model can represent and how gradients flow during training. This chapter explains activation functions from the ground up, with interactive visualizations designed to build strong intuition.
In this guide, you will:
- See why activation functions are necessary for deep learning
- Explore the most widely used activation functions (Sigmoid, Tanh, ReLU, Leaky ReLU, ELU, Swish, GELU) together with their derivatives
- Understand saturation, vanishing gradients, and how modern activation functions help address these problems
1. Why Activation Functions?
Without activation functions, every layer in a neural network performs only a linear transformation: multiply by weights, then add a bias. The composition of linear functions is still linear:
\[f(\mathbf{x}) = W_2(W_1 \mathbf{x} + b_1) + b_2 = (W_2 W_1)\mathbf{x} + (W_2 b_1 + b_2) = W'\mathbf{x} + b'\]

No matter how many layers you stack, the entire network collapses to a single linear transformation. Adding a nonlinear activation function between layers prevents this collapse and gives depth its expressive power.
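A quick numerical check makes the collapse concrete. The sketch below uses arbitrary random weights and shapes (chosen only for illustration) to show that two stacked linear layers are reproduced exactly by a single weight matrix and bias:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)   # layer 1: 2 -> 3
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=2)   # layer 2: 3 -> 2
x = rng.normal(size=2)

two_layers = W2 @ (W1 @ x + b1) + b2                   # stacked linear layers
W_prime, b_prime = W2 @ W1, W2 @ b1 + b2               # the single equivalent layer
print(np.allclose(two_layers, W_prime @ x + b_prime))  # True: depth added nothing
```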
Each layer in a neural network computes a transformation of its input:

\[\mathbf{h} = W\mathbf{x} + \mathbf{b} = \begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} b_1 \\ b_2 \end{bmatrix}\]

This is a linear transformation: a combination of rotation, scaling, shearing, and translation. Composing multiple such layers still produces another linear transformation, equivalent to multiplying the matrices together. But when we insert ReLU between layers, it makes the transformation nonlinear by clipping all negative values to zero:
\[\text{ReLU}(\mathbf{h}) = \begin{bmatrix} \max(0,\, h_1) \\ \max(0,\, h_2) \end{bmatrix}\]

This clipping introduces nonlinearity. It folds parts of the space onto coordinate axes and creates bends that a single matrix transformation cannot reproduce. The visualization below shows this effect. A 2D grid of points is passed through several neural network layers. On the left, no activation function is applied, so each layer remains linear. On the right, ReLU is applied after every layer, progressively bending the grid and creating more complex structure.
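If you want to recreate the grid experiment outside the interactive widget, a minimal NumPy sketch looks like this; the depth, layer width, and random weights are arbitrary choices. Plotting `linear_grid` and `bent_grid` side by side reproduces the left/right comparison described above.

```python
import numpy as np

rng = np.random.default_rng(1)

# A regular 2-D grid of points, flattened to an (N, 2) array.
xs, ys = np.meshgrid(np.linspace(-1, 1, 11), np.linspace(-1, 1, 11))
grid = np.stack([xs.ravel(), ys.ravel()], axis=1)

# Three random layers (depth and weights are illustration choices).
layers = [(rng.normal(size=(2, 2)), rng.normal(size=2)) for _ in range(3)]

def forward(points, use_relu):
    h = points
    for W, b in layers:
        h = h @ W.T + b                  # linear layer
        if use_relu:
            h = np.maximum(h, 0.0)       # ReLU folds the plane onto the axes
    return h

linear_grid = forward(grid, use_relu=False)   # still a (sheared, shifted) lattice
bent_grid = forward(grid, use_relu=True)      # progressively folded and bent
```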
2. Activation Function Explorer
The choice of activation function has a major impact on both training dynamics and final model performance. This explorer lets you visualize several of the most widely used activation functions together with their derivatives. The derivative is especially important because it determines how strongly weights are updated during backpropagation. If the derivative becomes very small (vanishing gradients) or exactly zero, learning can slow down or even stop.
\[\text{Sigmoid: } \sigma(x) = \frac{1}{1+e^{-x}} \qquad \sigma'(x) = \sigma(x)(1 - \sigma(x))\]

\[\text{Tanh: } \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} \qquad \tanh'(x) = 1 - \tanh^2(x)\]

\[\text{ReLU: } f(x) = \max(0, x) \qquad f'(x) = \begin{cases} 1 & x > 0 \\ 0 & x \leq 0 \end{cases}\]

\[\text{Leaky ReLU: } f(x) = \max(\alpha x, x) \qquad f'(x) = \begin{cases} 1 & x > 0 \\ \alpha & x \leq 0 \end{cases}\]

\[\text{ELU: } f(x) = \begin{cases} x & x > 0 \\ \alpha(e^x - 1) & x \leq 0 \end{cases}\]

\[\text{Swish: } f(x) = x \cdot \sigma(x)\]

\[\text{GELU: } f(x) = x \cdot \Phi(x) \approx 0.5x\left(1 + \tanh\left(\sqrt{\frac{2}{\pi}}(x + 0.044715x^3)\right)\right)\]
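These formulas translate directly into NumPy if you want to recreate the explorer's curves offline. A minimal sketch follows; the \(\alpha\) defaults (0.01 for Leaky ReLU, 1.0 for ELU) and the use of the tanh approximation for GELU are assumptions, and derivative helpers are included only where a closed form is given above.

```python
import numpy as np

def sigmoid(x):   return 1.0 / (1.0 + np.exp(-x))
def d_sigmoid(x): s = sigmoid(x); return s * (1.0 - s)

def tanh(x):      return np.tanh(x)
def d_tanh(x):    return 1.0 - np.tanh(x) ** 2

def relu(x):      return np.maximum(x, 0.0)
def d_relu(x):    return (x > 0).astype(float)

def leaky_relu(x, alpha=0.01):   return np.where(x > 0, x, alpha * x)
def d_leaky_relu(x, alpha=0.01): return np.where(x > 0, 1.0, alpha)

def elu(x, alpha=1.0): return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def swish(x):     return x * sigmoid(x)

def gelu(x):      # tanh approximation from the formula above
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))
```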
3. Sigmoid & Tanh Deep Dive
Sigmoid and tanh were widely used in early neural networks. They are smooth and differentiable everywhere, but they share an important limitation: saturation. When the input is very large in either the positive or negative direction, the output flattens and the gradient approaches zero (Glorot & Bengio, 2010). This leads to the vanishing gradient problem: during backpropagation, gradients are multiplied across layers, so if each layer contributes a small factor, the overall gradient shrinks exponentially. As a result, early layers receive almost no learning signal and update very slowly.
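You can see the saturation numerically by evaluating the sigmoid derivative \(\sigma'(x) = \sigma(x)(1-\sigma(x))\) at a few probe points (the points themselves are arbitrary):

```python
import numpy as np

def d_sigmoid(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"sigmoid'({x:4.1f}) = {d_sigmoid(x):.6f}")
# sigmoid'( 0.0) = 0.250000
# sigmoid'( 2.0) = 0.104994
# sigmoid'( 5.0) = 0.006648
# sigmoid'(10.0) = 0.000045
```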
4. ReLU Family
ReLU (Rectified Linear Unit) addresses the vanishing gradient problem with a simple rule: output zero for negative inputs and pass positive inputs through unchanged (Nair & Hinton, 2010). This matters because during backpropagation gradients are multiplied at each layer by the local activation derivative. For sigmoid, if the input is very positive, the output is close to 1 and the derivative is close to 0; if the input is very negative, the output is close to 0 and the derivative is again close to 0. The maximum derivative of sigmoid is 0.25 at x = 0, so stacking five sigmoid layers multiplies the gradient by at most \(0.25^5 \approx 0.001\), effectively stalling learning in early layers (Glorot & Bengio, 2010):
\[\frac{\partial L}{\partial w_1} = \underbrace{\sigma'(z_5) \cdot \sigma'(z_4) \cdot \sigma'(z_3) \cdot \sigma'(z_2) \cdot \sigma'(z_1)}_{\text{each} \leq 0.25 \implies \text{product} \leq 0.001} \cdot \ldots\]

In contrast, the derivative of ReLU is either 0 or 1, so gradients along active paths can pass through many layers without shrinking. The bar chart below shows the average gradient magnitude per layer in a deep MLP. Increase the number of layers with sigmoid selected and the early-layer gradients quickly disappear; switch to ReLU and they remain much larger.
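The bar-chart experiment can be approximated in a few lines of NumPy. This is only a sketch: the depth, width, initialization scale, and the unit upstream gradient are assumptions, but the qualitative gap between sigmoid and ReLU shows up clearly.

```python
import numpy as np

def layer_gradients(act, d_act, depth=10, width=32, seed=0):
    rng = np.random.default_rng(seed)
    Ws = [rng.normal(scale=1.0 / np.sqrt(width), size=(width, width))
          for _ in range(depth)]

    # Forward pass: keep every pre-activation z for the backward pass.
    h, zs = rng.normal(size=width), []
    for W in Ws:
        z = W @ h
        zs.append(z)
        h = act(z)

    # Backward pass: start from a unit gradient at the output and walk back,
    # multiplying by the local activation derivative at every layer.
    grad, mags = np.ones(width), []
    for W, z in zip(reversed(Ws), reversed(zs)):
        grad = (grad * d_act(z)) @ W
        mags.append(np.abs(grad).mean())
    return mags[::-1]                      # mags[0] = earliest layer

sig = lambda x: 1.0 / (1.0 + np.exp(-x))
sigmoid_mags = layer_gradients(sig, lambda x: sig(x) * (1 - sig(x)))
relu_mags = layer_gradients(lambda x: np.maximum(x, 0.0),
                            lambda x: (x > 0).astype(float))
# Early-layer gradient: vanishingly small with sigmoid, much larger with ReLU.
print(sigmoid_mags[0], relu_mags[0])
```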
But ReLU has its own limitation: dead neurons. If a neuron’s input becomes consistently negative (for example due to initialization or a large gradient update), its output remains 0 and its gradient also remains 0, so the neuron stops learning and may never recover. Several variants in the ReLU family address this problem by allowing a small, nonzero gradient even when the input is negative: Leaky ReLU keeps a small slope \(\alpha\) on the negative side, and ELU replaces the hard zero with a smooth exponential curve that saturates at \(-\alpha\).
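A tiny numeric example of the failure mode, with made-up numbers: once the pre-activation is negative for every input the neuron sees, ReLU's local derivative is zero everywhere, while Leaky ReLU still passes a small signal back.

```python
import numpy as np

x = np.array([0.5, 1.2, 2.0, 0.8])        # all-positive toy inputs
w, b = 1.0, -5.0                          # a bad update pushed the bias very negative
z = w * x + b                             # pre-activations: all negative

relu_grad = (z > 0).astype(float)         # local derivative of ReLU at z
leaky_grad = np.where(z > 0, 1.0, 0.01)   # local derivative of Leaky ReLU (alpha = 0.01)

print(relu_grad)    # [0. 0. 0. 0.]        -> no weight update, the neuron is dead
print(leaky_grad)   # [0.01 0.01 0.01 0.01] -> small but nonzero learning signal
```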
5. Modern Activations: Swish & GELU
Modern architectures such as EfficientNet, BERT, and GPT often use smoother activation functions that are not strictly monotonic and allow small negative values to pass through.
Swish is defined as \(f(x) = x \cdot \sigma(x)\) (Ramachandran et al., 2017). It is smooth, non-monotonic, and self-gated, meaning the input modulates its own output through the sigmoid term.
GELU (Gaussian Error Linear Unit), \(f(x) = x \cdot \Phi(x)\), gates the input by \(\Phi(x)\), the cumulative distribution function of the standard normal distribution (Hendrycks & Gimpel, 2016). It is the default activation in Transformer models such as BERT and GPT.
Both functions behave similarly to ReLU for large positive inputs but transition smoothly near zero and allow small negative outputs. This helps optimization by preserving nonzero gradients for mildly negative inputs, unlike ReLU, which outputs exactly zero in that region.
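The difference is easy to see by probing a few inputs. The probe points below are arbitrary, GELU uses the tanh approximation given above, and the printed values are rounded.

```python
import numpy as np

def relu(x):  return np.maximum(x, 0.0)
def swish(x): return x / (1.0 + np.exp(-x))
def gelu(x):  return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

x = np.array([-2.0, -1.0, -0.5, 0.5, 2.0])
print(relu(x))    # [ 0.      0.      0.      0.5     2.    ]  hard zero for x < 0
print(swish(x))   # [-0.238  -0.269  -0.189   0.311   1.762 ]  small negative outputs
print(gelu(x))    # [-0.045  -0.159  -0.154   0.346   1.955 ]  small negative outputs
```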
6. Summary
Activation Function Cheat Sheet
| Function | Range | Pros | Cons | Use When |
|---|---|---|---|---|
| Sigmoid | (0, 1) | Smooth, probabilistic | Vanishing gradients, not zero-centered | Output layer for binary classification |
| Tanh | (-1, 1) | Zero-centered, stronger gradients than sigmoid | Still saturates | RNNs, hidden layers (legacy) |
| ReLU | [0, ∞) | Fast, no saturation for positive inputs | Dead neurons | Default for most hidden layers |
| Leaky ReLU | (-∞, ∞) | No dead neurons | Extra hyperparameter | When dead neurons are a problem |
| ELU | (-α, ∞) | Smooth, zero-centered outputs | Exp computation is slower | When mean activation near zero matters |
| Swish | ~(-0.28, ∞) | Smooth, self-gated | Slightly more compute | EfficientNet, deep CNNs |
| GELU | ~(-0.17, ∞) | Smooth, stochastic regularization effect | Slightly more compute | Transformers (BERT, GPT) |
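If you work in PyTorch, every entry in the table is available as an `nn` module (Swish is exposed as `nn.SiLU`). Here is a small sketch of picking an activation by name when assembling a model; the dictionary and the 128-unit layer are just illustrative choices.

```python
import torch.nn as nn

ACTIVATIONS = {
    "sigmoid":    nn.Sigmoid(),
    "tanh":       nn.Tanh(),
    "relu":       nn.ReLU(),
    "leaky_relu": nn.LeakyReLU(negative_slope=0.01),
    "elu":        nn.ELU(alpha=1.0),
    "swish":      nn.SiLU(),   # PyTorch's name for Swish with beta = 1
    "gelu":       nn.GELU(),
}

# Example: one hidden block using the activation chosen from the cheat sheet.
hidden = nn.Sequential(nn.Linear(128, 128), ACTIVATIONS["gelu"])
```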
Continue the ML Series
This post is part of a bigger Interactive Machine Learning series. If you would like to learn more, check out the other posts in this series.
References
- Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. AISTATS. http://proceedings.mlr.press/v9/glorot10a.html
- Nair, V., & Hinton, G. E. (2010). Rectified Linear Units Improve Restricted Boltzmann Machines. ICML. https://www.cs.toronto.edu/~hinton/absps/reluICML.pdf
- Ramachandran, P., Zoph, B., & Le, Q. V. (2017). Searching for Activation Functions. arXiv:1710.05941. https://arxiv.org/abs/1710.05941
- Hendrycks, D., & Gimpel, K. (2016). Gaussian Error Linear Units (GELUs). arXiv:1606.08415. https://arxiv.org/abs/1606.08415