In our previous chapters on linear regression and logistic regression, we used gradient descent to find the optimal parameters for our model. We treated it as a black box: compute gradients, multiply by learning rate, update weights. But in practice, the choice of optimizer can make or break your model’s training, and a poorly tuned optimizer might never converge, oscillate wildly, or get stuck in saddle points. In this chapter, we will dive deep into gradient descent and explore the family of optimizers that power modern deep learning, from vanilla SGD all the way to Adam, with an interactive demo for every concept so you can build real intuition.
In this guide, you will:
- See how learning rate, momentum, and per-parameter scaling each shape the path an optimizer takes
- Compare SGD, Momentum, RMSProp, and Adam side by side on the same loss surface
- Build intuition for why Adam works so well in practice and where SGD with momentum still wins
1. The Core Idea: Follow the Slope Downhill
All gradient-based optimizers share the same fundamental principle: compute the gradient of the loss with respect to the parameters, then update the parameters in the direction that decreases the loss. Think of it as standing on a hilly landscape in dense fog: you cannot see the valley, but you can feel the slope under your feet, so you take a step in the steepest downhill direction, feel the slope again, and repeat. The question is: how big should each step be, and should we remember anything about previous steps?
The general update rule is:
\[\theta_{t+1} = \theta_t - \alpha \nabla_\theta J(\theta_t)\]where \(\alpha\) is the learning rate, \(\nabla_\theta J(\theta_t)\) is the gradient of the loss, and \(\theta\) represents our parameters. Before we jump into 2D contour plots, let’s build intuition with a simple 1D example. Consider the loss function \(J(\theta) = \theta^2\), whose gradient is just the slope \(\nabla J = 2\theta\), so each step moves \(\theta\) by \(\alpha \times \text{slope}\). Click on the curve to set a starting point, then step through gradient descent one update at a time and watch how the tangent line determines both the step direction and size.
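If you want to trace the same arithmetic outside the demo, here is a minimal Python sketch of this 1D example (the starting point and learning rate below are illustrative choices, not the demo's values):

```python
# Gradient descent on J(theta) = theta^2, whose gradient is 2 * theta.
def grad(theta):
    return 2 * theta

theta = 4.0   # starting point (plays the role of clicking on the curve)
alpha = 0.1   # learning rate

for step in range(10):
    theta = theta - alpha * grad(theta)   # theta_{t+1} = theta_t - alpha * slope
    print(f"step {step + 1:2d}: theta = {theta:.4f}, loss = {theta**2:.4f}")
```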
2. Vanilla Gradient Descent (Batch GD)
The simplest optimizer computes the gradient over the entire dataset and takes one step:
\[\theta := \theta - \alpha \nabla_\theta J(\theta)\]This is called Batch Gradient Descent because it uses the full training dataset for every update. Since each step is based on the exact gradient over all training examples, the path is smooth and deterministic, and a fixed dataset, model, starting point, and learning rate will trace the same path each time. The downside is speed: large datasets force you to compute the gradient over every example before taking even a single step. Click anywhere on the contour plot below to set a starting point, then watch batch gradient descent move toward the minimum, and adjust the learning rate to see how it affects convergence.
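As a rough sketch of what the contour demo is doing under the hood, here is batch gradient descent on an elongated quadratic bowl; the surface \(J(x, y) = x^2 + 10y^2\), start point, and learning rate are illustrative assumptions rather than the demo's exact settings:

```python
import numpy as np

def grad(p):                        # exact gradient of J(x, y) = x^2 + 10*y^2 (stands in for the full-dataset gradient)
    x, y = p
    return np.array([2 * x, 20 * y])

p = np.array([-4.0, 2.0])           # starting point
alpha = 0.04                        # learning rate

for _ in range(50):
    p = p - alpha * grad(p)         # one deterministic full-batch update per step

print(p)   # the same surface, start point, and learning rate always give the same endpoint
```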
3. The Learning Rate Playground
The learning rate \(\alpha\) is arguably the single most important hyperparameter. Here is why:
- Too small (\(\alpha = 0.0001\)): steps are tiny, training takes forever, and you might run out of patience or compute budget before reaching the minimum.
- Just right (\(\alpha = 0.003\)): smooth, steady convergence to the minimum in a reasonable number of steps.
- Too large (\(\alpha = 0.02\)): steps overshoot the minimum, the optimizer bounces back and forth, and may even diverge, moving farther and farther from the solution.
The canvases below show the same surface and the same starting point at three different learning rates, so you can see how dramatically the behavior changes from one setting to the next. Use the scale slider to adjust all three learning rates simultaneously and find the sweet spot for this problem. Note that the optimal learning rate can vary widely across different problems, so experimentation is key!
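If you want to reproduce the three regimes in code, you can run the 1D example \(J(\theta) = \theta^2\) at a few different learning rates (the specific rates below are illustrative, not the demo's):

```python
def run_gd(alpha, theta=4.0, steps=100):
    for _ in range(steps):
        theta = theta - alpha * 2 * theta     # gradient of theta^2 is 2 * theta
    return theta

for alpha in (0.001, 0.1, 1.2):
    print(f"alpha = {alpha}: theta after 100 steps = {run_gd(alpha):.6g}")

# alpha = 0.001 barely moves, alpha = 0.1 converges, alpha = 1.2 diverges:
# each update multiplies theta by (1 - 2 * alpha), so any alpha above 1.0 blows up here.
```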
4. Stochastic Gradient Descent (SGD)
Batch GD computes the gradient over the entire dataset before making a single step. When datasets are large (millions of samples), this is extremely slow. Stochastic Gradient Descent fixes this by using a single random sample per update:
\[\theta := \theta - \alpha \nabla_\theta J(\theta;\; x^{(i)}, y^{(i)})\]The gradient from a single sample is a noisy estimate of the true gradient, which makes the path zigzag, but the noise has a surprising benefit: it can help the optimizer escape shallow local minima and explore more of the loss surface. The demo below trains a real linear model \(\hat{y} = wx + b\) on a small noisy dataset and visualizes the path of both optimizers on the actual MSE loss surface in \((w, b)\) parameter space. Each tick is one epoch of compute: Batch GD performs one update using the gradient averaged over all \(N\) samples, while SGD performs \(N\) updates each based on a single random sample, so both consume the same total gradient computations per tick. SGD’s path zigzags because each step is based on a single example, but it covers far more ground per unit compute, which is why SGD often wins in wall-clock time on large datasets.
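Here is a rough sketch of the kind of comparison the demo runs, assuming a small synthetic dataset with made-up true parameters; per tick, batch GD makes one averaged update while SGD makes \(N\) single-sample updates:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
x = rng.uniform(-2, 2, N)
y = 3.0 * x + 1.0 + rng.normal(0, 0.5, N)   # synthetic data around an assumed true w = 3, b = 1

def grad_mse(w, b, xi, yi):
    err = (w * xi + b) - yi                  # per-sample squared error: d/dw = 2*err*x, d/db = 2*err
    return 2 * err * xi, 2 * err

alpha = 0.05
w_batch = b_batch = w_sgd = b_sgd = 0.0

for epoch in range(20):                      # one "tick" of equal compute per optimizer
    # Batch GD: one update from the gradient averaged over all N samples
    gw, gb = np.mean([grad_mse(w_batch, b_batch, xi, yi) for xi, yi in zip(x, y)], axis=0)
    w_batch -= alpha * gw
    b_batch -= alpha * gb

    # SGD: N updates, each from a single sample, in shuffled order
    for i in rng.permutation(N):
        gw, gb = grad_mse(w_sgd, b_sgd, x[i], y[i])
        w_sgd -= alpha * gw
        b_sgd -= alpha * gb

print("batch GD:", round(w_batch, 3), round(b_batch, 3))
print("SGD:     ", round(w_sgd, 3), round(b_sgd, 3))
```

After the same number of ticks, SGD has made many more parameter updates, which is the compute trade-off described above.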
Building Block: Exponential Moving Averages
Before diving into Momentum, RMSProp, and Adam, let’s understand the mathematical primitive they all share: the exponential moving average (EMA). Given a noisy sequence of values \(g_1, g_2, \ldots\) (think: gradients at each training step), the EMA produces a smoothed version:
\[\bar{g}_t = \beta \, \bar{g}_{t-1} + (1 - \beta) \, g_t\]The hyperparameter \(\beta\) controls how much history to retain. A higher \(\beta\) means heavier smoothing (the average “remembers” roughly the last \(\frac{1}{1 - \beta}\) values). This is exactly the operation inside Momentum (smoothing gradients), RMSProp (smoothing squared gradients), and Adam (both). Drag the \(\beta\) slider below to see how EMA transforms a noisy gradient signal into a smooth trend. This is the core idea behind all the optimizers we’ll cover next. By adjusting \(\beta\), you can see how the smoothed signal becomes more or less responsive to recent changes in the raw gradient.
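In code, the EMA recurrence is a single line; this small sketch smooths a synthetic noisy gradient signal at two values of \(\beta\):

```python
import numpy as np

rng = np.random.default_rng(1)
g = np.sin(np.linspace(0, 6, 200)) + rng.normal(0, 0.4, 200)   # noisy "gradient" signal (synthetic)

def ema(values, beta):
    avg, out = 0.0, []
    for v in values:
        avg = beta * avg + (1 - beta) * v    # the EMA recurrence
        out.append(avg)
    return np.array(out)

light = ema(g, beta=0.5)    # remembers roughly the last 2 values: still jittery
heavy = ema(g, beta=0.95)   # remembers roughly the last 20 values: much smoother, but lags the trend
```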
5. Momentum
Vanilla GD can oscillate when the loss surface is shaped like a narrow valley, steep in one direction and shallow in another: instead of moving directly toward the minimum, it bounces back and forth across the steep sides while making only slow progress along the valley floor. Momentum helps fix this by maintaining a velocity that accumulates past gradients, like a ball rolling downhill: speed builds up in directions that stay consistent and cancels out in directions that keep flipping sign, which lets the optimizer move more smoothly and usually faster toward the minimum.
\[v_t = \beta \, v_{t-1} + \alpha \, \nabla_\theta J(\theta)\] \[\theta := \theta - v_t\]The hyperparameter \(\beta\) (typically 0.9) controls how much of the previous velocity is retained, so a higher \(\beta\) means more momentum and the optimizer remembers more of its earlier direction. Both panels in the demo share the same elongated bowl, start point, and learning rate. Setting \(\beta = 0\) recovers vanilla GD; increasing it produces a smoother, faster path along the valley.
Notice that the velocity update has the same shape as the EMA recurrence from the previous section, just with \(\alpha\) folded into the new term in place of \((1 - \beta)\), so \(v_t\) is a running average of recent gradients with effective window \(\frac{1}{1 - \beta}\). Consistent gradients along the valley accumulate inside this average and the velocity grows; oscillating gradients across the walls cancel out and the net step shrinks. At \(\beta = 0\) the average has no memory and the update reduces to vanilla GD.
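A minimal sketch of the momentum loop on the same kind of elongated bowl (surface, start point, and hyperparameters are again illustrative):

```python
import numpy as np

def grad(p):                              # gradient of the elongated bowl x^2 + 10*y^2
    return np.array([2 * p[0], 20 * p[1]])

alpha, beta = 0.02, 0.9
p = np.array([-4.0, 2.0])
v = np.zeros(2)

for _ in range(100):
    v = beta * v + alpha * grad(p)        # velocity: an EMA-style accumulation of past gradients
    p = p - v                             # step along the velocity, not the raw gradient

print(p)   # with beta = 0 this loop is exactly vanilla gradient descent
```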
6. RMSProp
Momentum helps gradient descent move faster, but it still uses the same learning rate for every parameter. This can be a problem when different parameters behave very differently. Some directions may have large gradients and need smaller steps, while other directions may have small gradients and need larger steps. RMSProp solves this by adapting the learning rate separately for each parameter. It keeps a running average of the squared gradients:
\[s_t = \beta s_{t-1} + (1 - \beta)(\nabla_\theta J)^2\]Then it uses this value to scale the update:
\[\theta := \theta - \frac{\alpha}{\sqrt{s_t + \epsilon}} \nabla_\theta J\]The key idea is simple: read the update one parameter at a time. The value \(s_t\) is an exponential moving average, like in momentum, but it tracks squared gradients instead of gradients. So \(s_t\) acts like a per-parameter “gradient strength meter”. If a parameter has had large gradients recently, its \(s_t\) becomes large. RMSProp then divides the update by \(\sqrt{s_t + \epsilon}\), which makes the effective step size smaller for that parameter. This means that parameters with consistently large gradients get smaller effective learning rates, while parameters with smaller gradients are not reduced as much. On an elongated loss surface, this is exactly what we want. The steep direction often causes large gradients and zigzagging, so RMSProp damps that direction. The shallow direction has smaller gradients, so it can keep moving forward. As a result, the optimization path becomes smoother and moves more directly toward the minimum.
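Here is the same loop with the RMSProp scaling added; the squaring and the division are elementwise, so each parameter really does get its own effective learning rate (bowl and hyperparameters are illustrative):

```python
import numpy as np

def grad(p):                              # gradient of the elongated bowl x^2 + 10*y^2
    return np.array([2 * p[0], 20 * p[1]])

alpha, beta, eps = 0.05, 0.9, 1e-8
p = np.array([-4.0, 2.0])
s = np.zeros(2)

for _ in range(300):
    g = grad(p)
    s = beta * s + (1 - beta) * g**2            # EMA of squared gradients, one entry per parameter
    p = p - alpha / np.sqrt(s + eps) * g        # large recent gradients -> smaller effective step

print(p)   # ends near the minimum; with a fixed alpha, RMSProp hovers around it at a scale set by alpha
```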
Below, the left panel shows Momentum and the right shows RMSProp on the same elongated bowl. Momentum accelerates along the valley but applies the same \(\alpha\) to both dimensions, so it can still overshoot in y before settling. RMSProp has no velocity, but it immediately shrinks y’s effective \(\alpha\) because y’s gradients are large, while x’s effective \(\alpha\) stays near the base value.
7. Adam: The Best of Both Worlds
Adam (Adaptive Moment Estimation) combines the ideas of Momentum and RMSProp. It maintains both a first moment (mean of gradients, like Momentum) and a second moment (mean of squared gradients, like RMSProp), plus bias correction to account for the fact that both estimates start at zero.
First moment (momentum):
\[m_t = \beta_1 \, m_{t-1} + (1 - \beta_1) \nabla_\theta J\]Second moment (adaptive learning rate):
\[v_t = \beta_2 \, v_{t-1} + (1 - \beta_2) (\nabla_\theta J)^2\]Bias correction:
\[\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}\]Update rule:
\[\theta := \theta - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t\]The default hyperparameters (\(\beta_1 = 0.9\), \(\beta_2 = 0.999\), \(\epsilon = 10^{-8}\)) work well across a wide range of problems, which is why Adam is the most popular optimizer in deep learning.
The bias correction is important because both \(m_t\) and \(v_t\) are initialized at zero, which makes them biased toward zero during the early training steps. The correction factors \((1 - \beta_1^t)\) and \((1 - \beta_2^t)\) compensate for this initialization bias and allow the estimates to better reflect the true first and second moments of the gradients from the beginning of training. Without this correction, Adam would take very small steps at first and only gradually increase their size over time, slowing early progress.
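Putting the four equations together, a bare-bones Adam step looks like the sketch below; the gradient function and start point are illustrative, and the learning rate is deliberately larger than the usual 0.001 default so this tiny deterministic problem converges in a couple hundred steps:

```python
import numpy as np

def grad(p):                                   # gradient of the elongated bowl x^2 + 10*y^2
    return np.array([2 * p[0], 20 * p[1]])

alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
p = np.array([-4.0, 2.0])
m = np.zeros(2)                                # first moment: EMA of gradients
v = np.zeros(2)                                # second moment: EMA of squared gradients

for t in range(1, 201):                        # t starts at 1 so the bias-correction factors are well defined
    g = grad(p)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)                 # undo the bias from initializing m at zero
    v_hat = v / (1 - beta2**t)                 # likewise for v
    p = p - alpha * m_hat / (np.sqrt(v_hat) + eps)

print(p)   # each update is roughly bounded by alpha in magnitude, so 200 steps easily cover the distance
```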
8. Escaping Saddle Points
In high-dimensional optimization problems such as deep learning, saddle points are much more common than local minima. A saddle point is a location where the gradient is zero, but the surface curves upward in one direction and downward in another, like the middle of a horse saddle. Basic gradient descent with a small learning rate can slow down or stall near saddle points because the gradient magnitude becomes very small. Optimizers with momentum keep moving using information from previous gradients, which helps them pass through these flat regions, and Adam goes further by combining momentum with per-parameter adaptive step sizes. On a surface like \(f(x, y) = x^2 - y^2\) with a saddle at the origin, vanilla GD crawls because the gradient near the saddle is tiny, while Adam’s second-moment estimate \(\hat{v}_t\) also stays small along the flat direction, so the ratio \(\hat{m}_t / \sqrt{\hat{v}_t}\) keeps the effective step a useful size and pushes the optimizer through.
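As a sanity check on that intuition, here is a sketch comparing vanilla GD and Adam starting just off the saddle of \(f(x, y) = x^2 - y^2\); the start point, learning rates, and step counts are arbitrary choices:

```python
import numpy as np

def grad(p):                         # gradient of f(x, y) = x^2 - y^2; zero at the origin (the saddle)
    return np.array([2 * p[0], -2 * p[1]])

start = np.array([0.5, 1e-3])        # almost on the ridge: the escape direction (y) has a tiny gradient
alpha = 0.01

# Vanilla gradient descent
p = start.copy()
for _ in range(100):
    p = p - alpha * grad(p)
print("vanilla GD y:", p[1])         # y grows painfully slowly because its gradient starts near zero

# Adam
p = start.copy()
m, v = np.zeros(2), np.zeros(2)
beta1, beta2, eps = 0.9, 0.999, 1e-8
for t in range(1, 101):
    g = grad(p)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    p = p - alpha * (m / (1 - beta1**t)) / (np.sqrt(v / (1 - beta2**t)) + eps)
print("Adam y:     ", p[1])          # m_hat / sqrt(v_hat) stays about 1 in size, so y escapes at roughly alpha per step
```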
9. Mini-Batch Size: The Noise Knob
In practice, we almost never use pure SGD (batch size = 1) or full Batch GD. Instead, we use mini-batch gradient descent, where each update averages the gradient over a small batch of \(B\) samples:
\[\theta := \theta - \frac{\alpha}{B} \sum_{i=1}^{B} \nabla_\theta J(\theta;\; x^{(i)}, y^{(i)})\]The batch size acts as a noise knob:
| Batch Size | Gradient Quality | Update Frequency | GPU Utilization |
|---|---|---|---|
| 1 (pure SGD) | Very noisy | Very fast | Low |
| 32 (typical) | Moderate noise | Fast | Good |
| 256 | Low noise | Moderate | High |
| Full dataset | No noise | Slow | Varies |
Larger batches give smoother gradients but fewer updates per epoch, while smaller batches add noise that can help generalization at the cost of noisier convergence. Each mini-batch step costs compute proportional to \(B\), but the gradient noise only shrinks like \(1 / \sqrt{B}\), so halving the noise requires quadrupling the batch size, and with it the per-step compute. That diminishing return is why 32 to 256 is the sweet spot in practice: small enough that updates are frequent and gradient noise gives a regularizing effect, large enough that each gradient is a reasonable estimate and the GPU stays busy.
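The \(1/\sqrt{B}\) scaling is easy to verify empirically; the sketch below uses synthetic per-sample gradients (pure noise around a made-up true gradient) just to show the statistics:

```python
import numpy as np

rng = np.random.default_rng(0)
true_grad = 1.0
per_sample = true_grad + rng.normal(0, 1.0, 100_000)   # synthetic per-sample gradients around a made-up true value

for B in (1, 32, 256):
    usable = (len(per_sample) // B) * B
    batch_grads = per_sample[:usable].reshape(-1, B).mean(axis=1)   # one mini-batch gradient per row
    print(f"B = {B:>3}: std of the mini-batch gradient = {batch_grads.std():.3f}")

# The std falls like 1/sqrt(B): roughly 1.0 -> 0.18 -> 0.06, while each batch costs B times the compute.
```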
10. Summary and Comparison
Here is a reference table of all the optimizers we covered:
| Optimizer | Update Rule | Key Idea | Hyperparameters |
|---|---|---|---|
| Batch GD | $$\theta - \alpha \nabla J$$ | Full-batch, deterministic | $$\alpha$$ |
| SGD | $$\theta - \alpha \nabla J_i$$ | Single-sample noisy gradient | $$\alpha$$ |
| Momentum | $$\theta - v_t$$ where $$v_t = \beta v_{t-1} + \alpha \nabla J$$ | Accumulate past gradients | $$\alpha, \beta$$ |
| RMSProp | $$\theta - \frac{\alpha}{\sqrt{s_t + \epsilon}} \nabla J$$ | Per-parameter adaptive rates | $$\alpha, \beta$$ |
| Adam | $$\theta - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$ | Momentum + adaptive rates + bias correction | $$\alpha, \beta_1, \beta_2$$ |
When to use what?
- Adam is the default choice for most deep learning tasks. Its defaults (\(\alpha = 0.001\), \(\beta_1 = 0.9\), \(\beta_2 = 0.999\)) work well out of the box.
- SGD + Momentum often generalizes better than Adam on well-tuned models (especially in computer vision), but requires more careful learning rate tuning and scheduling.
- RMSProp is popular for recurrent neural networks and reinforcement learning.
- Batch GD is mainly used for small datasets or convex problems where you want deterministic convergence.
The optimizer is not just a knob to turn; it fundamentally shapes how your model navigates the loss landscape, and understanding the tradeoffs between speed, stability, and generalization is what makes the difference between a model that trains and one that does not. When in doubt, start with Adam and tune from there.
Continue the ML Series
This post is part of a larger Interactive Machine Learning series. If you would like to learn more, check out the other posts in this series. Next up is Perceptron & MLP, where we put these optimizers to work training a multi-layer perceptron from scratch.