Any model that learns weights from data can overfit by assigning extreme weight values to chase noise in the training set. This applies to linear regression, polynomial regression, logistic regression, and neural networks alike. Regularization is a general technique that prevents this by adding a penalty to the loss function that discourages large weights, forcing the model to find simpler solutions that generalise better. Regularization is not tied to any particular model: in this chapter we use polynomial regression as a visual playground because it makes overfitting easy to inspect, but every formula and insight here applies to any model that learns a weighted combination of features by minimising a loss.
In this guide, you will:
- See why overfitting happens at the coefficient level through large weights
- Watch Ridge (L2) smoothly shrink coefficients toward zero as the penalty grows
- Understand the geometry that explains why Lasso (L1) produces exact zeros and performs feature selection
- Explore Elastic Net, which combines both L1 and L2 penalties, and see how it can balance coefficient shrinkage with sparsity
1. The Overfitting Problem
When a model has more capacity than the data justifies, it spends the excess fitting noise, which shows up as large coefficient values. Below, 20 noisy points are generated from a smooth true function (\(y = \sin(1.2x) + 0.4x - 1\) plus Gaussian noise) and we fit a degree-10 polynomial using the closed-form equation \(\mathbf{w} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\) with no regularization. The fit passes near every point but oscillates wildly, and the coefficient magnitudes on the right show that several weights have grown into the hundreds. The demo uses the closed-form solution, but the same thing happens with gradient descent or any other optimization method: the model tries to reduce training error as much as possible, and without any penalty on weight size, the cheapest way to do that is to assign huge weights that chase noise. The result is a very wiggly curve that will perform poorly on new data.
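If you want to reproduce the effect outside the demo, a minimal NumPy sketch follows. The true function, point count, and polynomial degree match the description above; the seed, \(x\)-range, and noise level are illustrative assumptions.

```python
import numpy as np

# Sketch of the Section 1 setup: 20 noisy samples of the true function,
# then an unregularized degree-10 polynomial fit via the normal equation.
rng = np.random.default_rng(0)           # seed chosen arbitrarily
x = rng.uniform(-3.0, 3.0, size=20)      # assumed x-range
y = np.sin(1.2 * x) + 0.4 * x - 1 + rng.normal(scale=0.3, size=20)

# Design matrix with columns x^0, x^1, ..., x^10.
X = np.vander(x, 11, increasing=True)

# w = (X^T X)^{-1} X^T y, solved without forming the inverse explicitly.
w = np.linalg.solve(X.T @ X, X.T @ y)
print("largest |w|:", np.abs(w).max())   # typically enormous: overfitting
```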
Each time you click New Data the coefficients change dramatically, which is a sign of high variance. The regularization approach prevents this by adding a penalty to the loss function that discourages large weights:
\[J_{\text{regularized}}(\mathbf{w}) = \underbrace{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}_{\text{data fit (MSE)}} + \underbrace{\lambda \cdot R(\mathbf{w})}_{\text{penalty}}\]Here \(\lambda > 0\) controls the penalty strength and \(R(\mathbf{w})\) is the regularization term; different choices of \(R\) give different regularizers. The next sections work through the three most common ones: Ridge (L2), Lasso (L1), and Elastic Net.
2. Ridge Regression (L2 Regularization)
Ridge regression adds the sum of squared weights as the penalty:
\[J_{\text{Ridge}}(\mathbf{w}) = \frac{1}{n}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \lambda \sum_{j=1}^{d} w_j^2\]We typically do not penalise the bias term \(w_0\), since the bias just shifts the entire function up or down and does not contribute to overfitting. The penalty discourages any single weight from becoming too large, and a larger \(\lambda\) means a stronger penalty and smaller weights. One useful property of Ridge is that it has a closed-form solution. Starting from the normal equation and adding the penalty gives:
\[\mathbf{w}_{\text{Ridge}} = (\mathbf{X}^T\mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^T\mathbf{y}\]Compare this to ordinary least squares: \(\mathbf{w}_{\text{OLS}} = (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T\mathbf{y}\). The only difference is the \(\lambda \mathbf{I}\) term added to \(\mathbf{X}^T\mathbf{X}\) (in practice, the diagonal entry corresponding to the unpenalised bias \(w_0\) is set to zero), which has two effects: it shrinks all coefficients toward zero (with more shrinkage for larger \(\lambda\)), and it guarantees invertibility because even if \(\mathbf{X}^T\mathbf{X}\) is singular, adding \(\lambda \mathbf{I}\) makes it positive definite.
Production machine learning typically uses gradient-descent-based optimizers, but in this chapter we use the closed-form Ridge solution because it lets us directly compute the optimal weights for any \(\lambda\) without iterative optimization. The demo below reuses the same data points from Section 1, with the slider starting at \(\lambda \approx 0\) so the fit matches the wild overfit curve you just saw. As you drag \(\lambda\) upward the curve smooths out toward the true function and the coefficient bars shrink, and once \(\lambda\) becomes very large the curve flattens to nearly a constant. The coefficients shrink but never reach exactly zero, which is the key limitation of Ridge: no matter how large \(\lambda\) gets, every feature is kept in the model, just with reduced influence.
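To make this concrete, here is a minimal sketch of closed-form Ridge on the same kind of data, sweeping \(\lambda\). The data generation is an illustrative assumption, and the bias column is left unpenalised as discussed above.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3.0, 3.0, size=20)
y = np.sin(1.2 * x) + 0.4 * x - 1 + rng.normal(scale=0.3, size=20)
X = np.vander(x, 11, increasing=True)    # degree-10 polynomial features

def ridge_closed_form(X, y, lam):
    """w = (X^T X + lam * I)^{-1} X^T y, with the bias left unpenalised."""
    penalty = lam * np.eye(X.shape[1])
    penalty[0, 0] = 0.0                  # do not shrink the bias weight w_0
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)

for lam in [0.0, 0.01, 1.0, 100.0]:
    w = ridge_closed_form(X, y, lam)
    print(f"lambda={lam:>7}: max |w| = {np.abs(w).max():.4f}")
# The weights shrink steadily as lambda grows, but none hit exactly zero.
```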
3. Lasso Regression (L1 Regularization)
Lasso (Least Absolute Shrinkage and Selection Operator) uses the sum of absolute values of the weights instead of the sum of squares:
\[J_{\text{Lasso}}(\mathbf{w}) = \frac{1}{n}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \lambda \sum_{j=1}^{d} \lvert w_j \rvert\]Unlike Ridge, Lasso has no closed-form solution because the absolute value \(\lvert w_j \rvert\) is not differentiable at zero, so we cannot simply take the derivative and set it to zero. Instead, Lasso is typically solved with coordinate descent: update one weight at a time while keeping all others fixed. For each weight \(w_j\), coordinate descent first computes the value the data alone would assign to that weight, a quantity we call the signal \(\rho_j\), which measures the correlation between feature \(j\) and the residual error left over by the other features. It then applies a rule called soft thresholding: if the signal is strongly positive (above \(\lambda\)), the weight is set to a positive value shifted down by \(\lambda\); if it is strongly negative (below \(-\lambda\)), the weight is set to a negative value shifted up by \(\lambda\); and if the signal is weak (between \(-\lambda\) and \(\lambda\)), the weight is set to exactly zero. In math, this is:
\[w_j \leftarrow \begin{cases} (\rho_j - \lambda) / z_j & \text{if } \rho_j > \lambda \\ 0 & \text{if } \lvert \rho_j \rvert \leq \lambda \\ (\rho_j + \lambda) / z_j & \text{if } \rho_j < -\lambda \end{cases}\]where \(z_j\) is a normalisation factor (the sum of squared values of that feature column). Intuitively, for each feature the data sends a signal saying that the feature should have weight \(\rho_j\), and Ridge always listens but dampens that signal, whereas Lasso has a threshold and ignores any signal weaker than \(\lambda\) by setting the weight to zero. The larger \(\lambda\) is, the wider this dead zone becomes and the more features are eliminated from the model entirely.
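Below is a minimal coordinate-descent Lasso that implements this update directly. It is a sketch, not a production solver: intercept handling and convergence checks are omitted, and the constant factors that different texts attach to the penalty are absorbed into \(\lambda\). The toy data at the end is a synthetic assumption chosen for illustration.

```python
import numpy as np

def lasso_coordinate_descent(X, y, lam, n_iters=100):
    """Cycle through the weights, applying the soft-thresholding update."""
    n, d = X.shape
    w = np.zeros(d)
    z = (X ** 2).sum(axis=0)             # z_j: sum of squares of column j
    for _ in range(n_iters):
        for j in range(d):
            # Residual with feature j's current contribution added back.
            r = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r            # the signal rho_j for this feature
            if rho > lam:
                w[j] = (rho - lam) / z[j]
            elif rho < -lam:
                w[j] = (rho + lam) / z[j]
            else:
                w[j] = 0.0               # weak signal: eliminate the feature
    return w

# Toy check: features 1, 2, and 4 are irrelevant, so Lasso should zero
# them out while shrinking the relevant weights slightly toward zero.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, 0.0, -2.0, 0.0]) + rng.normal(scale=0.1, size=100)
print(lasso_coordinate_descent(X, y, lam=10.0).round(2))
```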
4. Why Lasso Produces Zeros: L1 vs L2 Geometry
Now that you have seen both Ridge and Lasso in action, the next question is why Lasso drives coefficients to exactly zero while Ridge does not. Regularization can be viewed as a constrained optimisation problem: instead of minimising \(J(\mathbf{w}) + \lambda R(\mathbf{w})\), we can equivalently minimise \(J(\mathbf{w})\) subject to \(R(\mathbf{w}) \leq t\) for some budget \(t\). For L2 (Ridge) this constraint is \(\sum w_j^2 \leq t\) and the constraint region is a circle (a sphere in higher dimensions), while for L1 (Lasso) the constraint is \(\sum \lvert w_j \rvert \leq t\) and the region is a diamond (a cross-polytope).
The optimal solution sits where the elliptical contours of the loss function first touch the constraint region. Because the diamond has corners that lie exactly on the coordinate axes, the contours are much more likely to touch at a corner, which means one or more weights are exactly zero. The circle has no corners, so the touching point is almost never on an axis. The demo below shows this in 2D with two weights (\(w_1, w_2\)), where the ellipses represent contours of the MSE loss and the shaded region is the constraint boundary. Drag the contour centre and the ellipse angle to see how the touch point behaves for different loss orientations.
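A quick numeric check of this geometry, using scikit-learn on synthetic data (the data and penalty values are illustrative assumptions): one feature drives the target and the other is pure noise, so Ridge shrinks the noise weight while Lasso's corner solution removes it entirely.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=200)   # only x1 matters

print("Ridge:", Ridge(alpha=10.0).fit(X, y).coef_)    # x2 small but nonzero
print("Lasso:", Lasso(alpha=0.2).fit(X, y).coef_)     # x2 exactly 0.0
```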
5. Elastic Net: The Best of Both Worlds
Elastic Net combines L1 and L2 penalties using a mixing parameter \(\alpha \in [0, 1]\):
\[J_{\text{ElasticNet}}(\mathbf{w}) = \frac{1}{n}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \lambda \left[\alpha \sum_{j=1}^{d}|w_j| + (1-\alpha)\sum_{j=1}^{d}w_j^2\right]\]Setting \(\alpha = 1\) recovers pure Lasso, \(\alpha = 0\) recovers pure Ridge, and any value in between blends the two. The reason to combine them is that pure Lasso has a limitation when features are highly correlated, since it tends to pick one and ignore the rest, whereas the L2 component of Elastic Net encourages correlated features to share the weight more evenly while the L1 component still drives some weights to zero. The demo below shows this geometrically: as \(\alpha\) changes, the constraint region morphs from a circle (Ridge) to a diamond (Lasso), passing through rounded-diamond shapes at intermediate values that can still produce sparsity but are smoother than pure Lasso.
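Here is a sketch of Elastic Net with scikit-learn on synthetic data built for this scenario (the data and penalty values are illustrative assumptions). Note the naming mismatch: scikit-learn's alpha plays the role of \(\lambda\) here, its l1_ratio is this section's \(\alpha\), and its internal penalty scaling differs by constant factors.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(3)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=200)    # strongly correlated with x1
x3 = rng.normal(size=200)                     # irrelevant feature
X = np.column_stack([x1, x2, x3])
y = 2.0 * x1 + 2.0 * x2 + rng.normal(scale=0.5, size=200)

# Expect the L2 part to make the correlated pair share weight and the
# L1 part to zero out the irrelevant x3.
model = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)
print(model.coef_)
```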
6. Summary
Regularization penalises complexity by adding a term to the loss function that discourages large weights, with a hyperparameter \(\lambda\) that controls the penalty strength. Ridge (L2) shrinks all coefficients smoothly toward zero but never eliminates any, which makes it a good default when every feature is likely to contribute and you just want to prevent overfitting. Lasso (L1) can drive coefficients to exactly zero and therefore performs automatic feature selection, which is explained geometrically by the diamond-shaped constraint region having corners on the coordinate axes. Elastic Net combines both penalties with a mixing parameter \(\alpha\), inheriting sparsity from L1 and the grouping effect from L2, which makes it a strong choice when features are correlated. Choosing \(\lambda\) in practice is typically done with cross-validation: sweep a range of values on a log scale and pick the one with the lowest validation error, as sketched in the code after the table below.
| Property | Ridge (L2) | Lasso (L1) | Elastic Net |
|---|---|---|---|
| Penalty | $$\lambda \sum w_j^2$$ | $$\lambda \sum \lvert w_j \rvert$$ | $$\lambda[\alpha\sum \lvert w_j \rvert + (1-\alpha)\sum w_j^2]$$ |
| Constraint shape | Circle (sphere) | Diamond (cross-polytope) | Rounded diamond |
| Sparsity | No, coefficients shrink but never reach zero | Yes, drives coefficients to exactly zero | Yes, but less aggressively than Lasso |
| Feature selection | No | Yes, automatic | Yes |
| Correlated features | Shares weight among correlated features | Picks one, ignores the rest | Groups correlated features together |
| When to use | All features likely relevant; prevent overfitting | Many irrelevant features; want interpretability | Correlated features; want sparsity + stability |
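As a starting point, here is what that cross-validation sweep can look like with scikit-learn's LassoCV (the synthetic data is a stand-in assumption; note again that scikit-learn names the penalty strength alpha rather than \(\lambda\)):

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 10))
true_w = np.zeros(10)
true_w[:3] = [2.0, -1.0, 0.5]                 # only 3 relevant features
y = X @ true_w + rng.normal(scale=0.3, size=200)

# 5-fold cross-validation over 50 penalty values spaced on a log scale.
model = LassoCV(alphas=np.logspace(-4, 1, 50), cv=5).fit(X, y)
print("best penalty:", model.alpha_)
print("coefficients:", model.coef_.round(2))
```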
Continue the ML Series
This post is part of a bigger Interactive Machine Learning series. If you would like to learn more, check out the other posts in this series.