In the Linear Regression guide, we built linear regression from scratch and saw how a straight line can capture the trend in data. But the real world is rarely so simple. When the underlying relationship between input and output is nonlinear, forcing a straight line through the data leaves systematic patterns in the residuals: the model is too simple for the data, which is what we call underfitting. The natural next step is to let the model learn curves, and that is exactly what polynomial regression does. But with greater flexibility the model can bend so aggressively that it memorizes noise rather than capturing the true pattern, which is overfitting. The tension between these two extremes is the bias-variance tradeoff, and it sits at the heart of machine learning.

In this guide, you will:

  • Extend linear regression to polynomial features and fit smooth curves through noisy data
  • Visualize the bias-variance tradeoff by training the same model on many random samples
  • Build intuition for how model complexity impacts bias and variance, and how regularization can help

1. Polynomial Regression

Linear regression fits a straight line \(h(x) = w_0 + w_1 x\). Polynomial regression simply adds powers of \(x\) as extra features so that \(h(x) = w_0 + w_1 x + w_2 x^2 + \ldots + w_d x^d\), where \(d\) is the degree of the polynomial. A degree-1 polynomial is a line, degree 2 is a parabola, degree 3 can have one inflection point, and so on. Despite the nonlinear features, this is still a linear model in the parameters \(w_0, w_1, \ldots, w_d\), and we simply construct a new feature matrix:

\[\mathbf{X} = \begin{bmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^d \\ 1 & x_2 & x_2^2 & \cdots & x_2^d \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^d \end{bmatrix}\]
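As a quick sketch of this construction in numpy (the helper name `poly_features` is ours; `np.vander` builds the matrix directly):

```python
import numpy as np

def poly_features(x, degree):
    """Expand a 1-D input into the [1, x, x^2, ..., x^d] feature matrix."""
    # increasing=True orders columns as x^0, x^1, ..., x^d,
    # so column 0 is the all-ones bias column.
    return np.vander(np.asarray(x, dtype=float), N=degree + 1, increasing=True)

print(poly_features([0.5, 1.0, 2.0], degree=3))
# [[1.    0.5   0.25  0.125]
#  [1.    1.    1.    1.   ]
#  [1.    2.    4.    8.   ]]
```

Fitting this expanded matrix with ordinary least squares is all that "polynomial regression" means: the model stays linear in the weights.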

The demo below fits three polynomials at degrees 1, 3, and 5 to a small noisy dataset. Click on the canvas to add your own points, or use the sample data button to draw a fresh batch, and watch how each curve responds. The polynomial coefficients are found via the closed-form normal equation with a small ridge term for numerical stability. This closed-form approach is used here purely for demonstration; in practice you would typically reach for an iterative optimizer like gradient descent, or a library routine that stays numerically stable and efficient at higher degrees.

Click to add points. Minimum 2 required.
Settings: 15 sampled points from a noisy sinusoid, polynomial fit via the closed-form solution with a small ridge term for stability.
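For reference, here is a minimal numpy sketch of that closed-form fit, solving \((\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})\,\mathbf{w} = \mathbf{X}^\top \mathbf{y}\). The specific sinusoid and noise level below are our stand-ins for the demo's generator:

```python
import numpy as np

def fit_polynomial(x, y, degree, ridge=1e-8):
    """Closed-form polynomial fit with a small ridge term for stability."""
    X = np.vander(x, N=degree + 1, increasing=True)
    A = X.T @ X + ridge * np.eye(degree + 1)
    return np.linalg.solve(A, X.T @ y)   # coefficients w_0 ... w_d

# 15 noisy samples from a sinusoid (assumed generator, as in the demo settings).
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 2.0 * np.pi, 15))
y = np.sin(x) + rng.normal(0.0, 0.3, size=x.shape)

for d in (1, 3, 5):
    w = fit_polynomial(x, y, d)
    y_hat = np.vander(x, N=d + 1, increasing=True) @ w
    print(f"degree {d}: train MSE = {np.mean((y - y_hat) ** 2):.4f}")
```

As you increase the degree, the training MSE falls, which is exactly the behavior the next section examines.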

2. Underfitting, Overfitting, and the Bias-Variance Tradeoff

The bias-variance tradeoff is a fundamental concept that describes the tension between underfitting and overfitting. A model with low complexity (like a degree-1 polynomial) has high bias: it cannot capture the true pattern in the data, so it makes systematic errors. A model with high complexity (like a degree-10 polynomial) has high variance: it can bend to fit its training data almost exactly, noise included, but performs poorly on new data. The sweet spot is somewhere in between, where the model is flexible enough to capture the underlying pattern but not so flexible that it chases noise. The demo below lets you explore this tradeoff by fitting polynomials of varying degrees to different random samples from the same underlying function.
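This tension can be made precise. For squared-error loss, the expected error of a fitted model \(\hat{h}\) at a point \(x\) decomposes into three parts, where the expectation is over random training sets, \(h^*\) is the true function, and \(\sigma^2\) is the noise variance:

\[\mathbb{E}\big[(y - \hat{h}(x))^2\big] = \underbrace{\big(\mathbb{E}[\hat{h}(x)] - h^*(x)\big)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}\Big[\big(\hat{h}(x) - \mathbb{E}[\hat{h}(x)]\big)^2\Big]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{irreducible noise}}\]

Only the first two terms depend on the model, and complexity moves them in opposite directions.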

Twenty-five noisy points are sampled from a hidden true function shown as a dashed purple line, and the slider controls the polynomial degree from 1 to 15. At degrees 1 to 2 the curve is too rigid to capture the true shape and sits systematically off the data; this is the underfitting regime. Around degrees 3 to 5 the curve tracks the true function closely: the sweet spot. By degree 10 and above the curve oscillates wildly between points and chases the noise rather than the signal, which is the overfitting regime.

Settings: 25 points sampled from y = sin(1.5x) + 0.5x + noise (sigma 0.5), polynomial fit via the closed-form solution. Dashed purple is the true function.
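To see the same effect numerically, the sketch below refits each degree on many fresh draws from the demo's generator \(y = \sin(1.5x) + 0.5x + \varepsilon\) and estimates bias² and variance on a grid of test points. The input range \([0, 5]\) is our assumption, and we use `lstsq` rather than the demo's explicit normal equation for robustness at high degree:

```python
import numpy as np

rng = np.random.default_rng(42)
true_f = lambda x: np.sin(1.5 * x) + 0.5 * x   # the demo's hidden true function
x_test = np.linspace(0.0, 5.0, 50)             # evaluation grid (assumed range)

def bias_variance(degree, n_trials=200, n_points=25, sigma=0.5):
    """Estimate bias^2 and variance of a degree-d fit across random samples."""
    preds = np.empty((n_trials, x_test.size))
    for t in range(n_trials):
        x = rng.uniform(0.0, 5.0, n_points)
        y = true_f(x) + rng.normal(0.0, sigma, n_points)
        X = np.vander(x, degree + 1, increasing=True)
        w = np.linalg.lstsq(X, y, rcond=None)[0]
        preds[t] = np.vander(x_test, degree + 1, increasing=True) @ w
    bias2 = np.mean((preds.mean(axis=0) - true_f(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias2, variance

for d in (1, 3, 10):
    b2, v = bias_variance(d)
    print(f"degree {d:2d}: bias^2 = {b2:.3f}, variance = {v:.3f}")
```

You should see bias fall and variance rise as the degree grows, which is the tradeoff in numbers.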

A side-by-side comparison of the two failure modes makes the contrast concrete:

                 Underfitting                Overfitting
Degree           Too low (1-2)               Too high (10+)
Training error   High                        Very low
Test error       High                        High
Symptom          Model misses the pattern    Model memorizes noise
Bias             High                        Low
Variance         Low                         High

Increasing model complexity decreases bias but increases variance, so the optimal complexity is the one that balances the two against each other.


3. Training vs Validation Error

The practical way to detect overfitting is to split your data into training and validation sets and then plot error against model complexity. Training error almost always decreases as the degree increases, because a more flexible model can fit its own training data better. Validation error tells a different story, dropping at first as the model stops underfitting and then rising again once it starts overfitting. The optimal degree is the one where validation error is lowest, and the gap between the two lines is the cleanest indicator of overfitting. The noise level in the data also shifts that optimum, since cleaner data lets you afford a higher-degree polynomial while noisier data calls for a simpler model that refuses to chase the random fluctuations.

Panels: polynomial fit at the selected degree; training and validation error vs degree.
Settings: 40 points sampled from y = sin(1.5x) + 0.5x + noise (sigma 0.5), shuffled and split 25 train / 15 validation, polynomial fit via the closed-form solution.
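A minimal numpy sketch of this experiment, mirroring the settings above (40 points, 25/15 split; the input range and use of `lstsq` are our choices):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(0.0, 5.0, 40)                  # assumed input range
y = np.sin(1.5 * x) + 0.5 * x + rng.normal(0.0, 0.5, 40)

idx = rng.permutation(40)                      # shuffle, then split 25 / 15
tr, va = idx[:25], idx[25:]

def mse(w, xs, ys, degree):
    preds = np.vander(xs, degree + 1, increasing=True) @ w
    return np.mean((preds - ys) ** 2)

best_degree, best_err = None, np.inf
for d in range(1, 16):
    X = np.vander(x[tr], d + 1, increasing=True)
    w = np.linalg.lstsq(X, y[tr], rcond=None)[0]
    tr_err, va_err = mse(w, x[tr], y[tr], d), mse(w, x[va], y[va], d)
    print(f"degree {d:2d}: train {tr_err:.3f}   val {va_err:.3f}")
    if va_err < best_err:
        best_degree, best_err = d, va_err

print(f"best degree by validation error: {best_degree}")
```

The printed training column falls monotonically or nearly so, while the validation column traces the U-shape described above.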

4. Polynomial Feature Magnitudes

In polynomial regression with degree \(d\), each input is expanded into features \([1, x, x^2, \ldots, x^d]\). This expansion is what gives the model its flexibility, but it also creates very large differences in feature magnitudes as the degree grows. If you train with gradient descent rather than the closed-form solution, feature scaling becomes essential for stable and efficient optimization. For a single input value \(x = 5\), the polynomial features grow quickly:

\[[1, x, x^2, x^3, \ldots, x^8] = [1, 5, 25, 125, 625, 3125, 15625, 78125, 390625]\]

All terms come from the same input, yet their scales differ by several orders of magnitude. The largest feature is close to \(4\times 10^5\), while the bias term stays at \(1\). The same effect shows up across different input values.

\(x\)     \(x^1\)    \(x^4\)     \(x^8\)
0.5       0.5        0.0625      0.0039
1.0       1.0        1.0         1.0
2.0       2.0        16.0        256.0
5.0       5.0        625.0       390625.0

This scale spread creates two practical issues during gradient-descent training. The first is uneven gradient magnitudes, where high-order features dominate the updates and low-order features barely move. The second is learning-rate sensitivity, where a step size that works well for one feature scale is far too aggressive or far too small for others. The standard fix is to center and scale the raw input first as \(x' = (x - \mu)/\sigma\) and then build \(x'^2, x'^3, \ldots\) from the normalized value, or alternatively to min-max scale into \([-1, 1]\) before the polynomial expansion. A small ridge term on top of either choice further improves conditioning and keeps the matrix inversion stable. We will cover regularization in more detail in the next guide, but the key takeaway is that polynomial regression is very sensitive to feature magnitudes and scaling is a must for gradient-based training.
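A minimal sketch of the scale-then-expand recipe (the helper name is ours):

```python
import numpy as np

def scaled_poly_features(x, degree, mu=None, sigma=None):
    """Standardize the raw input first, then build powers of the scaled value."""
    x = np.asarray(x, dtype=float)
    mu = x.mean() if mu is None else mu
    sigma = x.std() if sigma is None else sigma
    x_scaled = (x - mu) / sigma
    # Return mu and sigma so the identical transform can be applied at prediction time.
    return np.vander(x_scaled, degree + 1, increasing=True), mu, sigma

X, mu, sigma = scaled_poly_features([0.5, 1.0, 2.0, 5.0], degree=8)
print(np.abs(X).max())   # largest magnitude drops from ~3.9e5 to roughly 54
```

Note that the training-set \(\mu\) and \(\sigma\) must be reused for any new input; recomputing them at prediction time would silently change the features.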


5. Summary

  • Polynomial features: add \(x^2, x^3, \ldots, x^d\) to turn nonlinear regression into linear regression on expanded features.
  • Degree as complexity: a higher degree means a more flexible model with more parameters and more capacity to overfit.
  • Underfitting: the model is too simple to capture the pattern; high bias and high training error.
  • Overfitting: the model is too flexible and fits noise as if it were signal; low training error but poor test error.
  • Bias-variance tradeoff: total error decomposes into bias squared, variance, and irreducible noise; complexity trades one for the other.
  • Validation curve: training error keeps falling with degree while validation error follows a U-shape, so its minimum picks the best degree.
  • Feature scaling: polynomial features grow rapidly with degree, so normalization is essential for stable optimization.

Polynomial regression is a clean demonstration of a universal principle in machine learning: model complexity must be matched to the signal-to-noise ratio in the data. Too simple and you miss the pattern; too complex and you memorize the noise. Manually choosing the right degree is fragile, though, especially as the input dimension grows or the noise level shifts. The principled alternative is to use a flexible model and then control its complexity through regularization, which adds a penalty term that shrinks the weights and smooths the curve even when the degree is high.

Continue the ML Series

This post is part of a bigger Interactive Machine Learning series. If you would like to learn more, check out the other posts in this series. Next up is Regularization, Ridge, Lasso & Elastic Net, where we will add a penalty term to prevent overfitting, explore the L1 and L2 landscapes interactively, and see how regularization connects directly to the bias-variance tradeoff we built up here.