This is a follow-up to the first post, Linear Regression. Now we take the next step: multiple input features. We start with the simplest multivariate case, two features, because it can still be visualized. With two inputs, linear regression is no longer just a line; it becomes a plane in 3D space. More generally, with additional features the same idea extends to a hyperplane in higher dimensions.
The full model with two features is:
\[\hat{y} = w_1 x_1 + w_2 x_2 + b\]where \(w_1\) and \(w_2\) control the tilt of the prediction plane, and \(b\) (the bias) shifts the entire plane up or down. In the first interactive demo below, you can adjust all three to build intuition for how the plane moves. For the cost surface and gradient descent sections further down, we fix \(b = 0\) so we have only two free parameters, which lets us directly visualize the cost surface \(J(w_1, w_2)\) as a 3D bowl and watch gradient descent walking down that bowl to find the best weights.
1. From a Line to a Plane
With one feature, the hypothesis was a line on a 2D plot (\(x\) vs \(y\)):
\[\hat{y} = w \cdot x + b\]With two features, the hypothesis becomes a plane in 3D space (\(x_1\), \(x_2\), \(y\)):
\[\hat{y} = w_1 x_1 + w_2 x_2 + b\]The weight \(w_1\) controls how steeply the plane tilts along the \(x_1\) direction: increasing \(w_1\) means higher \(x_1\) values predict higher \(y\), and setting \(w_1 = 0\) makes the plane flat along \(x_1\), so the prediction no longer depends on \(x_1\) at all. The weight \(w_2\) does the same for the \(x_2\) direction and independently controls the other tilt axis. The bias \(b\) shifts the entire plane up or down without changing its tilt: \(b = 0\) forces the plane through the origin, while a nonzero \(b\) lets the plane sit at whatever height fits the data. Together, \(w_1\), \(w_2\), and \(b\) fully determine the prediction plane, and training means finding the values that make the plane pass as close as possible to all the data points.
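To make the roles of \(w_1\), \(w_2\), and \(b\) concrete, here is a minimal sketch in Python; the sample values are made up for illustration, not taken from the demo:

```python
def predict(x1, x2, w1, w2, b):
    """Prediction plane: y_hat = w1*x1 + w2*x2 + b."""
    return w1 * x1 + w2 * x2 + b

# Flat along x2 (w2 = 0): the prediction ignores x2 entirely.
print(predict(x1=2.0, x2=5.0, w1=1.5, w2=0.0, b=1.0))  # 4.0
# Nonzero w2: the x2 tilt now contributes 0.5 * 5.0 = 2.5.
print(predict(x1=2.0, x2=5.0, w1=1.5, w2=0.5, b=1.0))  # 6.5
```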
2. Seeing the Data in 3D
Below is a 3D scatter plot of the training data. Each point lives at \((x_1, x_2, y)\) in space, the semi-transparent blue surface is the prediction plane \(\hat{y} = w_1 x_1 + w_2 x_2 + b\), and the red dashed lines are the errors (residuals), the vertical distance from each point to the plane.
Notice how the red error lines shrink when you find good weights and grow when the plane is tilted the wrong way. The cost \(J\) averages those squared red line lengths (with the conventional factor of \(\tfrac{1}{2}\) that simplifies the derivatives), the same MSE from the first post, just extended to two features.
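In code, the residuals and the cost fall out of the same few lines. A minimal sketch, assuming the training data sits in NumPy arrays (the toy data here is invented for illustration, not taken from the demo):

```python
import numpy as np

# Toy training data: each row of X is one point (x1, x2), y is its target.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [4.0, 2.0]])
y = np.array([5.0, 4.0, 9.0, 8.0])

w1, w2, b = 1.0, 1.5, 0.5
y_hat = w1 * X[:, 0] + w2 * X[:, 1] + b    # height of the plane at each point
residuals = y_hat - y                      # lengths of the red dashed lines
J = np.sum(residuals ** 2) / (2 * len(y))  # the 1/(2m) MSE convention
print(residuals, J)
```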
3. The Cost Surface
For the visualizations below, we set \(b = 0\) so the cost depends on only two variables, \(w_1\) and \(w_2\). This lets us plot the cost as a 3D surface and a 2D contour, something impossible with three free parameters (that would need a 4D plot). The bias slider above still works for exploring the full model; down here we focus on the weight landscape.
The cost function measures how bad our current weights are:
\[J(w_1,w_2) = \frac{1}{2m}\sum_{i=1}^{m}\left(w_1 x_1^{(i)} + w_2 x_2^{(i)} - y^{(i)}\right)^2\]Every possible combination of \(w_1\) and \(w_2\) produces a different cost. Plotting all combinations gives us a cost surface: a 3D landscape where the horizontal axes are \(w_1\) and \(w_2\), and the vertical axis is the cost \(J\).
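A sketch of how such a surface can be computed by brute force, with \(b\) fixed at 0 and the same invented toy data as above:

```python
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [4.0, 2.0]])
y = np.array([5.0, 4.0, 9.0, 8.0])
m = len(y)

def cost(w1, w2):
    """J(w1, w2) with the bias fixed at b = 0."""
    y_hat = w1 * X[:, 0] + w2 * X[:, 1]
    return np.sum((y_hat - y) ** 2) / (2 * m)

# Evaluate J on a grid of (w1, w2) pairs; each entry is one point
# on the bowl, and plotting the grid gives the 3D surface/contours.
w1_vals = np.linspace(-1.0, 4.0, 50)
w2_vals = np.linspace(-1.0, 4.0, 50)
J_grid = np.array([[cost(a, c) for c in w2_vals] for a in w1_vals])
print(J_grid.min())  # the lowest point of the bowl on this grid
```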
For linear regression with MSE, the cost surface is convex, so any local minimum is also the global minimum. The contour plot on the right is a top-down view of the same surface, like a topographic map. This simple bowl shape is specific to linear regression with MSE; in more complex models such as neural networks, the loss surface is often non-convex.
The bowl shape is key. No matter where you start on this surface, if you always step downhill, you reach the single lowest point. This is exactly what gradient descent does. To learn more about gradient descent, check out this post.
4. Training with Gradient Descent
The update rules for two weights are a natural extension of the single-weight case:
\[w_1 := w_1 - \alpha \cdot \frac{\partial J}{\partial w_1} \qquad w_2 := w_2 - \alpha \cdot \frac{\partial J}{\partial w_2}\]where the partial derivatives are:
\[\frac{\partial J}{\partial w_1} = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right) \cdot x_1^{(i)} \qquad \frac{\partial J}{\partial w_2} = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right) \cdot x_2^{(i)}\]Each derivative tells us how the cost changes if we slightly increase that weight, and we step in the opposite direction (minus sign) to reduce the cost. The learning rate \(\alpha\) controls how big each step is.
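A minimal gradient descent loop implementing exactly these two update rules (again with \(b = 0\) and the invented toy data; the learning rate and iteration count are arbitrary choices that happen to converge here):

```python
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [4.0, 2.0]])
y = np.array([5.0, 4.0, 9.0, 8.0])

w1, w2 = 0.0, 0.0  # start anywhere on the bowl
alpha = 0.05       # learning rate

for _ in range(500):
    error = w1 * X[:, 0] + w2 * X[:, 1] - y  # y_hat - y for every point
    grad_w1 = np.mean(error * X[:, 0])       # dJ/dw1
    grad_w2 = np.mean(error * X[:, 1])       # dJ/dw2
    # Update both weights simultaneously, stepping downhill.
    w1 -= alpha * grad_w1
    w2 -= alpha * grad_w2

print(w1, w2)  # should settle near the bottom of the bowl
```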
After running gradient descent for enough iterations, the green dot settles at the bowl’s minimum, and the plane in the right panel fits through the data. The convergence curve shows the cost rapidly decreasing at first and then flattening as it approaches the minimum, the same pattern we saw in the single-feature case.
What to Learn From This
With one feature, the model is a line; with two features, it becomes a plane; and with \(n\) features, it becomes a hyperplane in \((n+1)\)-dimensional space. Each weight \(w_k\) independently controls how much feature \(x_k\) contributes to the prediction, so setting \(w_k = 0\) means we ignore feature \(k\). Gradient descent generalizes naturally because each weight gets its own partial derivative, and all weights update simultaneously each iteration. We kept \(b = 0\) intentionally for clarity, but adding a bias \(b\) only adds one more dimension to the cost surface and changes nothing about how the algorithm works.
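As a closing sketch, the same algorithm in vectorized form handles any number of features at once; the two-feature case is just \(n = 2\). The helper name and the trick of folding the bias into a constant feature of 1 are illustrative conventions, not from the demo:

```python
import numpy as np

def fit_linear_regression(X, y, alpha=0.1, iters=2000):
    """Gradient descent for y_hat = X @ w with any number of features.
    Prepend a column of ones to X if a bias term is wanted."""
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(iters):
        grad = X.T @ (X @ w - y) / m  # all n partial derivatives at once
        w -= alpha * grad
    return w

# Two-feature example: the leading column of ones acts as the bias feature.
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 1.0],
              [1.0, 3.0, 3.0],
              [1.0, 4.0, 2.0]])
y = np.array([5.0, 4.0, 9.0, 8.0])
print(fit_linear_regression(X, y))  # [b, w1, w2]
```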
Continue the ML Series
This post is part of a bigger Interactive Machine Learning series. If you would like to learn more, check out the other posts in this series.