This is a follow-up to the first post, Linear Regression. Now we take the next step: multiple input features. We start with the simplest multivariate case, two features, because it can still be visualized. With two inputs, linear regression is no longer just a line; it becomes a plane in 3D space. More generally, with additional features the same idea extends to a hyperplane in higher dimensions.
The full model with two features is:
\[\hat{y} = w_1 x_1 + w_2 x_2 + b\]where \(w_1\) and \(w_2\) control the tilt of the prediction plane, and \(b\) (the bias) shifts the entire plane up or down. In the first interactive demo below, you can adjust all three to build intuition for how the plane moves. For the cost surface and gradient descent sections further down, we fix \(b = 0\) so we have only two free parameters, which lets us directly visualize the cost surface \(J(w_1, w_2)\) as a 3D bowl and watch gradient descent walking down that bowl to find the best weights.
1. From a Line to a Plane
With one feature, the hypothesis was a line on a 2D plot (\(x\) vs \(y\)):
\[\hat{y} = w \cdot x + b\]With two features, the hypothesis becomes a plane in 3D space (\(x_1\), \(x_2\), \(y\)):
\[\hat{y} = w_1 x_1 + w_2 x_2 + b\]The weight \(w_1\) controls how steeply the plane tilts along the \(x_1\) direction: increasing \(w_1\) means higher \(x_1\) values predict higher \(y\), and setting \(w_1 = 0\) makes the plane flat along \(x_1\), so the prediction no longer depends on \(x_1\) at all. The weight \(w_2\) does the same for the \(x_2\) direction and independently controls the other tilt axis. The bias \(b\) shifts the entire plane up or down without changing its tilt: \(b = 0\) forces the plane through the origin, while a nonzero \(b\) lets the plane sit at whatever height fits the data. Together, \(w_1\), \(w_2\), and \(b\) fully determine the prediction plane, and training means finding the values that make the plane pass as close as possible to all the data points.
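To make the roles of \(w_1\), \(w_2\), and \(b\) concrete, here is a minimal sketch in Python; the sample values are made up for illustration, not taken from the demo:

```python
def predict(x1, x2, w1, w2, b):
    """Prediction plane: y_hat = w1*x1 + w2*x2 + b."""
    return w1 * x1 + w2 * x2 + b

# Flat along x2 (w2 = 0): the prediction ignores x2 entirely.
print(predict(x1=2.0, x2=5.0, w1=1.5, w2=0.0, b=1.0))  # 4.0
# Nonzero w2: the x2 tilt now contributes 0.5 * 5.0 = 2.5.
print(predict(x1=2.0, x2=5.0, w1=1.5, w2=0.5, b=1.0))  # 6.5
```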
2. Seeing the Data in 3D
Below is a 3D scatter plot of the training data. Each point lives at \((x_1, x_2, y)\) in space, the semi-transparent blue surface is the prediction plane \(\hat{y} = w_1 x_1 + w_2 x_2 + b\), and the red dashed lines are the errors (residuals), the vertical distance from each point to the plane.
Notice how the red error lines shrink when you find good weights and grow when the plane is tilted the wrong way. The cost \(J\) averages those squared red line lengths (with the conventional factor of \(\tfrac{1}{2}\) that simplifies the derivatives), the same MSE from the first post, just extended to two features.
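In code, the residuals and the cost fall out of the same few lines. A minimal sketch, assuming the training data sits in NumPy arrays (the toy data here is invented for illustration, not taken from the demo):

```python
import numpy as np

# Toy training data: each row of X is one point (x1, x2), y is its target.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [4.0, 2.0]])
y = np.array([5.0, 4.0, 9.0, 8.0])

w1, w2, b = 1.0, 1.5, 0.5
y_hat = w1 * X[:, 0] + w2 * X[:, 1] + b    # height of the plane at each point
residuals = y_hat - y                      # lengths of the red dashed lines
J = np.sum(residuals ** 2) / (2 * len(y))  # the 1/(2m) MSE convention
print(residuals, J)
```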
3. The Cost Surface
For the visualizations below, we set \(b = 0\) so the cost depends on only two variables, \(w_1\) and \(w_2\). This lets us plot the cost as a 3D surface and a 2D contour, something impossible with three free parameters (that would need a 4D plot). The bias slider above still works for exploring the full model; down here we focus on the weight landscape.
The cost function measures how bad our current weights are:
\[J(w_1,w_2) = \frac{1}{2m}\sum_{i=1}^{m}\left(w_1 x_1^{(i)} + w_2 x_2^{(i)} - y^{(i)}\right)^2\]Every possible combination of \(w_1\) and \(w_2\) produces a different cost. Plotting all combinations gives us a cost surface: a 3D landscape where the horizontal axes are \(w_1\) and \(w_2\), and the vertical axis is the cost \(J\).
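A sketch of how such a surface can be computed by brute force, with \(b\) fixed at 0 and the same invented toy data as above:

```python
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [4.0, 2.0]])
y = np.array([5.0, 4.0, 9.0, 8.0])
m = len(y)

def cost(w1, w2):
    """J(w1, w2) with the bias fixed at b = 0."""
    y_hat = w1 * X[:, 0] + w2 * X[:, 1]
    return np.sum((y_hat - y) ** 2) / (2 * m)

# Evaluate J on a grid of (w1, w2) pairs; each entry is one point
# on the bowl, and plotting the grid gives the 3D surface/contours.
w1_vals = np.linspace(-1.0, 4.0, 50)
w2_vals = np.linspace(-1.0, 4.0, 50)
J_grid = np.array([[cost(a, c) for c in w2_vals] for a in w1_vals])
print(J_grid.min())  # the lowest point of the bowl on this grid
```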
For linear regression with MSE, the cost surface is convex, so any local minimum is also the global minimum. The contour plot on the right is a top-down view of the same surface, like a topographic map. This simple bowl shape is specific to linear regression with MSE; in more complex models such as neural networks, the loss surface is often non-convex.
The bowl shape is key. No matter where you start on this surface, if you always step downhill, you reach the single lowest point. This is exactly what gradient descent does. To learn more about gradient descent, check out this post.
4. Training with Gradient Descent
The update rules for two weights are a natural extension of the single-weight case:
\[w_1 := w_1 - \alpha \cdot \frac{\partial J}{\partial w_1} \qquad w_2 := w_2 - \alpha \cdot \frac{\partial J}{\partial w_2}\]where the partial derivatives are:
\[\frac{\partial J}{\partial w_1} = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right) \cdot x_1^{(i)} \qquad \frac{\partial J}{\partial w_2} = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right) \cdot x_2^{(i)}\]Each derivative tells us how the cost changes if we slightly increase that weight, and we step in the opposite direction (minus sign) to reduce the cost. The learning rate \(\alpha\) controls how big each step is.
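A minimal gradient descent loop implementing exactly these two update rules (again with \(b = 0\) and the invented toy data; the learning rate and iteration count are arbitrary choices that happen to converge here):

```python
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [4.0, 2.0]])
y = np.array([5.0, 4.0, 9.0, 8.0])

w1, w2 = 0.0, 0.0  # start anywhere on the bowl
alpha = 0.05       # learning rate

for _ in range(500):
    error = w1 * X[:, 0] + w2 * X[:, 1] - y  # y_hat - y for every point
    grad_w1 = np.mean(error * X[:, 0])       # dJ/dw1
    grad_w2 = np.mean(error * X[:, 1])       # dJ/dw2
    # Update both weights simultaneously, stepping downhill.
    w1 -= alpha * grad_w1
    w2 -= alpha * grad_w2

print(w1, w2)  # should settle near the bottom of the bowl
```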
After running gradient descent for enough iterations, the green dot settles at the bowl’s minimum, and the plane in the right panel fits through the data. The convergence curve shows the cost rapidly decreasing at first and then flattening as it approaches the minimum, the same pattern we saw in the single-feature case.
What to Learn From This
With one feature, the model is a line; with two features, it becomes a plane; and with \(n\) features, it becomes a hyperplane in \((n+1)\)-dimensional space. Each weight \(w_k\) independently controls how much feature \(x_k\) contributes to the prediction, so setting \(w_k = 0\) means we ignore feature \(k\). Gradient descent generalizes naturally because each weight gets its own partial derivative, and all weights update simultaneously each iteration. We kept \(b = 0\) intentionally for clarity, but adding a bias \(b\) only adds one more dimension to the cost surface and changes nothing about how the algorithm works.
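As a closing sketch, the same algorithm in vectorized form handles any number of features at once; the two-feature case is just \(n = 2\). The helper name and the trick of folding the bias into a constant feature of 1 are illustrative conventions, not from the demo:

```python
import numpy as np

def fit_linear_regression(X, y, alpha=0.1, iters=2000):
    """Gradient descent for y_hat = X @ w with any number of features.
    Prepend a column of ones to X if a bias term is wanted."""
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(iters):
        grad = X.T @ (X @ w - y) / m  # all n partial derivatives at once
        w -= alpha * grad
    return w

# Two-feature example: the leading column of ones acts as the bias feature.
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 1.0],
              [1.0, 3.0, 3.0],
              [1.0, 4.0, 2.0]])
y = np.array([5.0, 4.0, 9.0, 8.0])
print(fit_linear_regression(X, y))  # [b, w1, w2]
```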
Continue the ML Series
This post is part of a bigger Interactive Machine Learning series. If you would like to learn more, check out the other posts in this series.