This is the follow-up to the first post, Logistic Regression. There we built logistic regression for a single feature, one input \(x\) (hours studied), and the decision boundary was a single point on the number line separating pass from fail. Now we take the natural next step to two input features: predicting whether a student passes an exam from both hours studied and attendance score. The decision boundary goes from a point on a line to an actual line in 2D space.

The full model with two features is:

\[h(x) = \sigma(w_1 x_1 + w_2 x_2 + b)\]

where \(\sigma(z) = \frac{1}{1 + e^{-z}}\) is the sigmoid function. The weights \(w_1\) and \(w_2\) control the orientation of the decision boundary line, and \(b\) (the bias) shifts it. The decision boundary itself is the set of points where \(w_1 x_1 + w_2 x_2 + b = 0\).


1. From a Point to a Line

With one feature, the decision boundary was a single threshold on the \(x\)-axis: for positive \(w\), everything to the right of it is classified as 1 and everything to the left as 0:

\[h(x) = \sigma(w \cdot x + b)\]

With two features, the decision boundary becomes a line in 2D space (\(x_1\) vs \(x_2\)):

\[h(x) = \sigma(w_1 x_1 + w_2 x_2 + b)\]

The weight \(w_1\) controls how much hours studied (\(x_1\)) influences the prediction: a larger \(w_1\) means each additional study hour pushes the prediction further toward passing. The weight \(w_2\) does the same for attendance score (\(x_2\)), independently of \(w_1\). The bias \(b\) shifts the decision boundary without changing its orientation; a more negative \(b\) makes the model harder to satisfy, requiring more study and attendance before it predicts a pass. The decision boundary is where the model predicts exactly 50% probability, that is, where \(w_1 x_1 + w_2 x_2 + b = 0\): on one side the model predicts pass, on the other fail.
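To make the model concrete, here is a minimal sketch in Python; the weight values below are made up for illustration, not fitted to data:

```python
import numpy as np

def sigmoid(z):
    # Logistic function: maps any real number to (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x1, x2, w1, w2, b):
    # Probability of passing for hours studied x1 and attendance score x2
    return sigmoid(w1 * x1 + w2 * x2 + b)

# Illustrative parameter values (not fitted to real data)
w1, w2, b = 0.8, 0.05, -6.0
print(predict_proba(5.0, 70.0, w1, w2, b))   # high probability of "pass" (~0.82)
print(predict_proba(1.0, 30.0, w1, w2, b))   # much lower probability (~0.02)
```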


2. Seeing the Data in 3D

Below is a 3D scatter plot of the training data. Each point lives at \((x_1, x_2, y)\) in space, where \(y\) is either 0 (fail) or 1 (pass), and the curved surface is the sigmoid probability surface \(h(x) = \sigma(w_1 x_1 + w_2 x_2 + b)\) that smoothly transitions from 0 to 1. The decision boundary is where the surface crosses the 0.5 probability level, with the model predicting pass on one side and fail on the other.

[Interactive demo: sigmoid surface. Drag w₁ and w₂ to tilt the surface and b to shift it; click Fit to run gradient descent on all three parameters.]

Notice how the sigmoid surface curves between 0 and 1, red points (fail) sit near the bottom where the surface is low, and blue points (pass) cluster near the top. The dashed line at P = 0.5 shows the decision boundary. Points near this boundary are the hardest to classify since the sigmoid outputs values close to 0.5 there.
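If you want a static version of this picture outside the demo, a minimal matplotlib sketch along these lines works; the data here are synthetic and the parameter values are illustrative assumptions:

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Synthetic data: hours studied (0-10) and attendance score (0-100)
hours = rng.uniform(0, 10, 60)
attendance = rng.uniform(0, 100, 60)
# Label with an assumed "true" rule plus noise
passed = (0.8 * hours + 0.05 * attendance - 6.0 + rng.normal(0, 1, 60)) > 0

# Illustrative parameters for the plotted probability surface
w1, w2, b = 0.8, 0.05, -6.0
g1, g2 = np.meshgrid(np.linspace(0, 10, 50), np.linspace(0, 100, 50))
prob = sigmoid(w1 * g1 + w2 * g2 + b)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.plot_surface(g1, g2, prob, alpha=0.5, cmap="viridis")
ax.scatter(hours, attendance, passed.astype(float),
           c=["tab:blue" if p else "tab:red" for p in passed])
ax.set_xlabel("hours studied (x1)")
ax.set_ylabel("attendance (x2)")
ax.set_zlabel("P(pass)")
plt.show()
```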


3. The Decision Boundary

The boundary equation \(w_1 x_1 + w_2 x_2 + b = 0\) is just the equation of a line. You can rearrange it to the familiar slope-intercept form:

\[x_2 = -\frac{w_1}{w_2}\,x_1 - \frac{b}{w_2}\]

(This rearranged form assumes \(w_2 \neq 0\).)

This makes it clear what each parameter controls:

  • The slope of the boundary is \(-w_1 / w_2\). Changing the ratio of \(w_1\) to \(w_2\) rotates the line.

  • The intercept is \(-b / w_2\). Changing \(b\) slides the line up or down without rotating it.

On one side of the boundary (where \(w_1 x_1 + w_2 x_2 + b > 0\)), the sigmoid outputs values above 0.5, so the model predicts class 1. On the other side, it predicts class 0. The further a point is from the boundary, the more confident the prediction. As in the single-feature case, the 0.5 threshold is a policy choice. You could choose a different threshold to trade off precision and recall, but 0.5 is the most common default.
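As a quick check of this algebra, here is a small sketch (illustrative weight values, assuming \(w_2 \neq 0\)):

```python
w1, w2, b = 0.8, 0.05, -6.0   # illustrative values, not fitted

# Slope-intercept form of the boundary: x2 = -(w1/w2) * x1 - b/w2
slope = -w1 / w2
intercept = -b / w2
print(f"boundary: x2 = {slope:.2f} * x1 + {intercept:.2f}")

def predict(x1, x2):
    # The sign of the linear score decides the class; its magnitude
    # (distance from the boundary, up to a scale) reflects confidence.
    score = w1 * x1 + w2 * x2 + b
    return int(score > 0)          # 1 = pass, 0 = fail (0.5 threshold)

print(predict(8, 90))   # far on the "pass" side -> 1
print(predict(1, 20))   # far on the "fail" side -> 0
```

Note that scaling \(w_1\), \(w_2\), and \(b\) by the same positive constant leaves the boundary line unchanged; it only makes the sigmoid steeper, i.e., the predictions more confident.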


4. The Cost Surface

For the visualizations below, we set \(b = 0\) so the cost depends on only two variables, \(w_1\) and \(w_2\). This lets us plot the cost as a 3D surface and a 2D contour, something impossible with three free parameters. The bias slider above still works for exploring the full model; down here we focus on the weight landscape.

The cost function for logistic regression is the binary cross-entropy:

\[J(w_1,w_2) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h(x^{(i)}) + (1-y^{(i)})\log(1-h(x^{(i)}))\right]\]

where \(h(x^{(i)}) = \sigma(w_1 x_1^{(i)} + w_2 x_2^{(i)})\). Every possible combination of \(w_1\) and \(w_2\) produces a different cost, and plotting the cost over all combinations gives a cost surface. Unlike linear regression's perfectly quadratic bowl, this surface is not a paraboloid, but binary cross-entropy is still convex in the parameters, so there is a single basin; for typical non-separable data the minimum sits at finite weights.
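A minimal NumPy sketch of this cost (bias fixed at 0 to match the visualization; the dataset is tiny and made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w1, w2, x1, x2, y, eps=1e-12):
    # Binary cross-entropy J(w1, w2) with the bias fixed at 0
    h = sigmoid(w1 * x1 + w2 * x2)
    h = np.clip(h, eps, 1 - eps)          # avoid log(0)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

# Tiny made-up dataset: (hours studied, attendance), label 1 = pass
x1 = np.array([2.0, 4.0, 6.0, 8.0])
x2 = np.array([30.0, 50.0, 70.0, 90.0])
y  = np.array([0, 0, 1, 1])

print(cost(0.0, 0.0, x1, x2, y))     # ln 2 ≈ 0.693: zero weights predict 0.5 for everything
print(cost(1.0, -0.08, x1, x2, y))   # ≈ 0.52: these weights separate the classes better
# (With b fixed at 0 the boundary passes through the origin, so for these
#  all-positive features one weight must be negative to get a useful boundary.)
```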

[Interactive demo: 3D cost surface J(w₁, w₂) and a contour plot (top-down view); the cost and the current weight position update as you adjust w₁ and w₂ above.]

Cross-entropy loss gives a large penalty when the model is very confident and still wrong. For example, if the model predicts a very high probability for “pass” but the true label is “fail,” the loss becomes large. Because of this, the cost surface rises quickly in regions where the weights lead to strongly wrong predictions. The minimum is reached when the weights produce predictions that separate the two classes as well as possible.
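A quick numeric illustration of that penalty, for a single example whose true label is fail:

```python
import numpy as np

# Loss for one example with true label y = 0 (fail)
for p_pass in [0.5, 0.9, 0.99, 0.999]:
    loss = -np.log(1 - p_pass)
    print(f"predicted P(pass) = {p_pass:>5}: loss = {loss:.2f}")
# 0.69, 2.30, 4.61, 6.91 -- the loss grows without bound
# as the wrong prediction gets more confident
```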


5. Training with Gradient Descent

The update rules for the two weights have the same form as in linear regression, but with the sigmoid applied:

\[w_1 := w_1 - \alpha \cdot \frac{\partial J}{\partial w_1} \qquad w_2 := w_2 - \alpha \cdot \frac{\partial J}{\partial w_2}\]

where the partial derivatives are:

\[\frac{\partial J}{\partial w_1} = \frac{1}{m}\sum_{i=1}^{m}\left(h(x^{(i)}) - y^{(i)}\right) \cdot x_1^{(i)} \qquad \frac{\partial J}{\partial w_2} = \frac{1}{m}\sum_{i=1}^{m}\left(h(x^{(i)}) - y^{(i)}\right) \cdot x_2^{(i)}\]

Notice how similar this looks to the linear regression gradient. The only difference is that \(\hat{y}^{(i)}\) is replaced by \(h(x^{(i)}) = \sigma(w_1 x_1^{(i)} + w_2 x_2^{(i)})\), so the sigmoid introduces nonlinearity but the gradient formula stays elegant.
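Putting the pieces together, here is a minimal training loop for the two weights (bias fixed at 0 to match the demo; the learning rate and iteration count are arbitrary choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit(x1, x2, y, alpha=0.001, iters=20000):
    # Batch gradient descent on w1 and w2 with the bias fixed at 0
    w1 = w2 = 0.0
    for _ in range(iters):
        h = sigmoid(w1 * x1 + w2 * x2)    # current predictions
        dw1 = np.mean((h - y) * x1)       # ∂J/∂w1
        dw2 = np.mean((h - y) * x2)       # ∂J/∂w2
        w1 -= alpha * dw1                 # simultaneous update
        w2 -= alpha * dw2
    return w1, w2

# Same tiny made-up dataset as in the cost sketch above.
# The learning rate is kept small because x2 is on a much larger scale
# than x1; standardizing the features would allow a larger one.
x1 = np.array([2.0, 4.0, 6.0, 8.0])
x2 = np.array([30.0, 50.0, 70.0, 90.0])
y  = np.array([0, 0, 1, 1])
w1, w2 = fit(x1, x2, y)
print(w1, w2, sigmoid(w1 * x1 + w2 * x2))   # learned weights and fitted probabilities
```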

[Interactive demo: gradient descent path on the cost surface, the sigmoid surface converging to the data, and the cost vs. iteration curve, with a readout of the current iteration, w₁, w₂, and cost (b = 0).]

After running gradient descent for enough iterations, the green dot settles near the cost minimum, and the sigmoid surface on the right panel shapes itself to separate the classes as well as possible. The convergence curve shows the cost rapidly decreasing at first and then flattening as it approaches the minimum. Note that the gradient descent demo fixes \(b = 0\), so it will not find a perfect boundary in general. The Fit button in the first demo optimizes all three parameters (\(w_1\), \(w_2\), and \(b\)) for the best result.
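For completeness, extending the loop to all three parameters only adds one line: the gradient with respect to \(b\) is the same error term without a feature factor. A sketch, reusing the dataset from above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_full(x1, x2, y, alpha=0.001, iters=20000):
    # Same loop as before, now also updating the bias
    w1 = w2 = b = 0.0
    for _ in range(iters):
        h = sigmoid(w1 * x1 + w2 * x2 + b)
        err = h - y
        w1 -= alpha * np.mean(err * x1)   # ∂J/∂w1
        w2 -= alpha * np.mean(err * x2)   # ∂J/∂w2
        b  -= alpha * np.mean(err)        # ∂J/∂b: same error term, no feature factor
    return w1, w2, b
```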


What to Learn From This

  • With one feature, the decision boundary is a point on the number line; with two features it becomes a line in 2D; with \(n\) features it becomes a hyperplane in \(n\)-dimensional space.

  • Each weight \(w_k\) controls how much feature \(x_k\) contributes to the classification, so setting \(w_k = 0\) means feature \(k\) is ignored.

  • The binary cross-entropy cost is convex in the parameters, so there are no local minima to get stuck in, and gradient descent with a suitable learning rate converges toward the global minimum.

  • The gradient formulas look almost identical to those of linear regression; the only difference is that the prediction \(\hat{y}\) is replaced by \(h(x) = \sigma(w \cdot x + b)\). This elegance comes from pairing the sigmoid with the cross-entropy cost.

  • We kept \(b = 0\) for the cost-surface and gradient-descent visualizations because the bias only adds one more dimension and changes nothing about how the algorithm works; the first demo above lets you explore the full model with all three parameters.

Continue the ML Series

This post is part of a bigger Interactive Machine Learning series. If you would like to learn more, check out the other posts in this series.
