Logistic regression is one of the most fundamental classification algorithms in machine learning. Despite the word “regression” in its name, logistic regression is used for classification, not regression. It is the natural next step after linear regression and shares many of the same ideas: a hypothesis function, a cost function, and gradient descent for optimization. In this interactive guide, we will build logistic regression completely from scratch using a concrete, intuitive example of predicting whether a student passes an exam based on the number of hours they studied. Given a set of students where we know both how many hours they studied and whether they passed, can we learn a model that predicts whether a new student will pass?
In this guide, you will:
- Understand the sigmoid function that maps any real number to a probability between 0 and 1
- Build the hypothesis function and the binary cross-entropy cost (log loss)
- Run gradient descent and explore the decision boundary that separates the two classes
- Apply the trained model on new data to make predictions
1. What is Classification?
In regression, we predict a continuous value such as house prices, while in classification, we predict a discrete category. The simplest form is binary classification with exactly two possible outcomes such as spam or not spam, pass or fail, tumor malignant or benign, or customer will buy or not buy. We encode these two outcomes as 0 and 1, where \(y = 0\) means the negative class (fail, not spam, benign) and \(y = 1\) means the positive class (pass, spam, malignant).
Why Not Use Linear Regression for Classification?
You might wonder whether we can just fit a straight line and use a threshold, classifying as 1 if the line predicts above 0.5 and as 0 if below. The problem is that linear regression can produce predictions far below 0 or far above 1, so for a student who studied 20 hours the line might predict 2.5, and for one who studied 0 hours it might predict -0.3, neither of which is a meaningful probability. What we really want is a model that always outputs a value between 0 and 1 we can interpret as a probability, for example “there is a 0.87 probability that this student will pass,” and this is exactly what logistic regression gives us.
2. The Training Dataset
Every machine learning model starts with data. Below we have 16 students with their hours studied and whether they passed (1) or failed (0). This is our training dataset, the set of labeled examples from which the model will learn patterns.
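If you want to follow along in code, here is the same dataset written as plain Python lists; these are exactly the values used in the full implementation in Section 9, along with a quick summary of the pattern.

```python
# Hours studied for 16 students and whether each passed (1) or failed (0)
X = [1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5, 8, 9]
y = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1]

# Students who failed studied far fewer hours on average than those who passed
fail_hours = [x for x, label in zip(X, y) if label == 0]
pass_hours = [x for x, label in zip(X, y) if label == 1]
print(f"Average hours (fail): {sum(fail_hours) / len(fail_hours):.2f}")  # 2.81
print(f"Average hours (pass): {sum(pass_hours) / len(pass_hours):.2f}")  # 6.75
```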
Notice the pattern that students who studied fewer hours tend to fail, while those who studied more tend to pass, with a region in the middle where the outcome is less certain. Our model needs to learn this boundary.
Looking at the plot, you can see a clear pattern: low study hours cluster near y=0 (fail) and high study hours cluster near y=1 (pass). Our goal is to find a smooth curve that separates these two classes, one that gives us a probability of passing for any number of hours studied.
3. The Sigmoid Function
We need a function that takes any real number and squashes it into the range \((0, 1)\). This function is the sigmoid (also called the logistic function):
\[\sigma(z) = \frac{1}{1 + e^{-z}}\]The key properties of the sigmoid are that for large positive \(z\), \(e^{-z} \to 0\) so \(\sigma(z) \to 1\); for large negative \(z\), \(e^{-z} \to \infty\) so \(\sigma(z) \to 0\); at \(z = 0\) we get \(\sigma(0) = 1/(1+1) = 0.5\); and the output is always strictly between 0 and 1.
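To make these properties concrete, here is the sigmoid as a small Python helper (the same function reappears in the full implementation in Section 9):

```python
import math

def sigmoid(z):
    """Squash any real number z into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(10))    # ~0.99995 -> large positive z pushes the output toward 1
print(sigmoid(-10))   # ~0.00005 -> large negative z pushes the output toward 0
print(sigmoid(0))     # 0.5      -> z = 0 sits exactly halfway
```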
The sigmoid function is the key ingredient that transforms linear regression into logistic regression. Instead of predicting raw values, we pass the linear output through the sigmoid to get a probability.
4. The Hypothesis Function
In logistic regression, prediction happens in two simple steps:
- Compute a linear score \(z = w \cdot x + b\)
- Convert that score into a probability using sigmoid \(h(x) = \sigma(z) = \frac{1}{1 + e^{-z}}\)
So, written in one line:
\[h(x) = \sigma(w \cdot x + b) = \frac{1}{1 + e^{-(w \cdot x + b)}}\]This means the linear part \(w \cdot x + b\) gives a raw score, the sigmoid turns that score into a value between 0 and 1, and that value is the predicted probability of class 1 (for example, the probability of passing). The two parameters \(w\) and \(b\) control the shape and position of the sigmoid curve when plotted against the input \(x\). The weight \(w\) controls how steep the curve is, so larger \(\vert w \vert\) means a sharper transition from 0 to 1, and if \(w > 0\) the probability increases as \(x\) increases. The bias \(b\) moves the curve left or right and sets where the model reaches 0.5 probability, with the decision boundary at \(x = -b/w\). Together, \(w\) and \(b\) define the full probability curve, and training logistic regression means finding the values that make this curve fit the data best.
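Here is the two-step hypothesis written out in Python. The parameter values \(w = 1.5\) and \(b = -7.5\) are made-up numbers chosen only to illustrate the roles of the weight and bias, not trained values:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(x, w, b):
    """Predicted probability of class 1 (pass) for input x."""
    return sigmoid(w * x + b)      # step 1: linear score, step 2: squash to (0, 1)

w, b = 1.5, -7.5                   # illustrative values, not trained
print(hypothesis(3.0, w, b))       # ~0.05 -> few hours, low chance of passing
print(hypothesis(7.0, w, b))       # ~0.95 -> many hours, high chance of passing
print(-b / w)                      # 5.0   -> the x where h(x) = 0.5
```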
Notice how the weight controls the sharpness of the transition, while the bias slides the transition point left or right. The purple dashed line shows the decision boundary, the value of \(x\) where the model switches from predicting fail to predicting pass. To find the best sigmoid curve, we need a way to measure how well it fits the data, which is what the cost function does.
5. The Cost Function (Binary Cross-Entropy)
For linear regression, we used Mean Squared Error. Can we use it here? Technically yes, but it creates problems. When MSE is combined with the sigmoid function, the resulting cost surface is non-convex, with flat plateaus and local minima where gradient descent can stall or get stuck.
Instead, we use binary cross-entropy (also called log loss):
\[J(w,b) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)} \cdot \log(h(x^{(i)})) + (1-y^{(i)}) \cdot \log(1-h(x^{(i)}))\right]\]For a single data point, the cost behaves intuitively. When \(y = 1\) (actual = pass), the cost is \(-\log(h(x))\), so if the model predicts \(h(x)\) close to 1 (correct) the penalty is \(-\log(1) = 0\), and if it predicts close to 0 (wrong) the penalty \(-\log(0) \to \infty\). When \(y = 0\) (actual = fail), the cost is \(-\log(1 - h(x))\), with the same logic flipped: predicting close to 0 gives no penalty, while predicting close to 1 gives an enormous penalty. The log loss therefore punishes confident wrong predictions severely, so a model that says “99% chance of pass” when the student failed pays a large cost, which is exactly what we want.
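A minimal sketch of the per-example penalty makes this concrete; the predicted probabilities 0.99 and 0.01 below are arbitrary illustrative values:

```python
import math

def point_loss(y, h):
    """Log loss contributed by one example with label y and predicted probability h."""
    return -math.log(h) if y == 1 else -math.log(1 - h)

print(point_loss(1, 0.99))   # ~0.01 -> confident and correct: tiny penalty
print(point_loss(1, 0.01))   # ~4.61 -> confident and wrong: huge penalty
print(point_loss(0, 0.99))   # ~4.61 -> "99% chance of pass" for a student who failed
```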
Manually tuning \(w\) and \(b\) to minimize this cost is difficult. Just like with linear regression, we need an automated algorithm to find the optimal parameters. But first, let us visualize what the cost landscape looks like.
6. The Cost Landscape
Every possible combination of \(w\) and \(b\) produces a different log loss value \(J(w,b)\), and plotting the cost for all combinations gives us a cost surface. For binary logistic regression with log loss, the objective is convex in parameters, so for typical non-perfectly-separable data we get a single basin and a unique finite minimum.
The lightest region on the contour plot represents the lowest cost, the optimal parameters. Notice how this dataset forms a clear single basin, so gradient descent can reliably move toward the minimum.
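As a rough sketch of this landscape in code, we can evaluate \(J(w, b)\) on a coarse grid and report the lowest point found; the grid ranges below are arbitrary illustrative choices:

```python
import math

X = [1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5, 8, 9]
y = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1]

def cost(w, b, eps=1e-15):
    """Log loss J(w, b) over the training data."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        h = min(max(1.0 / (1.0 + math.exp(-(w * x_i + b))), eps), 1 - eps)
        total += y_i * math.log(h) + (1 - y_i) * math.log(1 - h)
    return -total / len(X)

# Scan a coarse grid: w in [0, 3] (step 0.1), b in [-15, 0] (step 0.5)
grid = [(wi / 10.0, bi / 2.0) for wi in range(0, 31) for bi in range(-30, 1)]
w_best, b_best = min(grid, key=lambda p: cost(*p))
print(f"Lowest grid cost J = {cost(w_best, b_best):.4f} at w = {w_best}, b = {b_best}")
```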
7. Gradient Descent
The gradient descent algorithm for logistic regression follows the same structure as linear regression. The key difference is that the hypothesis function now uses the sigmoid. The gradients turn out to have the same elegant form:
\[\frac{\partial J}{\partial w} = \frac{1}{m}\sum_{i=1}^{m}\left(h(x^{(i)}) - y^{(i)}\right) \cdot x^{(i)}\] \[\frac{\partial J}{\partial b} = \frac{1}{m}\sum_{i=1}^{m}\left(h(x^{(i)}) - y^{(i)}\right)\]These look identical to the linear regression gradients! The difference is hidden inside \(h(x^{(i)})\), which now equals \(\sigma(w \cdot x^{(i)} + b)\) instead of \(w \cdot x^{(i)} + b\).
The update rules are:
\[w := w - \alpha \cdot \frac{\partial J}{\partial w}\] \[b := b - \alpha \cdot \frac{\partial J}{\partial b}\]Just like in linear regression, the learning rate \(\alpha\) controls the step size: too small a value makes convergence slow, while too large a value makes the algorithm overshoot and diverge.
After running gradient descent for enough iterations, the green dot settles at the bottom of the cost surface (minimum cost), and the sigmoid curve fits the data well. The convergence curve shows the cost rapidly decreasing at first and then flattening as it approaches the minimum.
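Before assembling the full loop in Section 9, here is a single gradient-descent step on the training data, starting from \(w = 0, b = 0\) with \(\alpha = 0.1\) (the same starting point and learning rate used later):

```python
import math

X = [1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5, 8, 9]
y = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w, b, alpha = 0.0, 0.0, 0.1
m = len(X)

h = [sigmoid(w * x + b) for x in X]                    # every prediction starts at 0.5
dw = sum((h[i] - y[i]) * X[i] for i in range(m)) / m   # dJ/dw ≈ -0.9844
db = sum(h[i] - y[i] for i in range(m)) / m            # dJ/db = 0 (8 passes, 8 fails)

w -= alpha * dw
b -= alpha * db
print(f"After one step: w = {w:.4f}, b = {b:.4f}")     # w = 0.0984, b = 0.0000
```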
8. The Decision Boundary
Once we have trained the model and found the optimal \(w\) and \(b\), we need a rule for converting the predicted probability into a class prediction. The standard approach is to use a threshold of 0.5: predict class 1 (pass) if \(h(x) \geq 0.5\) and class 0 (fail) otherwise. The decision boundary is the value of \(x\) where \(h(x) = 0.5\), and since \(\sigma(z) = 0.5\) when \(z = 0\), the decision boundary occurs when:
\[w \cdot x + b = 0 \quad \Rightarrow \quad x = -\frac{b}{w}\]Everything to the left of this boundary is classified as fail, and everything to the right is classified as pass (assuming \(w > 0\)).
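In code the boundary is a one-liner; the parameters \(w = 2.0\) and \(b = -10.0\) below are the same illustrative trained values used in the prediction example of Section 10:

```python
w, b = 2.0, -10.0                  # illustrative trained parameters (see Section 10)
boundary = -b / w
print(f"Decision boundary at x = {boundary} hours")    # 5.0
# With w > 0: fewer than 5.0 hours -> predict FAIL, more than 5.0 -> predict PASS
```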
The decision boundary is a powerful concept. In our one-dimensional example, it is a single point on the x-axis. In higher dimensions (multiple features), the decision boundary becomes a line, a plane, or a hyperplane that separates the classes.
9. Implementing from Scratch
Let us put together the complete algorithm step-by-step:
Algorithm: Single-Feature (Univariate) Logistic Regression
- Initialize \(w = 0\) and \(b = 0\) (starting point)
- Choose a learning rate \(\alpha\) and number of iterations
- For each iteration, repeat:
- Compute predictions: \(h(x^{(i)}) = \sigma(w \cdot x^{(i)} + b)\) for all data points
- Compute gradients:
- \[\frac{\partial J}{\partial w} = \frac{1}{m}\sum_{i=1}^{m}(h(x^{(i)}) - y^{(i)}) \cdot x^{(i)}\]
- \[\frac{\partial J}{\partial b} = \frac{1}{m}\sum_{i=1}^{m}(h(x^{(i)}) - y^{(i)})\]
- Update parameters:
- \[w := w - \alpha \cdot \frac{\partial J}{\partial w}\]
- \[b := b - \alpha \cdot \frac{\partial J}{\partial b}\]
```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_regression(X, y, lr=0.1, iterations=3000):
    w, b = 0.0, 0.0
    m = len(X)
    for _ in range(iterations):
        # Predictions
        h = [sigmoid(w * x + b) for x in X]
        # Gradients
        dw = sum((h[i] - y[i]) * X[i] for i in range(m)) / m
        db = sum((h[i] - y[i]) for i in range(m)) / m
        # Update parameters
        w -= lr * dw
        b -= lr * db
    # Final cost (log loss) using final parameters
    eps = 1e-15
    h = [min(max(sigmoid(w * x + b), eps), 1 - eps) for x in X]
    cost = -sum(
        y[i] * math.log(h[i]) +
        (1 - y[i]) * math.log(1 - h[i])
        for i in range(m)
    ) / m
    return w, b, cost

# Example usage
X = [1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5, 8, 9]
y = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1]
w, b, cost = logistic_regression(X, y)
print(f"w = {w:.4f}, b = {b:.4f}, cost = {cost:.4f}")

# Predict
hours = 5.0
prob = sigmoid(w * hours + b)
print(f"P(pass | {hours} hours) = {prob:.4f}")
print(f"Prediction: {'PASS' if prob >= 0.5 else 'FAIL'}")
```
10. Making Predictions
Once we have trained our model and found the optimal values of \(w\) and \(b\), making predictions is straightforward: compute the probability \(P(\text{pass}) = \sigma(w_{trained} \cdot x_{new} + b_{trained})\) and apply the threshold to predict PASS if \(P(\text{pass}) \geq 0.5\) and FAIL otherwise. For example, if we trained and found \(w = 2.0\) and \(b = -10.0\), then for a student who studied 6 hours:
\[P(\text{pass}) = \sigma(2.0 \times 6 + (-10.0)) = \sigma(2.0) = 0.88\]Since \(0.88 \geq 0.5\), we predict PASS, with the model’s estimated probability being 88%.
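You can check this worked example in a couple of lines:

```python
import math

w, b, hours = 2.0, -10.0, 6.0                      # values from the example above
prob = 1.0 / (1.0 + math.exp(-(w * hours + b)))    # sigmoid(2.0) ≈ 0.8808
print(f"P(pass | {hours} hours) = {prob:.2f}")     # 0.88
print(f"Prediction: {'PASS' if prob >= 0.5 else 'FAIL'}")
```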
Summary
Here is everything we covered, building logistic regression completely from the ground up:
| Concept | What it does | Formula |
|---|---|---|
| Sigmoid function | Squashes any value to (0,1) | \(\sigma(z) = \frac{1}{1+e^{-z}}\) |
| Hypothesis function | Predicts probability of class 1 | \(h(x) = \sigma(wx + b)\) |
| Cost function (log loss) | Measures classification error | \(J = -\frac{1}{m}\sum[y\log(h) + (1-y)\log(1-h)]\) |
| Gradient | Direction of steepest ascent | \(\frac{\partial J}{\partial w}, \frac{\partial J}{\partial b}\) |
| Gradient descent | Updates parameters to reduce cost | \(w := w - \alpha \frac{\partial J}{\partial w},\; b := b - \alpha \frac{\partial J}{\partial b}\) |
| Decision boundary | Threshold for classification | \(x = -b/w\) (where \(h(x) = 0.5\)) |
| Prediction | Uses trained model on new data | \(\hat{y} = \begin{cases}1 & h(x) \geq 0.5 \\ 0 & h(x) < 0.5\end{cases}\) |
The logistic regression model shares the same fundamental framework as linear regression: hypothesis, cost function, and gradient descent. The key differences are the sigmoid activation, the log loss cost function, and the classification threshold.
Continue the ML Series
This post is part of a bigger Interactive Machine Learning series. If you would like to learn more, check out the other posts in this series.
References
- Ng, A. (2012). Machine Learning. Coursera / Stanford University. https://www.coursera.org/learn/machine-learning