Linear regression is one of the most important starting points in machine learning. It is simple enough to understand deeply, but powerful enough to teach ideas used in larger models: defining a model, measuring error, and improving parameters with optimization. In this interactive guide, we will build linear regression from scratch around one concrete problem: given the areas and prices of past houses, can we learn a formula that predicts the price of a new house from its area?

In this guide, you will:

  • Understand the hypothesis function as the prediction formula
  • See how the cost function measures prediction error
  • Use gradient descent to optimize parameters and learn why the learning rate matters
  • Apply trained parameters to make predictions on new inputs

1. What is Linear Regression?

Linear regression is a supervised learning algorithm. “Supervised” means we train on examples where both input and correct output are known. The model learns a mapping from input to output, then uses that mapping on unseen inputs.

In our example:

  • Input (feature): House area in square feet (we call this \(x\))
  • Output (label): House price in thousands of dollars (we call this \(y\))

The “linear” part means the model assumes a straight-line relationship between input and output. It is the simplest possible form, and often a strong baseline.

A simple linear equation looks like:

\[y = m \cdot x + c\]

\(m\) is the slope of the line and \(c\) is the y-intercept (where the line crosses the y-axis). In machine learning, we use different notation:

\[\hat{y} = w \cdot x + b\]

where:

  • \(w\) stands for weight (same as slope \(m\))
  • \(b\) stands for bias (same as y-intercept \(c\))
  • \(\hat{y}\) (“y-hat”) is the predicted value (to distinguish it from the actual value \(y\))

The goal is simple: given data points \((x, y)\), find \(w\) and \(b\) so that \(\hat{y} = wx + b\) fits the data as closely as possible.


2. The Training Dataset

Every machine learning model starts with data. Below, we have 10 houses with area (sq ft) and price (in $1000s). This is our training dataset: labeled examples the model uses to learn a pattern. It is only a simplified example and does not represent real market prices, but it is sufficient for understanding linear regression.

Settings: 10 starter points (area in sq ft vs price in $1000s). Click to add a point, drag to move, double-click to remove. All demos below share this dataset.

Looking at the plot, there is a clear trend: larger area usually means higher price. The points are not perfectly on one line, and that is normal. Our goal is to find a best-fit line that keeps overall error as small as possible across all points.
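
If you would like to follow along in code, a dataset like this can be stored as two parallel lists. The numbers below are illustrative placeholders in the same ranges as the demo, not its exact points:

# Illustrative training data (placeholder values, not the demo's exact points).
# Areas are in sq ft, prices in $1000s.
areas  = [850, 1100, 1400, 1600, 1800, 2100, 2400, 2700, 3200, 3800]
prices = [160, 210, 250, 280, 310, 360, 400, 440, 520, 600]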


3. The Hypothesis Function

In machine learning, the hypothesis function is the model’s prediction formula. It maps input to predicted output. For linear regression, the hypothesis is:

\[h(x) = w \cdot x + b\]

This is a straight line and the two parameters \(w\) and \(b\) completely determine it. The weight \(w\) controls the slope, so a larger \(w\) means price rises faster with area, \(w = 0\) produces a flat line, and a negative \(w\) flips the slope so price decreases as area increases. The bias \(b\) controls the y-intercept and shifts the line up or down without changing slope, which you can think of as a base level before area contributes through \(w\). Together, \(w\) and \(b\) are the model’s parameters, and training means finding the values that produce the best fit.
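
As a minimal sketch in Python, the hypothesis is a one-line function of the input and the two parameters:

def predict(x, w, b):
    """Hypothesis h(x) = w * x + b: predicted price for a given area."""
    return w * x + b

# With the demo's starting values w = 0.15 and b = 30, a 2000 sq ft house
# is predicted at 0.15 * 2000 + 30 = 330, i.e. about $330,000.
print(predict(2000, w=0.15, b=30))  # 330.0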

Settings: w in [-0.1, 0.4], b in [-100, 300], starting at w = 0.15, b = 30. w controls slope; b shifts the line vertically.

Notice how weight changes steepness while bias shifts vertically. To find the best fit, we need a precise way to measure “how wrong” a line is. That is the cost function.


4. The Cost Function

The cost function (also called loss or objective) is a single number that tells us how wrong the current model is. High cost means predictions are far from actual values. Low cost means the line fits well. Training is the process of finding \(w\) and \(b\) that minimize this value.

The most common cost function for linear regression is the Mean Squared Error (MSE), which traces back to the method of least squares:

\[J(w,b) = \frac{1}{2m}\sum_{i=1}^{m}\left(h(x^{(i)}) - y^{(i)}\right)^2\]

Let us break this down:

  1. \(h(x^{(i)}) - y^{(i)}\) is the error (residual) for one point: prediction minus actual value.

  2. \((\ldots)^2\) squares each error, so positives and negatives do not cancel, and large misses are penalized more strongly.

  3. \(\sum_{i=1}^{m}\) adds squared errors across all \(m\) points.

  4. \(\frac{1}{2m}\) averages over the dataset. The extra \(\frac{1}{2}\) is a convenience that simplifies derivatives.
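
In code, the same formula is a short loop over the dataset. A minimal sketch:

def compute_cost(X, y, w, b):
    """Mean squared error J(w, b), with the 1/(2m) convention."""
    m = len(X)
    total = 0.0
    for i in range(m):
        error = (w * X[i] + b) - y[i]  # residual: prediction minus actual
        total += error ** 2            # square so errors do not cancel
    return total / (2 * m)             # average over m points, with the extra 1/2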

Settings: starting at w = 0.10, b = 80, MSE cost. Red squares show squared error per point; cost J updates live.

Try finding the minimum manually with sliders. It is harder than it looks, because changing \(w\) affects the best value of \(b\), and vice versa. This is why we need an automated optimizer. First, let us visualize the full cost landscape.


5. The Cost Landscape

Every pair \((w, b)\) gives a different cost \(J(w,b)\). If we evaluate many pairs, we get a cost surface: a 3D landscape where horizontal axes are \(w\) and \(b\), and height is cost. For linear regression with MSE, this surface is bowl-shaped (convex). That is useful because it has one global minimum: a single best parameter set.

Contour Plot View

A contour plot is a top-down view of this surface, like a topographic map. Each band represents a cost level. Moving toward lighter center regions means lower cost.
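
Plots like the ones below can be produced by evaluating the cost over a grid of parameter values. A minimal sketch, assuming NumPy is available:

import numpy as np

def cost_grid(X, y, w_values, b_values):
    """Evaluate J(w, b) for every (w, b) combination on a grid."""
    m = len(X)
    J = np.zeros((len(b_values), len(w_values)))
    for i, b in enumerate(b_values):
        for j, w in enumerate(w_values):
            J[i, j] = sum((w * X[k] + b - y[k]) ** 2 for k in range(m)) / (2 * m)
    return J

# Ranges roughly matching the demo: w in [-0.05, 0.35], b in [-100, 300].
w_values = np.linspace(-0.05, 0.35, 50)
b_values = np.linspace(-100, 300, 50)
# J = cost_grid(areas, prices, w_values, b_values)  # feed J into a contour plot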

Settings: starts at w = 0.15, b = 50. Drag the green dot on the contour to set (w, b); the right panel shows the corresponding fitted line. Lighter regions mean lower cost.

3D Surface View

Here is the same cost function visualized as a 3D surface. The bowl shape is easy to see: there is a single lowest point (the global minimum), which corresponds to the optimal parameters.

Settings: cost surface J(w, b) over w in [-0.05, 0.35] and b in [-100, 300]. Drag the sliders to rotate and tilt the view.

The bowl shape is important. In convex problems such as linear regression (and logistic regression with its standard convex loss), moving downhill reaches the same global minimum from any starting point, provided the learning rate is suitable. Gradient descent, introduced next, follows exactly this idea.


6. Gradient Descent

Gradient descent is the optimization algorithm that iteratively moves the parameters toward the minimum of the cost. The same idea scales to much larger models, including neural networks.

The Intuition: Lost on a Foggy Mountain

Imagine you are standing on a mountain in thick fog. You cannot see the whole landscape, but you still need to find your way to the valley floor. What do you do? You cannot jump to the bottom, and you cannot see the best path. However, you can feel the slope of the ground under your feet. You can figure out which direction goes downhill the steepest, take a step in that direction, and repeat. Eventually, you will reach the bottom. This is essentially how gradient descent works.

  1. Compute the gradient at the current \(w, b\) (the direction of steepest increase)
  2. Move opposite to the gradient (downhill)
  3. Repeat until the improvement becomes very small

The Math

The gradient is the vector of partial derivatives of the cost function with respect to each parameter. For our two parameters \(w\) and \(b\):

\[\frac{\partial J}{\partial w} = \frac{1}{m}\sum_{i=1}^{m}\left(h(x^{(i)}) - y^{(i)}\right) \cdot x^{(i)}\] \[\frac{\partial J}{\partial b} = \frac{1}{m}\sum_{i=1}^{m}\left(h(x^{(i)}) - y^{(i)}\right)\]

These derivatives measure local sensitivity: \(\frac{\partial J}{\partial w}\) tells us how the cost changes if we increase \(w\) slightly, and \(\frac{\partial J}{\partial b}\) does the same for \(b\). The update rules are:

\[w := w - \alpha \cdot \frac{\partial J}{\partial w}\] \[b := b - \alpha \cdot \frac{\partial J}{\partial b}\]

The minus sign makes the update move against the gradient, which reduces cost, and the learning rate \(\alpha\) sets step size. If a gradient component is positive, subtracting it decreases that parameter; if it is negative, subtracting it increases that parameter, so one rule handles both directions automatically.
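
As a sketch, a single gradient descent update looks like this in code (the full training loop comes in Section 8):

def gradient_step(X, y, w, b, alpha):
    """One gradient descent update of (w, b) on the MSE cost."""
    m = len(X)
    errors = [(w * X[i] + b) - y[i] for i in range(m)]   # h(x) - y for each point
    dw = sum(errors[i] * X[i] for i in range(m)) / m     # dJ/dw
    db = sum(errors) / m                                  # dJ/db
    return w - alpha * dw, b - alpha * db                 # move against the gradient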

Settings: panels show the optimization path, the fitted line, and cost vs. iteration. Starts at w = 0, b = 0 with default alpha = 1e-7. Step runs one iteration; Run animates continuously; the green path traces the optimization trajectory.

After enough iterations, the green dot settles near the bottom of the bowl and the fitted line stabilizes. Cost usually drops quickly early on, then flattens near convergence.


7. The Learning Rate

The learning rate \(\alpha\) is a crucial hyperparameter that you choose before training. It controls the step size in each gradient descent update. Choosing it well is important: too small a value makes steps tiny and convergence very slow, a reasonable value gives smooth and stable convergence, and too large a value causes updates to overshoot, with cost oscillating or even diverging. There is no universal best value, so in practice you try a few values and watch the cost curve.

A quick rule of thumb is to decrease \(\alpha\) if the cost explodes or oscillates, increase it if the cost decreases very slowly, and keep the largest value that still gives stable, smooth convergence.
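
One way to put this rule of thumb into practice is to run the same number of updates with a few candidate rates and compare the resulting costs. A sketch, assuming the illustrative areas and prices lists from Section 2; the specific rates are hypothetical values for unscaled areas:

# Compare three candidate learning rates on the same data and starting point.
for alpha in (1e-9, 1e-7, 1e-5):
    w, b, m = 0.0, 0.0, len(areas)
    for _ in range(1000):
        dw = sum((w * areas[i] + b - prices[i]) * areas[i] for i in range(m)) / m
        db = sum((w * areas[i] + b - prices[i]) for i in range(m)) / m
        w, b = w - alpha * dw, b - alpha * db
    cost = sum((w * areas[i] + b - prices[i]) ** 2 for i in range(m)) / (2 * m)
    # A diverging run shows up as a huge (or nan) final cost.
    print(f"alpha = {alpha:g}  ->  final cost = {cost:.2f}")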

Settings: three runs labeled too slow, just right, and too fast. Same dataset and starting point (w = 0, b = 0) across all three; only alpha differs. Edit the rate values to compare convergence speeds.

8. Implementing from Scratch

Let us put the complete algorithm together step by step:

Algorithm A: Single-Feature Linear Regression

  1. Initialize \(w = 0\) and \(b = 0\) (starting point)
  2. Choose a learning rate \(\alpha\) and number of iterations
  3. For each iteration, repeat:
    • Compute predictions: \(\hat{y}^{(i)} = w \cdot x^{(i)} + b\) for all data points
    • Compute gradients:
      • \[\frac{\partial J}{\partial w} = \frac{1}{m}\sum_{i=1}^{m}(\hat{y}^{(i)} - y^{(i)}) \cdot x^{(i)}\]
      • \[\frac{\partial J}{\partial b} = \frac{1}{m}\sum_{i=1}^{m}(\hat{y}^{(i)} - y^{(i)})\]
    • Update parameters:
      • \[w := w - \alpha \cdot \frac{\partial J}{\partial w}\]
      • \[b := b - \alpha \cdot \frac{\partial J}{\partial b}\]

Here is the same algorithm as plain Python, using only built-in lists:

def linear_regression(X, y, lr=1e-7, iterations=5000):
    w, b = 0.0, 0.0              # start from w = 0, b = 0
    m = len(X)

    for _ in range(iterations):
        # Predictions with the current parameters.
        y_pred = [w * x + b for x in X]
        # Gradients of the MSE cost with respect to w and b.
        dw = sum((y_pred[i] - y[i]) * X[i] for i in range(m)) / m
        db = sum((y_pred[i] - y[i]) for i in range(m)) / m
        # Move against the gradient, scaled by the learning rate.
        w -= lr * dw
        b -= lr * db

    # Final cost with the 1/(2m) convention.
    cost = sum((w * X[i] + b - y[i])**2 for i in range(m)) / (2 * m)
    return w, b, cost

In this simplified code, we train directly on raw area values, so a very small learning rate is used. In practice, feature scaling usually lets you train with larger and more stable learning rates.
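
For example, training on the illustrative dataset from Section 2 might look like the following sketch. The second half shows one common variant: standardizing the feature first so a much larger learning rate is stable:

# Train directly on raw areas with a tiny learning rate.
w, b, cost = linear_regression(areas, prices, lr=1e-7, iterations=5000)
print(f"w = {w:.4f}, b = {b:.2f}, cost = {cost:.2f}")

# Variant: standardize the feature, then train with a larger, more stable rate.
mean_area = sum(areas) / len(areas)
std_area = (sum((a - mean_area) ** 2 for a in areas) / len(areas)) ** 0.5
scaled_areas = [(a - mean_area) / std_area for a in areas]
w_s, b_s, cost_s = linear_regression(scaled_areas, prices, lr=0.1, iterations=1000)
# Note: w_s and b_s are in scaled units, so new inputs must be scaled the same way.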

Walk through this code step by step and watch each line update live values and the fitted line:

Settings: 5 sample points, lr = 1e-7, 5000 iterations. Step through each line of the algorithm or auto-play to watch the line fit the data.

Settings: edit the parameters above (default lr = 1e-7, 5000 iterations, w = 0, b = 0) and click Run. Trained parameters are saved and used automatically by the Prediction section.

9. Making Predictions

Once we have trained the model and found \(w\) and \(b\), prediction is direct substitution into the hypothesis:

\[\hat{y}_{new} = w_{trained} \cdot x_{new} + b_{trained}\]

For example, if training gives \(w = 0.151\) and \(b = 42.2\), then for a 2800 sq ft house:

\[\hat{y} = 0.151 \times 2800 + 42.2 = 465.0\]

So the predicted price is approximately $465,000 (the model's output of 465.0 is in thousands of dollars). The model did not use hand-written pricing rules; it learned the pattern from the data. One important caveat: predictions are usually more reliable within the training range than far outside it. Predicting the price of a 12,000 sq ft house from data mostly between 800 and 3,800 sq ft is extrapolation, and can be inaccurate.
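
In code, this is just the hypothesis applied with the trained parameters. A minimal sketch using the example values above:

w_trained, b_trained = 0.151, 42.2   # example values from the training run above

def predict_price(area_sqft):
    """Predicted price in $1000s for a house of the given area."""
    return w_trained * area_sqft + b_trained

print(predict_price(2800))   # approximately 465.0, i.e. about $465,000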

Settings: uses trained parameters from above. Click Auto-Train first if needed, then enter an area and click Predict.

Summary

Here is everything we covered, building linear regression completely from the ground up:

Concept | What it does | Formula
Hypothesis function | Predicts output from input | \(h(x) = wx + b\)
Cost function (MSE) | Measures prediction error | \(J = \frac{1}{2m}\sum(h(x^{(i)}) - y^{(i)})^2\)
Gradient | Direction of steepest ascent | \(\frac{\partial J}{\partial w}, \frac{\partial J}{\partial b}\)
Gradient descent | Updates parameters to reduce cost | \(w := w - \alpha \frac{\partial J}{\partial w}\)
Learning rate (\(\alpha\)) | Controls step size | Hyperparameter (you choose)
Prediction | Uses trained model on new data | \(\hat{y} = w_{trained} \cdot x + b_{trained}\)

These same ideas appear again in larger models: define a differentiable objective, compute gradients, and iteratively optimize parameters.

Continue the ML Series

This post is part of a larger Interactive Machine Learning series. If you would like to learn more, check out the other posts in the series.
