The Line of Best Fit
Ultimately, machine learning is about generating a statistical model of a given dataset, so that you can make predictions about where new data points are likely to be.
A useful place to start is regression, which is the process of finding a line of best fit with some data.
Let’s imagine a dataset that maps the relationship between hours of sleep and cups of coffee consumed:
| Hours of sleep | Cups of coffee |
|---|---|
| 4 | 4.2 |
| 5 | 3.5 |
| 5.5 | 3.1 |
| … | … |
Just by eyeballing the data, you can see a trend: hours of sleep go up as cups of coffee go down.
We can visualise this trend by plotting the data on a chart and drawing a line that passes as close to each data point as possible.
We can then use this line to predict new values that were not in the original data (e.g. 4.5 hours of sleep).
Have a play with the chart by hovering over the line to see the values it predicts.
The straight line formula
A straight line on a chart is described by y = mx + c, a formula you probably learned in school. It describes how any given values for x and y relate to each other, using:
- m (the slope of the line)
- c (the point at which the line intercepts the Y-axis).
Have a play with the sliders and see how m and c affect the line. You can
also change the formula to see why, for example, m**x - c or m/x + c don’t
work.
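As a minimal sketch of the formula in code (the helper name `line` and the sample values are assumptions for illustration, not taken from the interactive chart):

```javascript
// y = m * x + c, where m is the slope and c is the Y-intercept.
// `line` is a hypothetical helper name used only in this sketch.
const line = (m, c) => (x) => m * x + c

// Illustrative values: a downward slope of -1, crossing the Y-axis at 8
const exampleLine = line(-1, 8)

console.log(exampleLine(0)) // 8 (at x = 0, y is just the intercept c)
console.log(exampleLine(4)) // 4 (each step of 1 along x changes y by m)
```

Changing `m` tilts the line, while changing `c` slides it up or down without altering the tilt.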
Finding the best fit
Putting this together, if we can establish what m and c are, we can generate a straight line to make a line of best fit for our data. This will allow us to predict y for any possible x.
See if you can find the best m and c to get as close to the data as possible.
We have manually created our first linear model. The model (line of best fit) allows us to predict novel, unseen values based on data.
The Linear Model
The linear model is the LEGO brick of machine learning. Its job is to model the line of best fit - simple as that. We pre-load it with our best values for m and c, and then we can give it any x and it will spit out a y.
The code for a simple linear model might look like this:
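Here is a minimal sketch of one possible shape for it. The names `linearModel` and `predict` are borrowed from the training pseudocode later in this article; the internals and the example values are assumptions, with `weight` standing in for m and `bias` for c.

```javascript
// A sketch of a linear model: it stores a slope and an intercept
// and applies y = weight * x + bias on request.
const linearModel = (weight = 0, bias = 0) => ({
  weight, // the slope (m)
  bias, // the Y-intercept (c)
  predict(x) {
    return this.weight * x + this.bias
  },
})

// Illustrative values only, not fitted to the real dataset
const model = linearModel(-0.5, 6)
console.log(model.predict(4.5)) // 3.75
```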
Training the Linear Model
Obviously, we do not know the optimal values for the weight (slope) and bias (intercept) up front, so we need a way to move the line of best fit around programmatically until we arrive at the right values.
The approach we use to do this is called Linear Regression.
For this, we need the following:
- a model of the data that makes predictions
- a way to quantify how correct the predictions are (mentioned earlier: this is called the loss)
- a way to use this information to re-calibrate the model to make better predictions
If we have those three ingredients, we can run a loop to build our model of the data - see the pseudocode below:
// Create our model
const model = linearModel()
// Starting loss - the aim is to whittle this down
let loss = Infinity
// Set the loss that we will be happy with
const targetLoss = 0.01
// Set some sort of ceiling on the number of attempts
const maxIterations = 100
let i = 0
while(i < maxIterations && loss > targetLoss) {
  data.forEach(({ x: input, y: actual }) => {
    // Make a prediction for Y
    const prediction = model.predict(input)
    // Compare to the actual value of Y and figure out how
    // wrong we were (loss)
    loss = calculateLoss(prediction, actual)
    // The missing piece! Use the loss to adjust weights and biases
    model.adjust(loss)
  })
  i++
}
// Now we should have a model that makes accurate predictions
model.predict(someNewValue) // ✨ accurate prediction!
This process is referred to as training - we run the model, look at how wrong its prediction is (the loss), and tweak the weight and bias accordingly to reduce the loss. Rinse and repeat until a target loss is reached.
Only one piece is missing now: the technique used to adjust the model based on the loss, which is called gradient descent.
Gradient Descent
So, the task is as follows: we ran y = weight * x + bias, and we know the result
was off by a certain amount. We need a way to change weight and bias
and reduce the loss.
The first thing we need to do is figure out how much the weight and the bias
each contributed to the loss. Since one is multiplied by x and the
other is simply added, their contributions will be different.
Luckily, there is a branch of mathematics all about how tiny changes affect an output — calculus — and we can cherry pick a couple of simple principles that will help us.
Stepping back, let’s think about how the output changes when we nudge the weight by a small amount (we call this h):
// Some random starting values
let weight = 3, bias = 5
const predict = () => {
  // Let's hard-code x for now
  const x = 10
  return weight * x + bias
}
// 'h' is used to denote some notional tiny value
// (we use 1 here to keep the arithmetic easy to follow)
const h = 1
const predictionA = predict() // 3 * 10 + 5 = 35
// Now nudge weight:
weight += h // 3 + 1 = 4
// How does the output change?
const predictionB = predict() // 4 * 10 + 5 = 45
const change = predictionB - predictionA // 10
// The effect that h had on the change:
// it went up by 10 (the value of x)
console.log(change / h) // 10
So, when we change weight by a small amount h, the output changes by h * x.
This is a general principle: if we have A * B, any change to A will be
reflected in the output as change * B. The same rule applies to changes to
B: they are reflected in the output as change * A.
Let’s look at the bias:
let weight = 3, bias = 5
const predict = () => {
  const x = 10
  return weight * x + bias
}
const h = 1
const predictionA = predict() // 3 * 10 + 5 = 35
// Now nudge bias:
bias += h // 5 + 1 = 6
// How does the output change?
const predictionB = predict() // 3 * 10 + 6 = 36
const change = predictionB - predictionA // 1
// The effect that h had on the change:
// it went up by 1 (the value of h)
console.log(change / h) // 1
We changed the bias by 1, and the output went up by 1. Again, there is a general
principle at play here. For any addition A + B, if we change A by h, the output
goes up by h.
So far, so good. We now know that for our expression y = weight * x + bias:
- If we nudge the weight by a small amount (h), then y changes by h * x.
- If we nudge the bias by h, y changes by h.
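Both rules can be checked numerically with one small helper. This is only a sketch: `nudgeEffect` is a hypothetical name, and the starting values are the same illustrative ones used in the snippets above.

```javascript
// y = weight * x + bias, with the parameters passed in explicitly
const predict = ({ weight, bias }, x) => weight * x + bias

// Nudge one named parameter by h and report the change in output
// per unit of nudge. `nudgeEffect` is a hypothetical helper.
const nudgeEffect = (params, name, x, h = 1) => {
  const before = predict(params, x)
  const after = predict({ ...params, [name]: params[name] + h }, x)
  return (after - before) / h
}

const params = { weight: 3, bias: 5 }
console.log(nudgeEffect(params, 'weight', 10)) // 10 (the value of x)
console.log(nudgeEffect(params, 'bias', 10)) // 1
```

Try a different `x`: the effect of nudging the weight follows it, while the effect of nudging the bias stays at 1.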
In short, for both weight and bias, we can get a number that describes
how strongly a small change to each one moves the output, and therefore how much
each contributed to the error. These numbers are referred to as the
gradients.
We can then change weight and bias using these gradients to make the loss go
down. This is the missing piece we need to programmatically adjust our model.
Let’s visualise this with some code:
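What follows is a sketch rather than a definitive implementation. It trains on a single hard-coded data point and uses the squared error as the loss. The config keys (`stepSize`, `maxIterations`, `targetLoss`, `startingValuesMinMax`) are the names used in this article; the helper `randomBetween`, the data point and the config values are assumptions.

```javascript
// Pick a random starting value between min and max (hypothetical helper)
const randomBetween = ([min, max]) => min + Math.random() * (max - min)

const train = ({ stepSize, maxIterations, targetLoss, startingValuesMinMax }) => {
  // Hard-coded example point: we want predict(2) to land close to 5
  const x = 2
  const y = 5

  let weight = randomBetween(startingValuesMinMax)
  let bias = randomBetween(startingValuesMinMax)
  let loss = Infinity
  let i = 0

  while (i < maxIterations && loss > targetLoss) {
    const prediction = weight * x + bias
    const error = prediction - y
    // Squared error: always positive, and punishes big misses
    loss = error ** 2

    // loss = error^2, so a nudge to the error changes the loss by 2 * error.
    // A nudge to weight moves the prediction by x per unit,
    // a nudge to bias moves it by 1 per unit.
    const weightGradient = 2 * error * x
    const biasGradient = 2 * error * 1

    // Step "downhill", against the gradient
    weight -= stepSize * weightGradient
    bias -= stepSize * biasGradient
    i++
  }

  return { weight, bias, loss, iterations: i }
}

const result = train({
  stepSize: 0.05,
  maxIterations: 100,
  targetLoss: 0.01,
  startingValuesMinMax: [-1, 1],
})
console.log(result)
```

The returned `loss` is the value measured just before the final adjustment, so the model that comes back is slightly better than that number suggests.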
Hyperparameters
The train function above takes a config object containing the parameters for
our training. These options are referred to as hyperparameters, since the model
itself has parameters (weight and bias). The hyperparameters will affect
the accuracy of our model in a big way, but are not part of the model itself.
You will notice a few things if you play around with the hyperparameters:
Step size (learning rate)
We have been using the term stepSize for the size of the jumps we make each
iteration of the training loop, because it makes sense with the analogy of
incrementally stepping down the ‘loss’ mountain.
However, it’s more properly referred to as the learning rate.
If the learning rate is too small (try changing it to 0.001), each step is so tiny that we run out of iterations before reaching the valley floor, and so never hit our target loss.
On the other hand, chaos ensues if it is too large. Try changing it to a larger
number, say 1 or 10. You’ll notice the loss blowing up rather than shrinking, a
symptom often loosely described as ‘exploding gradients’.
Each step moves us rapidly towards the valley floor (due to the large rate of change),
but since our stepSize overshoots it, we end up bouncing back
and forth across the valley until we run out of iterations.
The effect will be more pronounced if you make the following edit to the while
loop:
- while(i < maxIterations && loss > targetLoss) {
+ while(i < maxIterations) {
We overshoot, then overcorrect, ad infinitum.
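To see the three regimes side by side, here is a small sketch that always runs the full number of iterations and reports the last loss it measured. The data point, starting values and the name `finalLoss` are assumptions chosen to keep the example deterministic.

```javascript
// Run gradient descent on one fixed point (x = 2, y = 5) for a set
// number of iterations and return the last loss that was measured.
const finalLoss = (stepSize, maxIterations = 100) => {
  const x = 2
  const y = 5
  let weight = 0
  let bias = 0
  let loss = Infinity

  for (let i = 0; i < maxIterations; i++) {
    const error = weight * x + bias - y
    loss = error ** 2
    weight -= stepSize * 2 * error * x
    bias -= stepSize * 2 * error
  }
  return loss
}

console.log(finalLoss(0.001)) // still well above 0.01: the steps are too small
console.log(finalLoss(0.05)) // effectively zero: converged
console.log(finalLoss(1)) // astronomically large: every step overshoots further
```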
Number of iterations (epochs)
We need a way to control our loop so that it doesn’t spin on forever, hence
maxIterations. In machine learning, each full pass over the training data is
known as an epoch, which in our case is one iteration of the training loop.
Obviously, if the learning rate is small, you will need more epochs to reach your target loss.
Training has significant computational and financial costs, since models can have billions of parameters, to say nothing of the sheer time that it can take. The number of epochs therefore needs to be chosen carefully.
Starting values
How we initialise the model parameters (weight and bias) massively affects
the outcome of training.
For reasons we will cover when we build bigger models, it’s important to be aware of the effect that initialising with zero has — it leads to something called the symmetry breaking problem, where all the gradients are the same.
On the other hand, we probably don’t want initial values that are too large. Try
setting startingValuesMinMax to [20,30], for example. We still converge, but
we waste epochs up front reducing them to more sensible values.
These are symptoms you will come to recognise, and are tied to our choices of values for
stepSize, maxIterations, and the starting values for the weights and biases.
There are ways to optimise your choices for these ‘hyperparameters’ (parameters that sit outside of the model itself), and getting a feel for how they affect training is a skill you learn through practice.
Generalisation
Above, our training loop used a fixed value for Y and X. This was fine for illustrating the idea of stepping down the ‘loss mountain’ towards a single optimal value.
In reality, however, we will need to be more thorough. How do we know that this X and Y value is representative of the dataset as a whole? It might be an outlier, and if we trained exclusively on this one value, it might throw our model off.
We need our model to generalise well - meaning the line of best fit accounts for all the data we have, rather than any single point, so that we can use it to predict new values.
Averaging things out protects us from a few things:
- you may have data points that are outliers, uncharacteristic, or which could skew the model
- the loss might be way off for one iteration or data point, and this could pull the gradients away from convergence
Averages in the data (normalisation)
If we take our data set and describe each data point in terms of the average, we will end up with a much more organised set of features. This is called normalisation.
There are several approaches to this, and the one we will use is called ‘mean-centering’.
We:
- calculate the average value for x (add them all up and divide by the number of them)
- then, when we want to use x, we just express it in terms of how far off the average it is: x - mean.
We can then set this mean internally in the model so that the same shift is
applied to any x we pass in after training.
// mean is set internally during training
const trainedModel = train(...)
// Then, when running the trained model,
// it's using y = weight * (someNewValue - mean) + bias
trainedModel.predict(someNewValue)
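The centering step itself is only a couple of lines. In this sketch, `mean`, `xMean` and `centred` are hypothetical names, and the inputs are the three visible rows of the table at the start of the article.

```javascript
// Average of a list of numbers
const mean = (values) => values.reduce((sum, v) => sum + v, 0) / values.length

// Hours of sleep from the three visible rows of the table
const hoursOfSleep = [4, 5, 5.5]
const xMean = mean(hoursOfSleep)

// Express each x as its distance from the average
const centred = hoursOfSleep.map((x) => x - xMean)

console.log(xMean) // the average hours of sleep
console.log(centred) // negative below the average, positive above it
```

After centering, the values add up to (roughly) zero, which is the property that makes them easier for the training loop to work with.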
Averages in the loss (Mean Squared Error)
We can hedge our bets with the loss, too, by using an average value across a
number of predictions. We also want a value that is always positive, since it
represents how ‘bad’ our model is. This means we can’t just use the error, since
error = prediction - y might be positive (overshot) or negative (undershot).
One way to do this is to use the Mean Squared Error mentioned above:
- Calculate each error: prediction - actual for each data point
- Square each error: This makes them positive and punishes larger errors
- Average them: Sum all squared errors and divide by the number of items
function meanSquaredError(predictions, actuals) {
  const squaredErrors = predictions.map((pred, i) =>
    Math.pow(pred - actuals[i], 2)
  )
  return squaredErrors.reduce((sum, err) => sum + err, 0) / predictions.length
}
Pulling it all together
We have covered a lot of ground. Let’s re-work our linear regression using:
- normalised data
- mean squared error for loss calculation
- setting the mean internally during training
I’ll also break the code out into different files to improve clarity.
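As a rough guide to how the pieces fit, here is a condensed single-file sketch. It is an approximation under stated assumptions: the function names, the fixed starting values and the hyperparameter values are illustrative, and the data is just the three visible rows of the table at the start of the article.

```javascript
const mean = (values) => values.reduce((sum, v) => sum + v, 0) / values.length

const meanSquaredError = (predictions, actuals) =>
  mean(predictions.map((pred, i) => (pred - actuals[i]) ** 2))

const train = (data, { stepSize, maxIterations, targetLoss }) => {
  // Normalise: express each x as its distance from the average x
  const xMean = mean(data.map(({ x }) => x))
  const centred = data.map(({ x, y }) => ({ x: x - xMean, y }))
  const actuals = centred.map(({ y }) => y)

  // Fixed starting values keep this sketch deterministic
  let weight = 0
  let bias = 0
  let loss = Infinity
  let epochs = 0

  while (epochs < maxIterations && loss > targetLoss) {
    const predictions = centred.map(({ x }) => weight * x + bias)
    loss = meanSquaredError(predictions, actuals)

    // Average the gradients over every data point
    const errors = predictions.map((pred, i) => pred - actuals[i])
    const weightGradient = mean(errors.map((err, i) => 2 * err * centred[i].x))
    const biasGradient = mean(errors.map((err) => 2 * err))

    weight -= stepSize * weightGradient
    bias -= stepSize * biasGradient
    epochs++
  }

  return {
    loss,
    epochs,
    // The stored mean is applied to any new x before predicting
    predict: (x) => weight * (x - xMean) + bias,
  }
}

// Hours of sleep (x) and cups of coffee (y)
const data = [
  { x: 4, y: 4.2 },
  { x: 5, y: 3.5 },
  { x: 5.5, y: 3.1 },
]

const trainedModel = train(data, {
  stepSize: 0.1,
  maxIterations: 1000,
  targetLoss: 0.01,
})
console.log(trainedModel.epochs, trainedModel.loss)
console.log(trainedModel.predict(4.5)) // cups of coffee after 4.5 hours of sleep
```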
Visualising the linear model
Finally, we can visualise how our linear model homes in on the line of best fit with the interactive component below.
Try changing the learning rate:
- higher: means more chance of oscillating or diverging. The line will wobble as we jump back and forth across the valley floor. Too large and it will overcorrect in a feedback loop until we run out of epochs.
- lower: means the model might never converge within the epochs available. You’ll see the line crawl slowly towards the best fit. This requires more epochs to complete, which might have implications for time, cost, computational resources and so on.
You can change the speed of the animation using the ‘slow | fast’ slider.
Next
Next, we will look at a different problem set: what do we do when the data is not a straight line?