Part 1: Linear Regression

The Line of Best Fit

Ultimately, machine learning is about generating a statistical model of a given dataset, so that you can make predictions about where new data points are likely to be.

A useful place to start is regression, which is the process of finding a line of best fit through some data.

Let’s imagine a dataset that maps the relationship between hours of sleep and cups of coffee consumed:

Hours of sleep	Cups of coffee
4	4.2
5	3.5
5.5	3.1
…	…

Just by eyeballing the data, you can see a trend: hours of sleep go up as cups of coffee go down.

We can visualise this trend by plotting the data on a chart and drawing a line that passes as close to each data point as possible.

We can then use this line to predict new values that were not in the original data (e.g. 4.5 hours of sleep).

Have a play with the chart by hovering over the line to see the values it predicts.

[Interactive chart: Hours of Sleep vs. Cups of Coffee. Fitted line: y = -0.76x + 7.24]

The straight line formula

A straight line on a chart is described by $y = mx + c$ , a formula you probably learned in school. It describes how any given values for $x$ and $y$ relate to each other, using:

$m$ (the slope of the line)
$c$ (the point at which the line intercepts the Y-axis).

Have a play with the sliders and see how $m$ and $c$ affect the line. You can also change the formula to see why, for example, m**x - c or m/x + c don’t work.

[Interactive sliders: m = 2.0, c = 1]

Finding the best fit

Putting this together, if we can establish what $m$ and $c$ are, we can generate a straight line to make a line of best fit for our data. This will allow us to predict $y$ for any possible $x$ .

See if you can find the best $m$ and $c$ to get as close to the data as possible.

[Interactive chart: Hours of Sleep vs. Cups of Coffee Consumed. Sliders: m (slope) = 0.00, c (intercept) = 2.17. Loss: 1.33. Button: Show best fit]

We have manually created our first linear model. The model (line of best fit) allows us to predict novel, unseen values based on data.

The Linear Model

The linear model is the LEGO brick of machine learning. Its job is to model the line of best fit - simple as that. We pre-load it with our best values for $m$ and $c$ , and then we can give it any $x$ and it will spit out a $y$ .

The code for a simple linear model might look like this:

function linearModel() {
  // Edit these to match your best m and c from 
  // the chart above
  let weight = -0.76, bias = 7.24

  return {
    predict(x) {
      return weight * x + bias
    },
  }
}

const model = linearModel()

// Make a prediction
const hoursSleep = 3
const prediction = model.predict(hoursSleep)

console.log(
  hoursSleep + " hours of sleep = " + 
  prediction + " cups of coffee"
)

Training the Linear Model

Obviously, we do not know the optimal values for the weight (slope) and bias (intercept) up front, so we need a way to move the line of best fit around programmatically until we arrive at the right values.

The approach we use to do this is called Linear Regression.

For this, we need the following:

a model of the data that makes predictions
a way to quantify how correct the predictions are (mentioned earlier: this is called the loss)
a way to use this information to re-calibrate the model to make better predictions

If we have those three ingredients, we can run a loop to build our model of the data - see the pseudocode below:

// Create our model
const model = linearModel()

// Starting loss - the aim is to whittle this down
let loss = Infinity
// Set the loss that we will be happy with
const targetLoss = 0.01
// Set some sort of ceiling on the number of attempts
const maxIterations = 100
let i = 0

while(i < maxIterations && loss > targetLoss) {
  data.forEach(({ x: input, y: actual}) => {
    // Make a prediction for Y
    const prediction = model.predict(input)
    // Compare to the actual value of Y and figure out how
    // wrong we were (loss)
    loss = calculateLoss(prediction, actual)

    // The missing piece! Use the loss to adjust weights and biases
    model.adjust(loss)
  })

  i ++
}

// Now we should have a model that makes accurate predictions
model.predict(someNewValue) // ✨ accurate prediction!

This process is referred to as training - we run the model, look at how wrong its prediction is (the loss), and tweak the weight and bias accordingly to reduce the loss. Rinse and repeat until a target loss is reached.
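
The pseudocode leaves calculateLoss undefined. One minimal way to fill it in, assuming we simply take the absolute difference between the prediction and the actual value (an illustrative choice, not necessarily the one used behind the charts above), might be:

```javascript
// Hypothetical helper: the pseudocode above does not define calculateLoss.
// Here we assume the loss for one prediction is its absolute error.
function calculateLoss(prediction, actual) {
  return Math.abs(prediction - actual)
}

// A prediction of 3.9 cups against an actual value of 4.2 cups
const exampleLoss = calculateLoss(3.9, 4.2)
console.log(exampleLoss.toFixed(2)) // "0.30"
```

Taking the absolute value means overshooting and undershooting by the same amount count as equally wrong.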

Only one piece is missing now: the technique used to adjust the model based on the loss, which is called gradient descent.

Gradient Descent

So, the task is as follows: we ran y = weight * x + bias, and we know the result was off by a certain amount. We need a way to change weight and bias and reduce the loss.

The first thing we need to do is figure out how much the weight and the bias each contributed to the loss. Obviously, since one was multiplying by $x$ and the other was an addition, their contributions will be different.

Luckily, there is a branch of mathematics all about how tiny changes affect an output — calculus — and we can cherry pick a couple of simple principles that will help us.

Stepping back, let’s think about how the output changes when we nudge the weight by a small amount (we call this $h$ ):

// Some random starting values
let weight = 3, bias = 5 

const predict = () => { 
  // Let's hard-code x for now
  const x = 10
  return weight * x + bias
}

// 'h' denotes a small nudge. We use 1 here to keep the arithmetic
// easy to follow; for a straight line the result is the same for any h
const h = 1

const predictionA = predict() // 3 * 10 + 5 = 35

// Now nudge w:
weight += h // 4

// How does the output change?
const predictionB = predict() // 4 * 10 + 5 = 45
const change = predictionB - predictionA // 10

// The effect that h had on the change:
// it went up by 10 (the value of x)
console.log( change / h ) // 10

So, when we change weight by a small amount h, the output changes by h * x.

This is a general principle: if we have A * B, any change to A will be reflected in the output as change * B. The same rule applies to changes to B: they are reflected in the output as change * A.

Let’s look at the bias:

let weight = 3, bias = 5 

const predict = () => { 
  const x = 10
  return weight * x + bias
}

const h = 1

const predictionA = predict() // 3 * 10 + 5 = 35

// Now nudge bias:
bias += h // 5 + 1 = 6

// How does the output change?
const predictionB = predict() // 3 * 10 + 6 = 36
const change = predictionB - predictionA // 1

// The effect that h had on the change:
// it went up by 1 (the value of h)
console.log(change / h) // 1

We changed the bias by 1, and the output went up by 1. Again, there is a general principle at play here. For any addition A + B, if we change A by h, the output goes up by h.

So far, so good. We now know that for our expression y = weight * x + bias:

If we nudge the weight by a small amount (h), then y changes by h * x.
If we nudge the bias, y changes by h.
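
We can check both rules numerically. The sketch below uses a hypothetical helper, rateOfChange (my name, not part of the original code), that nudges one input by a tiny h and reports how far the output moved per unit of nudge:

```javascript
// Hypothetical helper: estimate how far f's output moves per unit
// change in its input, by nudging the input by a tiny amount h
function rateOfChange(f, at, h = 1e-6) {
  return (f(at + h) - f(at)) / h
}

const x = 10
const weight = 3
const bias = 5

// Vary the weight while holding x and bias fixed
const weightRate = rateOfChange((w) => w * x + bias, weight)
// Vary the bias while holding x and weight fixed
const biasRate = rateOfChange((b) => weight * x + b, bias)

console.log(weightRate.toFixed(3)) // "10.000" (the value of x)
console.log(biasRate.toFixed(3))   // "1.000"
```

The results are rounded because floating-point arithmetic leaves a tiny residue when h is this small.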

In short, for both weight and bias, we can get a number that describes how sensitive the output is to each of them and, once scaled by the error, how much each contributed to it. These numbers are referred to as the gradients.

We can then change weight and bias using these gradients to make the loss go down. This is the missing piece we need to programmatically adjust our model.
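
To turn these rates of change into an update rule we also need to bring the error in. One common way to do that (an assumption here, since the text has not yet named a loss function) is to take half the squared error as the loss and apply the chain rule. With prediction $\hat{y} = wx + b$, where $w$ is the weight and $b$ is the bias, and a step size $\eta$:

```latex
\begin{aligned}
e &= \hat{y} - y, \qquad L = \tfrac{1}{2} e^2 \\
\frac{\partial L}{\partial w} &= e \cdot \frac{\partial \hat{y}}{\partial w} = e \cdot x \\
\frac{\partial L}{\partial b} &= e \cdot \frac{\partial \hat{y}}{\partial b} = e \cdot 1 = e \\
w &\leftarrow w - \eta \, e \, x, \qquad b \leftarrow b - \eta \, e
\end{aligned}
```

Under that assumption, the gradient for the weight is the error multiplied by $x$ and the gradient for the bias is just the error, which is the form the code in this section uses (error * x and error).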

Let’s visualise this with some code:

import { getRandomFloat } from './utils.js'

/**
* Our linear model, with a couple of tweaks:
* - an error function that returns a method for adjusting 
* the weights using gradient descent.
* - a 'parameters' function that returns the model 
* parameters (weight and bias).
* 
* The model also takes a 'startingValuesMinMax' tuple,
* so we can control the initial values of our parameters.
*/
function linearModel(startingValuesMinMax) {
  // Initialise the parameters
  let weight = getRandomFloat(...startingValuesMinMax)
  let bias = getRandomFloat(...startingValuesMinMax)
  
  return {
      predict(x) { 
          return weight * x + bias
      },
      // Returns the error and a method to adjust based on it
      error(x, y, prediction) {
          const error = prediction - y
  
          return {
              value: error,
              // A method for improving the weights to reduce the error
              adjust(stepSize) {
                  // Calculate the gradients
                  // for weight and bias
                  const gradW = error * x
                  const gradB = error

                  // Nudge weight and bias by stepSize
                  // according to their gradients 
                  weight = weight - stepSize * gradW
                  bias = bias - stepSize * gradB
              }
          }
      },
      // A method for inspecting the parameters
      parameters: () => ( { weight, bias } )
    }
}

/**
* A function to train our model
*/
function train(hyperParameters, getModel, data) {
  const { x, y } = data
  
  const { 
      stepSize, 
      targetLoss, 
      maxIterations,
      startingValuesMinMax,
  } = hyperParameters

  const model = getModel(startingValuesMinMax)

  let i = 0
  let error = Infinity
  
  // We use Math.abs() here because error can be positive or negative.
  // For now, we'll treat this absolute error as a simple 'loss' metric.
  while(i < maxIterations && Math.abs(error) > targetLoss) {
      let prediction = model.predict(x)
      const {value, adjust} = model.error(x, y, prediction)
       
      error = value

      console.log(
          {i, prediction, error}
      )
      
      adjust(stepSize)
      i ++
  }

  return model
}

// Let's just use one datapoint for now
const data = { x: 4, y: 4.2 }
// Some config for our training
const hyperParameters = {
  stepSize: 0.01,
  targetLoss: 0.01,
  maxIterations: 100,
  startingValuesMinMax: [1, 10]
}

const model = train(hyperParameters, linearModel, data)
console.log(
  'final parameters: ', model.parameters(),
  'Final prediction: y = ' + model.predict(data.x)
)
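
The listing imports getRandomFloat from ./utils.js, which is not shown. A minimal sketch of what it could look like, assuming it returns a uniformly distributed number between min and max:

```javascript
// Hypothetical contents of utils.js (not shown in the original listing).
// In the real file this function would need to be exported.
function getRandomFloat(min, max) {
  // Math.random() returns a float in [0, 1); scale and shift it
  return Math.random() * (max - min) + min
}

// Matches the call in linearModel: getRandomFloat(...startingValuesMinMax)
const startingValue = getRandomFloat(...[1, 10])
console.log(startingValue >= 1 && startingValue <= 10) // true
```

Because the starting values are random, two runs of the training loop will usually take a different number of iterations to reach the target loss.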

Hyperparameters

The train function above takes a config object containing the parameters for our training. These options are referred to as hyperparameters, since the model itself has parameters (weight and bias). The hyperparameters will affect the accuracy of our model in a big way, but are not part of the model itself.

You will notice a few things if you play around with the hyperparameters:

Step size (learning rate)

We have been using the term stepSize for the size of the jumps we make each iteration of the training loop, because it makes sense with the analogy of incrementally stepping down the ‘loss’ mountain.

However, it’s more properly referred to as the learning rate.

If the learning rate is too small (try changing it to 0.001), each step is so short that we run out of iterations before we reach the valley floor and our target loss.

On the other hand, chaos ensues if it is too large. Try changing it to a larger number, say 1 or 10. You’ll see something often called ‘exploding gradients’: every step overshoots the valley floor by more than the last, so the error flips sign and grows each iteration instead of shrinking, until we run out of iterations.

The effect will be more pronounced if you make the following edit to the while loop:

- while(i < maxIterations && Math.abs(error) > targetLoss) {
+ while(i < maxIterations) {

We overshoot, then overcorrect, ad infinitum.
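
To see all three behaviours side by side, here is a stripped-down sketch of the same update rule on the single data point from above. It is not the train function itself: the starting values are fixed at 5 (an arbitrary choice so that the run is repeatable) and the loop always runs for the full number of iterations.

```javascript
// Runs the single-data-point update rule for a fixed number of
// iterations and returns the absolute error at the end.
// Starting values are hard-coded (an assumption, for repeatability).
function finalAbsoluteError(stepSize, iterations = 100) {
  const x = 4
  const y = 4.2
  let weight = 5
  let bias = 5

  for (let i = 0; i < iterations; i++) {
    const error = weight * x + bias - y
    weight -= stepSize * error * x
    bias -= stepSize * error
  }

  return Math.abs(weight * x + bias - y)
}

for (const stepSize of [0.001, 0.01, 1]) {
  console.log({ stepSize, finalError: finalAbsoluteError(stepSize) })
}
```

With this data point, each update multiplies the error by 1 - stepSize * (x * x + 1), which is 1 - 17 * stepSize for x = 4. At 0.01 that factor is 0.83, so the error shrinks steadily; at 0.001 it is 0.983, so 100 iterations are not enough; at 1 it is -16, so the error changes sign and grows sixteen-fold on every step.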

Number of iterations (epochs)

We need a way to control our loop so that it doesn’t spin on forever, hence maxIterations. In machine learning, one full pass over the training data is known as an epoch; since we are training on a single data point, each iteration of our loop is one epoch.

Obviously, if the learning rate is small, you will need more epochs to reach your target loss.

There are significant computational and financial costs to training models, which can have billions of parameters, to say nothing of the sheer time that training can take, so the decision of how many epochs to use needs to be made carefully.

Starting values

How we initialise the model parameters (weight and bias) massively affects the outcome of training.

For reasons we will cover when we build bigger models, it’s important to be aware of the effect that initialising with zero has — it leads to something called the symmetry breaking problem, where all the gradients are the same.

On the other hand, we probably don’t want initial values that are too large. Try setting startingValuesMinMax to [20,30], for example. We still converge, but we waste epochs up front reducing them to more sensible values.
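
A rough way to measure the wasted epochs, using the same single data point and update rule, with both parameters started at the same fixed value rather than drawn at random (an assumption, so that the comparison is repeatable):

```javascript
// Counts how many iterations the single-data-point loop needs before
// the absolute error drops to the target. 'start' is used for both
// weight and bias (a simplification; the real model draws them randomly).
function iterationsToConverge(
  start,
  stepSize = 0.01,
  targetLoss = 0.01,
  maxIterations = 100
) {
  const x = 4
  const y = 4.2
  let weight = start
  let bias = start
  let error = Infinity
  let i = 0

  while (i < maxIterations && Math.abs(error) > targetLoss) {
    error = weight * x + bias - y
    weight -= stepSize * error * x
    bias -= stepSize * error
    i++
  }

  return i
}

console.log({
  startAt5: iterationsToConverge(5),
  startAt25: iterationsToConverge(25),
})
```

Both runs reach the target, but the one that starts further from sensible values needs more iterations to get there.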

These are symptoms you will come to recognise, and are tied to our choices of values for stepSize, maxIterations, and the starting values for the weights and biases.

There are ways to optimise your choices for these ‘hyperparameters’ (parameters that sit outside of the model itself), and getting a feel for how they affect training is a skill you learn through practice.

Generalisation

Above, our training loop used a fixed value for Y and X. This was fine for illustrating the idea of stepping down the ‘loss mountain’ towards a single optimal value.

In reality, however, we will need to be more thorough. How do we know that this pair of X and Y values is representative of the dataset as a whole? It might be an outlier, and if we trained exclusively on this one data point, it might throw our model off.

We need our model to generalise well - meaning the line of best fit accounts for all the data we have and allows us to use it to predict new values that fall within the average bounds.

Averaging things out protects us from a few things:

you may have data points that are outliers, uncharacteristic, or which could skew the model
the loss might be way off for one iteration or data point, and this could pull the gradients away from convergence

Averages in the data (normalisation)

If we take our data set and describe each data point in terms of the average, we will end up with a much more organised set of features. This is called normalisation.

There are several approaches to this, and the one we will use is called ‘mean-centering’.

We:

calculate the average value for x (add them all up and divide by the number of them)
then, when we want to use x, we just express it in terms of how far off the average it is: x - mean.

We can then set this mean internally in the model, so that after training the model subtracts it from any x it is given.

// mean is set internally during training
const trainedModel = train(...) 
// Then, when running the trained model, 
// it's using y = weight * (someNewValue - mean) + bias
trainedModel.predict(someNewValue)
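
As a small sketch of the two mean-centering steps, using only the three rows of the table that are shown (the remaining rows are elided, so these numbers are for illustration only):

```javascript
// The three visible rows of the sleep/coffee table
const hoursOfSleep = [4, 5, 5.5]

// Step 1: the average value of x
const mean =
  hoursOfSleep.reduce((sum, x) => sum + x, 0) / hoursOfSleep.length

// Step 2: express each x as its distance from the average
const centred = hoursOfSleep.map((x) => x - mean)

console.log(mean.toFixed(2)) // "4.83"
console.log(centred.map((x) => x.toFixed(2))) // [ '-0.83', '0.17', '0.67' ]
```

The centred values now sit on either side of zero, so the bias ends up describing the prediction for an average amount of sleep rather than for zero hours.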

Averages in the loss (Mean Squared Error)

We can hedge our bets with the loss, too, by using an average value across a number of predictions. We also want a value that is always positive, since it represents how ‘bad’ our model is. This means we can’t just use the error, since error = prediction - y might be positive (overshot) or negative (undershot).

One way to do this is to use the Mean Squared Error mentioned above:

Calculate each error: prediction - actual for each data point
Square each error: This makes them positive and punishes larger errors
Average them: Sum all squared errors and divide by the number of items

function meanSquaredError(predictions, actuals) {
  const squaredErrors = predictions.map((pred, i) => 
    Math.pow(pred - actuals[i], 2)
  )
  return squaredErrors.reduce((sum, err) => sum + err, 0) / predictions.length
}
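
As a quick check of the function, here it is applied to the three visible rows of the table and the line y = -0.76x + 7.24 from the first chart. The rest of the dataset is elided, so the result says nothing about the loss on the full data.

```javascript
// Same function as above, repeated so this snippet runs on its own
function meanSquaredError(predictions, actuals) {
  const squaredErrors = predictions.map((pred, i) =>
    Math.pow(pred - actuals[i], 2)
  )
  return squaredErrors.reduce((sum, err) => sum + err, 0) / predictions.length
}

// The three visible rows of the table
const hoursOfSleep = [4, 5, 5.5]
const cupsOfCoffee = [4.2, 3.5, 3.1]

// Predictions from the line y = -0.76x + 7.24
const predictions = hoursOfSleep.map((x) => -0.76 * x + 7.24)

// Errors are roughly 0, -0.06 and -0.04, so the squared errors are
// roughly 0, 0.0036 and 0.0016, and their mean is about 0.0017
const mse = meanSquaredError(predictions, cupsOfCoffee)
console.log(mse.toFixed(4)) // "0.0017"
```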

Pulling it all together

We have covered a lot of ground. Let’s re-work our linear regression using:

normalised data
mean squared error for loss calculation
setting the mean internally during training

I’ll also break the code out into different files to improve clarity.

import { linearModel } from './linear-model.js'
import {data} from './data.js'

function train(hyperParameters, getModel, data) {
  const { learningRate, targetLoss, maxEpochs, startingValuesMinMax } =
    hyperParameters;

  const model = getModel(startingValuesMinMax);

  // Bake mean into model for automatic normalization
  const mean = data.reduce((sum, d) => sum + d.x, 0) / data.length;
  model.setMean(mean);

  const inputs = data.map((d) => d.x);
  const actuals = data.map((d) => d.y);

  let epoch = 0;
  let loss = Infinity;

  while (epoch < maxEpochs && loss > targetLoss) {
    const predictions = inputs.map((x) => model.predict(x));
    const { value, adjust } = model.loss(inputs, predictions, actuals);
    loss = value;

    if (epoch % 50 === 0) {
      console.log({ epoch, loss: loss.toFixed(4) });
    }

    adjust(learningRate);
    epoch++;
  }

  console.log({ epoch, loss: loss.toFixed(4) });
  return model;
}

const hyperParameters = {
  learningRate: 0.01,
  targetLoss: 0.01,
  maxEpochs: 1000,
  startingValuesMinMax: [-1, 1],
};

const model = train(hyperParameters, linearModel, data);
const { weight, bias } = model.parameters();
console.log('Final parameters:', { weight, bias });
console.log(
  'Prediction for x=4: ' + 
  model.predict(4).toFixed(2) + 
  ' (actual: 4.2)'
);
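
The listing imports linearModel from ./linear-model.js and data from ./data.js, neither of which is shown. Below is one possible sketch of both, written only to satisfy the calls that train makes (setMean, predict, loss, parameters). The gradient formulas assume the loss is the mean squared error of weight * (x - mean) + bias, and the data contains only the three visible rows of the table. Treat it as an illustration rather than the original files.

```javascript
// Hypothetical contents of linear-model.js (would be exported there)
function linearModel([min, max]) {
  // Random starting values between min and max
  let weight = Math.random() * (max - min) + min
  let bias = Math.random() * (max - min) + min
  // Mean of x, set during training and reused for every prediction
  let mean = 0

  return {
    setMean(value) {
      mean = value
    },
    predict(x) {
      // Mean-centre x before applying the line
      return weight * (x - mean) + bias
    },
    // Returns the mean squared error and a method to adjust based on it
    loss(inputs, predictions, actuals) {
      const n = inputs.length
      const errors = predictions.map((pred, i) => pred - actuals[i])
      const value = errors.reduce((sum, e) => sum + e * e, 0) / n

      return {
        value,
        adjust(learningRate) {
          // Average the per-point gradients of the squared error
          let gradW = 0
          let gradB = 0
          errors.forEach((e, i) => {
            gradW += (2 / n) * e * (inputs[i] - mean)
            gradB += (2 / n) * e
          })
          weight -= learningRate * gradW
          bias -= learningRate * gradB
        },
      }
    },
    parameters: () => ({ weight, bias }),
  }
}

// Hypothetical contents of data.js: only the three visible table rows
const data = [
  { x: 4, y: 4.2 },
  { x: 5, y: 3.5 },
  { x: 5.5, y: 3.1 },
]
```

The factor of 2 / n comes from differentiating the mean of the squared errors; some implementations fold the 2 into the learning rate instead.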

Visualising the linear model

Finally, we can visualise how our linear model homes in on the line of best fit with the interactive component below.

Try changing the learning rate:

higher: more chance of the updates oscillating or diverging. The line will wobble as we jump back and forth across the valley floor. Too large and it will overcorrect in a feedback loop until we run out of epochs.
lower: the model might not converge before the epochs run out; you’ll see the line crawl slowly towards the best fit. Reaching it requires more epochs, which has implications for time, cost and computational resources.

You can change the speed of the animation using the ‘slow | fast’ slider.

[Interactive chart: Watch Gradient Descent Converge. Controls: learning rate (0.10), animation speed (slow | fast). Readouts: epoch, loss]

Next, we will look at a different problem set: what do we do when the data is not a straight line?