Part 1: Linear Regression

A hands-on introduction to the building blocks of neural networks

The Line of Best Fit

Ultimately, machine learning is about generating a statistical model of a given dataset, so that you can make predictions about where new data points are likely to be.

A useful place to start is regression, which is the process of finding a line of best fit for some data.

Let’s imagine a dataset that maps the relationship between hours of sleep and cups of coffee consumed:

Hours of sleep    Cups of coffee
4                 4.2
5                 3.5
5.5               3.1

Just by eyeballing the data, you can see a trend: hours of sleep go up as cups of coffee go down.

We can visualise this trend by plotting the data on a chart and drawing a line that passes as close to each data point as possible.

We can then use this line to predict new values that were not in the original data (e.g. 4.5 hours of sleep).

Have a play with the chart by hovering over the line to see the values it predicts.

[Interactive chart: Hours of Sleep vs. Cups of Coffee. Fitted line: y = -0.76x + 7.24]

The straight line formula

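A straight line on a chart is described by y = mx + c, a formula you probably learned in school. It describes how any given values for x and y relate to each other, using two numbers: m, the slope (how much y changes for each unit change in x), and c, the intercept (the value of y when x is 0).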

Have a play with the sliders and see how m and c affect the line. You can also change the formula to see why, for example, m**x - c or m/x + c don’t work.
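One way to see why the alternative formulas fail is to compare the gaps between outputs: on a straight line, equal steps in x always give equal steps in y. The sketch below uses arbitrary illustrative values for m and c (they are not taken from the chart), and the `steps` helper is made up for this example.

```javascript
// Illustrative values only, not the chart's fitted line
const m = 2
const c = 1
const xs = [1, 2, 3, 4]

// Differences between consecutive outputs of f across xs
const steps = (f) => xs.slice(1).map((x, i) => f(x) - f(xs[i]))

const lineSteps = steps((x) => m * x + c)     // [2, 2, 2]: constant, so a straight line
const powerSteps = steps((x) => m ** x - c)   // [2, 4, 8]: the gaps keep growing
const inverseSteps = steps((x) => m / x + c)  // negative gaps that shrink each time

console.log({ lineSteps, powerSteps, inverseSteps })
```

Only the first formula keeps the same gap between every pair of neighbouring points, which is exactly what the slope m measures.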

Finding the best fit

Putting this together, if we can establish what m and c are, we can generate a straight line to make a line of best fit for our data. This will allow us to predict y for any possible x.

See if you can find the best m and c to get as close to the data as possible.

[Interactive chart: Hours of Sleep vs. Cups of Coffee Consumed. Loss: 1.33]

We have manually created our first linear model. The model (line of best fit) allows us to predict novel, unseen values based on data.

The Linear Model

The linear model is the LEGO brick of machine learning. Its job is to model the line of best fit - simple as that. We pre-load it with our best values for m and c, and then we can give it any x and it will spit out a y.

The code for a simple linear model might look like this:

function linearModel() {
  // Edit these to match your best m and c from 
  // the chart above
  let weight = -0.76, bias = 7.31

  return {
    predict(x) {
      return weight * x + bias
    },
  }
}

const model = linearModel()

// Make a prediction
const hoursSleep = 3
const prediction = model.predict(hoursSleep)

console.log(
  hoursSleep + " hours of sleep = " + 
  prediction + " cups of coffee"
)

Training the Linear Model

Obviously, we do not know the optimal values for the weight (slope) and bias (intercept) up front, so we need a way to move the line of best fit around programmatically until we arrive at the right values.

The approach we use to do this is called Linear Regression.

For this, we need the following:

If we have those three ingredients, we can run a loop to build our model of the data - see the pseudocode below:

// Create our model
const model = linearModel()

// Starting loss - the aim is to whittle this down
let loss = Infinity
// Set the loss that we will be happy with
const targetLoss = 0.01
// Set some sort of ceiling on the number of attempts
const maxIterations = 100
let i = 0

while(i < maxIterations && loss > targetLoss) {
  data.forEach(({ x: input, y: actual}) => {
    // Make a prediction for Y
    const prediction = model.predict(input)
    // Compare to the actual value of Y and figure out how
    // wrong we were (loss)
    loss = calculateLoss(prediction, actual)

    // The missing piece! Use the loss to adjust weights and biases
    model.adjust(loss)
  })

  i ++
}

// Now we should have a model that makes accurate predictions
model.predict(someNewValue) // ✨ accurate prediction!

This process is referred to as training - we run the model, look at how wrong its prediction is (the loss), and tweak the weight and bias accordingly to reduce the loss. Rinse and repeat until a target loss is reached.

Only one missing piece now: The technique used to adjust the model based on the loss, which is called gradient descent.

Gradient Descent

So, the task is as follows: we ran y = weight * x + bias, and we know the result was off by a certain amount. We need a way to change weight and bias and reduce the loss.

The first thing we need to do is figure out how much the weight and the bias each contributed to the loss. Obviously, since one was multiplying by x and the other was an addition, their contributions will be different.

Luckily, there is a branch of mathematics all about how tiny changes affect an output — calculus — and we can cherry pick a couple of simple principles that will help us.

Stepping back, let’s think about how the output changes when we nudge the weight by a small amount (we call this h):

// Some random starting values
let weight = 3, bias = 5 

const predict = () => { 
  // Let's hard-code x for now
  const x = 10
  return weight * x + bias
}

// 'h' is used to denote some notional tiny value
const h = 1

const predictionA = predict() // 3 * 10 + 5 = 35

// Now nudge w:
weight += h // 4

// How does the output change?
const predictionB = predict() // 4 * 10 + 5 = 45
const change = predictionB - predictionA // 10

// The effect that h had on the change:
// it went up by 10 (the value of x)
console.log( change / h ) // 10

So, when we change weight by a small amount h, the output changes by h * x.

This is a general principle: if we have A * B, any change to A will be reflected in the output as change * B. The same rule applies to changes to B: they are reflected in the output as change * A.
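As a quick sanity check of this rule, here is a minimal sketch that nudges each side of a product in turn. The values of A, B and h are arbitrary and chosen only so the arithmetic is easy to follow.

```javascript
// Arbitrary illustrative values
const A = 3
const B = 10
const h = 0.5

// Nudge A by h and see how the product moves
const changeFromA = (A + h) * B - A * B // 5, which is h * B

// Nudge B by h and see how the product moves
const changeFromB = A * (B + h) - A * B // 1.5, which is h * A

console.log(changeFromA / h) // 10 (the value of B)
console.log(changeFromB / h) // 3 (the value of A)
```

Dividing the change by h leaves us with the other factor each time, which is why the weight's effect on the output depends on x.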

Let’s look at the bias:

let weight = 3, bias = 5 

const predict = () =>{ 
  const x = 10
  return weight * x + bias
}

const h = 1

const predictionA = predict() // 3 * 10 + 5 = 35

// Now nudge bias:
bias += h // 5 + 1 = 6

// How does the output change?
const predictionB = predict() // 3 * 10 + 6 = 36
const change = predictionB - predictionA // 1

// The effect that h had on the change:
// it went up by 1 (the value of h)
console.log(change / h) // 1

We changed the bias by 1, and the output went up by 1. Again, there is a general principle at play here. For any addition A + B, if we change A by h, the output goes up by h.

So far, so good. We now know that for our expression y = weight * x + bias, nudging weight by h changes the output by h * x, while nudging bias by h changes the output by h.

In short, for both weight and bias, we can get a number that describes exactly how much they contributed to the error. These numbers are referred to as the gradients.

We can then change weight and bias using these gradients to make the loss go down. This is the missing piece we need to programmatically adjust our model.
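To make this concrete, here is a hedged sketch that checks the two gradients numerically. It assumes the loss is half the squared error, 0.5 * (prediction - y) ** 2, which is one common choice for which the gradients work out to exactly error * x for the weight and error for the bias. The starting values, the step size and the `loss` helper are all made up for illustration.

```javascript
// One data point and some arbitrary starting parameters
const x = 4
const y = 4.2
const weight = 3
const bias = 5

// Assumed loss for this sketch: half the squared error
const loss = (w, b) => 0.5 * (w * x + b - y) ** 2

// Gradients from the two rules (multiplication and addition)
const error = weight * x + bias - y
const gradW = error * x
const gradB = error

// Numerical check: nudge each parameter by a tiny h in both directions
const h = 1e-6
const numericGradW = (loss(weight + h, bias) - loss(weight - h, bias)) / (2 * h)
const numericGradB = (loss(weight, bias + h) - loss(weight, bias - h)) / (2 * h)

console.log({ gradW, numericGradW, gradB, numericGradB })

// Take one small step against the gradients and compare the loss
const stepSize = 0.01
const lossBefore = loss(weight, bias)
const lossAfter = loss(weight - stepSize * gradW, bias - stepSize * gradB)

console.log({ lossBefore, lossAfter }) // lossAfter is smaller
```

The numerical estimates agree with the rule-based gradients, and moving each parameter a little in the opposite direction to its gradient makes the loss go down.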

Let’s visualise this with some code:

import { getRandomFloat } from './utils.js'

/**
* Our linear model, with a couple of tweaks:
* - an error function that returns a method for adjusting 
* the weights using gradient descent.
* - a 'parameters' function that returns the model 
* parameters (weight and bias).
* 
* The model also takes a 'startingValuesMinMax' tuple,
* so we can control the initial values of our parameters.
*/
function linearModel(startingValuesMinMax) {
  // Initialise the parameters
  let weight = getRandomFloat(...startingValuesMinMax)
  let bias = getRandomFloat(...startingValuesMinMax)
  
  return {
      predict(x) { 
          return weight * x + bias
      },
      // Returns the error and a method to adjust based on it
      error(x, y, prediction) {
          const error = prediction - y
  
          return {
              value: error,
              // A method for improving the weights to reduce the error
              adjust(stepSize) {
                  // Calculate the gradients
                  // for weight and bias
                  const gradW = error * x
                  const gradB = error

                  // Nudge weight and bias by stepSize
                  // according to their gradients 
                  weight = weight - stepSize * gradW
                  bias = bias - stepSize * gradB
              }
          }
      },
      // A method for inspecting the parameters
      parameters: () => ( { weight, bias } )
    }
}

/**
* A function to train our model
*/
function train(hyperParameters, getModel, data) {
  const { x, y } = data
  
  const { 
      stepSize, 
      targetLoss, 
      maxIterations,
      startingValuesMinMax,
  } = hyperParameters

  const model = getModel(startingValuesMinMax)

  let i = 0
  let error = Infinity
  
  // We use Math.abs() here because error can be positive or negative.
  // For now, we'll treat this absolute error as a simple 'loss' metric.
  while(i < maxIterations && Math.abs(error) > targetLoss) {
      let prediction = model.predict(x)
      const {value, adjust} = model.error(x, y, prediction)
       
      error = value

      console.log(
          {i, prediction, error}
      )
      
      adjust(stepSize)
      i ++
  }

  return model
}

// Let's just use one datapoint for now
const data = { x: 4, y: 4.2 }
// Some config for our training
const hyperParameters = {
  stepSize: 0.01,
  targetLoss: 0.01,
  maxIterations: 100,
  startingValuesMinMax: [1, 10]
}

const model = train(hyperParameters, linearModel, data)
console.log(
  'final parameters: ', model.parameters(),
  'Final prediction: y = ' + model.predict(data.x)
)
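The listing above imports getRandomFloat from './utils.js', which is not shown. A minimal stand-in might look like the sketch below; the implementation is an assumption, and the real helper may differ. To use it with the import above, it would also need to be exported from that file.

```javascript
// Hypothetical stand-in for the helper imported from './utils.js'.
// Returns a random float between min (inclusive) and max.
function getRandomFloat(min, max) {
  return Math.random() * (max - min) + min
}

// Same range as startingValuesMinMax in the hyperparameters above
const [min, max] = [1, 10]
const sample = getRandomFloat(min, max)

console.log(sample) // a different value on every run
```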

Hyperparameters

The train function above takes a config object containing the parameters for our training. These options are referred to as hyperparameters, since the model itself has parameters (weight and bias). The hyperparameters will affect the accuracy of our model in a big way, but are not part of the model itself.

You will notice a few things if you play around with the hyperparameters:

Step size (learning rate)

We have been using the term stepSize for the size of the jumps we make each iteration of the training loop, because it makes sense with the analogy of incrementally stepping down the ‘loss’ mountain.

However, it’s more properly referred to as the learning rate.

If the learning rate is too small (try changing it to 0.001), each step is so tiny that we run out of iterations before we get near the valley floor, so we never reach our target loss.

On the other hand, chaos ensues if it is too large. Try changing it to a larger number, say 1, or 10. You’ll notice we see something called ‘exploding gradients’. Each step overshoots the valley floor and lands even further up the opposite slope, so the error flips sign and grows with every iteration until we run out of iterations.

The effect will be more pronounced if you make the following edit to the while loop:

- while(i < maxIterations && Math.abs(error) > targetLoss) {
+ while(i < maxIterations) {

We overshoot, then overcorrect, ad infinitum.
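For the single-data-point loop above we can work out exactly what each step does to the error. Substituting the updated weight and bias into the prediction gives newError = error * (1 - stepSize * (x * x + 1)), so every iteration multiplies the error by the same factor. The sketch below replays the update rule with arbitrary starting parameters (the `simulate` and `factor` helpers are made up for this example) and compares three step sizes.

```javascript
// Same data point as the training code above
const x = 4
const y = 4.2

// The factor the error is multiplied by on every iteration
const factor = (stepSize) => 1 - stepSize * (x * x + 1)

// Replay the update rule from a fixed (arbitrary) starting point
function simulate(stepSize, iterations) {
  let weight = 3
  let bias = 5
  let error = weight * x + bias - y

  for (let i = 0; i < iterations; i++) {
    weight -= stepSize * error * x
    bias -= stepSize * error
    error = weight * x + bias - y
  }
  return error
}

for (const stepSize of [0.001, 0.01, 1]) {
  console.log({
    stepSize,
    factor: factor(stepSize), // roughly 0.983, 0.83 and -16
    errorAfter100: simulate(stepSize, 100),
  })
}
```

A factor just below 1 shrinks the error very slowly, a factor comfortably between 0 and 1 converges quickly, and a factor below -1 flips the sign and grows the error each time. In this particular setup that tipping point is a step size of 2 / 17, or roughly 0.118.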

Number of iterations (epochs)

We need a way to control our loop so that it doesn’t spin on forever, hence maxIterations. In machine learning, each full pass over the training data is known as an epoch; with our single data point, that is one iteration of the training loop.

Obviously, if the learning rate is small, you will need more epochs to reach your target loss.

There are significant computational and financial costs to training models, which can have billions of parameters, to say nothing of the sheer time that training can take, so the decision of how many epochs to use needs to be made carefully.

Starting values

How we initialise the model parameters (weight and bias) massively affects the outcome of training.

For reasons we will cover when we build bigger models, it’s important to be aware of the effect that initialising with zero has: it leads to something called the symmetry problem, where all the gradients are the same and nothing breaks the symmetry between parameters.

On the other hand, we probably don’t want initial values that are too large. Try setting startingValuesMinMax to [20,30], for example. We still converge, but we waste epochs up front reducing them to more sensible values.

These are symptoms you will come to recognise, and are tied to our choices of values for stepSize, maxIterations, and the starting values for the weights and biases.

There are ways to optimise your choices for these ‘hyperparameters’ (parameters that sit outside of the model itself), and getting a feel for how they affect training is a skill you learn through practice.

Generalisation

Above, our training loop used a fixed value for Y and X. This was fine for illustrating the idea of stepping down the ‘loss mountain’ towards a single optimal value.

In reality, however, we will need to be more thorough. How do we know that this X and Y value is representative of the dataset as a whole? It might be an outlier, and if we trained exclusively on this one value, it might throw our model off.

We need our model to generalise well - meaning the line of best fit accounts for all the data we have and allows us to use it to predict new values that fall within the average bounds.

Averaging things out protects us from a few things:

Averages in the data (normalisation)

If we take our data set and describe each data point in terms of the average, we will end up with a much more organised set of features. This is called normalisation.

There are several approaches to this, and the one we will use is called ‘mean-centering’.

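We calculate the mean of all the x values, then subtract it from each x, so that the data ends up centred on zero.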

We can then set this mean internally in the model so that it applies it to x after training.

// mean is set internally during training
const trainedModel = train(...) 
// Then, when running the trained model, 
// it's using y = weight * (someNewValue - mean) + bias
trainedModel.predict(someNewValue)
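As a small illustration of mean-centering, the sketch below applies it to the three rows from the sleep and coffee table. The variable names are made up for this example.

```javascript
// The three rows from the table at the start of the article
const data = [
  { x: 4, y: 4.2 },
  { x: 5, y: 3.5 },
  { x: 5.5, y: 3.1 },
]

// Mean of the inputs
const mean = data.reduce((sum, d) => sum + d.x, 0) / data.length

// Describe each x relative to the mean
const centred = data.map((d) => ({ x: d.x - mean, y: d.y }))

console.log(mean.toFixed(2)) // "4.83"
console.log(centred.map((d) => d.x.toFixed(2))) // [ '-0.83', '0.17', '0.67' ]
```

After centering, the x values add up to zero (give or take floating point rounding), so the bias ends up describing the typical y value rather than the y value at x = 0.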

Averages in the loss (Mean Squared Error)

We can hedge our bets with the loss, too, by using an average value across a number of predictions. We also want a value that is always positive, since it represents how ‘bad’ our model is. This means we can’t just use the error, since error = prediction - y might be positive (overshot) or negative (undershot).

One way to do this is to use the Mean Squared Error mentioned above:

  1. Calculate each error: prediction - actual for each data point
  2. Square each error: This makes them positive and punishes larger errors
  3. Average them: Sum all squared errors and divide by the number of items

function meanSquaredError(predictions, actuals) {
  const squaredErrors = predictions.map((pred, i) => 
    Math.pow(pred - actuals[i], 2)
  )
  return squaredErrors.reduce((sum, err) => sum + err, 0) / predictions.length
}
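Here is a short usage sketch, assuming the three data points from the table and the line y = -0.76x + 7.24 from the first chart. The flat comparison line at y = 3.6 is made up purely to show a worse fit. The function is repeated so the snippet runs on its own.

```javascript
function meanSquaredError(predictions, actuals) {
  const squaredErrors = predictions.map((pred, i) =>
    Math.pow(pred - actuals[i], 2)
  )
  return squaredErrors.reduce((sum, err) => sum + err, 0) / predictions.length
}

// Data from the table at the start of the article
const inputs = [4, 5, 5.5]
const actuals = [4.2, 3.5, 3.1]

// The fitted line from the first chart
const fitted = inputs.map((x) => -0.76 * x + 7.24)
// A deliberately poor model for comparison: a flat line at y = 3.6
const flat = inputs.map(() => 3.6)

const fittedLoss = meanSquaredError(fitted, actuals)
const flatLoss = meanSquaredError(flat, actuals)

console.log(fittedLoss.toFixed(4)) // "0.0017"
console.log(flatLoss.toFixed(4)) // "0.2067"
```

The fitted line misses each point by at most 0.06 cups, so its loss is tiny, while the flat line's larger misses are punished heavily by the squaring step.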

Pulling it all together

We have covered a lot of ground. Let’s re-work our linear regression using:

I’ll also break the code out into different files to improve clarity.

import { linearModel } from './linear-model.js'
import {data} from './data.js'

function train(hyperParameters, getModel, data) {
  const { learningRate, targetLoss, maxEpochs, startingValuesMinMax } =
    hyperParameters;

  const model = getModel(startingValuesMinMax);

  // Bake mean into model for automatic normalization
  const mean = data.reduce((sum, d) => sum + d.x, 0) / data.length;
  model.setMean(mean);

  const inputs = data.map((d) => d.x);
  const actuals = data.map((d) => d.y);

  let epoch = 0;
  let loss = Infinity;

  while (epoch < maxEpochs && loss > targetLoss) {
    const predictions = inputs.map((x) => model.predict(x));
    const { value, adjust } = model.loss(inputs, predictions, actuals);
    loss = value;

    if (epoch % 50 === 0) {
      console.log({ epoch, loss: loss.toFixed(4) });
    }

    adjust(learningRate);
    epoch++;
  }

  console.log({ epoch, loss: loss.toFixed(4) });
  return model;
}

const hyperParameters = {
  learningRate: 0.01,
  targetLoss: 0.01,
  maxEpochs: 1000,
  startingValuesMinMax: [-1, 1],
};

const model = train(hyperParameters, linearModel, data);
const { weight, bias } = model.parameters();
console.log('Final parameters: ', { weight, bias });
console.log(
  'Prediction for x=4: ' + 
  model.predict(4).toFixed(2) + 
  ' (actual: 4.2)'
);
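The training code imports linearModel from './linear-model.js', which is not shown. The sketch below is one way that file might look, and it is an assumption rather than the actual module: it supports the setMean, predict, loss and parameters calls used above, takes the gradients of the Mean Squared Error with respect to weight and bias, and uses Math.random for initialisation. To be imported as above, the function would also need to be exported. A short loop at the end exercises it on the three table rows.

```javascript
// Assumed shape of './linear-model.js' (a sketch, not the original file)
function linearModel([min, max]) {
  let weight = Math.random() * (max - min) + min
  let bias = Math.random() * (max - min) + min
  let mean = 0 // set during training for mean-centering

  return {
    setMean(value) {
      mean = value
    },
    predict(x) {
      // y = weight * (x - mean) + bias
      return weight * (x - mean) + bias
    },
    // Returns the Mean Squared Error and a method to adjust based on it
    loss(inputs, predictions, actuals) {
      const n = inputs.length
      const errors = predictions.map((pred, i) => pred - actuals[i])
      const value = errors.reduce((sum, e) => sum + e * e, 0) / n

      return {
        value,
        adjust(learningRate) {
          // Gradients of the MSE, averaged across all data points
          const gradW =
            (2 / n) * errors.reduce((sum, e, i) => sum + e * (inputs[i] - mean), 0)
          const gradB = (2 / n) * errors.reduce((sum, e) => sum + e, 0)

          weight -= learningRate * gradW
          bias -= learningRate * gradB
        },
      }
    },
    parameters: () => ({ weight, bias, mean }),
  }
}

// Quick check using the three rows from the table
const inputs = [4, 5, 5.5]
const actuals = [4.2, 3.5, 3.1]

const model = linearModel([-1, 1])
model.setMean(inputs.reduce((sum, x) => sum + x, 0) / inputs.length)

let loss = Infinity
for (let epoch = 0; epoch < 1000 && loss > 0.01; epoch++) {
  const predictions = inputs.map((x) => model.predict(x))
  const step = model.loss(inputs, predictions, actuals)
  loss = step.value
  step.adjust(0.01)
}

console.log({ loss, prediction: model.predict(4), ...model.parameters() })
```

Because the inputs are mean-centred, the bias settles near the average number of cups and the weight settles on a negative slope. The exact numbers vary from run to run with the random starting values.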

Visualising the linear model

Finally, we can visualise how our linear model homes in on the line of best fit with the interactive component below.

Try changing the learning rate:

You can change the speed of the animation using the ‘slow | fast’ slider.

[Interactive chart: Watch Gradient Descent Converge]

Next

Next, we will look at a different problem set: what do we do when the data is not a straight line?