Part 3: The Multi-Layer Perceptron

Initially, the Perceptron generated huge excitement, and Rosenblatt was given enormous funding to develop his model. This led to a surge in AI development.

However, in 1969, Marvin Minsky and Seymour Papert published a book, Perceptrons, that identified a critical limitation: perceptrons could only classify linearly separable data.

Ironically, Minsky and Papert actually discussed the architecture needed to overcome this shortcoming in the same book — the multi-layer perceptron. However, they were skeptical that it could ever be trained. Funding dried up, and this led to the first ‘AI Winter’.

In fact, the training method existed by the mid-1970s, but such was the pessimism in the field that it was overlooked for another decade or so.

Today, the multi-layer perceptron (MLP) is a basic building block used inside many modern AI models.

In this chapter, we will:

learn about the challenge of nonlinearity, and the architecture needed to overcome it
write and train an MLP that allows us to model non-linear data.

Nonlinearity

The perceptron was a classification model, which means it was able to find a decision boundary between classes of things (like the letters T and L).

You will remember that the decision boundary was linear, meaning we could describe it with a straight line, using $y=mx+c$.

But what happens if the boundary is not a straight line — what if it is non-linear?

The XOR problem

The canonical example of non-linear classification is the ‘XOR’ problem.

Let’s imagine two inputs, A and B, which are both binary. We want to output:

true if the inputs have different values
false if both inputs are the same

This is equivalent to the != operation in JavaScript.

Input A	Input B	Output
0	0	0
0	1	1
1	0	1
1	1	0

If we plot these on a chart, you’ll see there is no linear decision boundary that works to separate the two output classes (0 and 1) cleanly. Have a go:

[Interactive chart: drag the green handles to move the decision boundary across the four XOR points.]
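
If you’d rather check this in code, here is a minimal sketch that brute-forces a coarse grid of candidate straight lines and counts how many XOR points each one classifies correctly. The names (`xorData`, `countCorrect`) and the grid range are my own assumptions for illustration, not part of the model we build later.

```javascript
// XOR truth table: [a, b, expected]
const xorData = [
  [0, 0, 0],
  [0, 1, 1],
  [1, 0, 1],
  [1, 1, 0],
];

// A single linear boundary: class 1 if w1*a + w2*b + bias > 0
function countCorrect(w1, w2, bias) {
  return xorData.filter(
    ([a, b, expected]) => (w1 * a + w2 * b + bias > 0 ? 1 : 0) === expected
  ).length;
}

// Try a coarse grid of candidate lines and keep the best score
let best = 0;
for (let w1 = -2; w1 <= 2; w1 += 0.25) {
  for (let w2 = -2; w2 <= 2; w2 += 0.25) {
    for (let bias = -2; bias <= 2; bias += 0.25) {
      best = Math.max(best, countCorrect(w1, w2, bias));
    }
  }
}

console.log(best); // 3
```

The best any line manages is 3 out of 4. A grid search is not a proof, but it matches what the chart shows: one point always ends up on the wrong side.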

This is fairly abstract, but it’s important to understand that this problem extends to any dataset where the classes are not separated in a linear fashion. Here are some loose ‘real-world’ examples to visualise this.

[Figure: four scatter plots, each showing two classes. Coffee brewing (axes: brew time, temperature; good vs burnt/weak). Ripe avocados (axes: days owned, firmness; ripe vs not ready). Comfortable weather (axes: humidity, temperature; comfortable vs uncomfortable). Plant survival (axes: water, sunlight; thriving vs dead).]

It should be clear that a single straight line can’t separate many kinds of real-world data. But you can immediately see that it’s still possible to define a decision boundary:

a circle around the blue in the coffee and weather examples
two straight lines could be used to delineate a diagonal blue stripe in the avocado example
a semi-circular boundary separating the blue ‘half moon’ in the plant example.

So we need a way for our decision boundaries to curve.

Curved decision boundaries

A curve can be thought of mathematically as being composed of many straight lines, all at slightly different angles (tangents).

Straight lines at different angles form a curve

Extra lines give us the power to express complex boundaries. And, in fact, we only need one extra line to start finding solutions to the XOR problem:

Two lines create a stripe separating 1s from 0s

Two lines create quadrants separating 1s from 0s

So the question becomes: can we just add more perceptrons, and combine their outputs somehow?

Composing perceptrons

If we fed the output of one perceptron into another, forming one or more layers, how might this solve the problem?

Conceptually, one perceptron could learn to identify AND (both inputs are true), and the other could learn OR (at least one input is true), and we could compose these outputs:

function perceptron1(a, b) {
  return a || b  // learns OR
}

function perceptron2(a, b) {
  return a && b  // learns AND
}

function perceptron3(a, b) {
  // Composes both to learn XOR
  return perceptron1(a, b) && !perceptron2(a, b)
}

perceptron3(true, false) // true

Strictly speaking, this might not actually be exactly what the inner perceptrons end up learning, but the hidden layer will learn something that works.
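
To make the composition concrete with actual weights rather than boolean operators, here is a small sketch using step-activated perceptrons. The weights and biases are hand-picked by me, not learned, and `fixedPerceptron` is a hypothetical helper for this example only.

```javascript
// A step-activated perceptron with fixed, hand-picked parameters
function fixedPerceptron(weights, bias) {
  return (inputs) => {
    const sum = inputs.reduce((acc, x, i) => acc + weights[i] * x, bias);
    return sum > 0 ? 1 : 0;
  };
}

const orUnit = fixedPerceptron([1, 1], -0.5);  // fires if at least one input is 1
const andUnit = fixedPerceptron([1, 1], -1.5); // fires only if both inputs are 1
// Output unit: fires when OR is on and AND is off
const outputUnit = fixedPerceptron([1, -1], -0.5);

function xor(a, b) {
  return outputUnit([orUnit([a, b]), andUnit([a, b])]);
}

const results = [[0, 0], [0, 1], [1, 0], [1, 1]].map(([a, b]) => xor(a, b));
console.log(results); // [ 0, 1, 1, 0 ]
```

This is only one of many sets of weights that work, which is why a trained network may settle on something quite different.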

Symmetry breaking (random initialisation)

In the last chapter, we noted that if a model’s initialisation is deterministic, we will get the same result (i.e. the same weights) every time.

So there is a real risk when we start to stack perceptrons that they will all arrive at the same decision boundary, rendering the exercise useless.

The answer is to initialise each one with random starting weights — in this way they will each find their own unique decision boundary.

So it’s important to note that when you layer perceptrons, they all find an answer, not the answer. We have no guarantee that one will learn AND and the other will learn OR, just that they will learn something that works.

The take-home is that we need to bake indeterminism into the model from the start, and ensure that all the parameters are initialised to random numbers.
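
Here is a rough sketch of the problem, using the perceptron learning rule from the previous chapter. Two perceptrons that start from the same parameters and see the same data end up as exact copies of each other. The article uses `getRandomFloat` without showing it, so the version below is an assumption, as are `makePerceptron` and `trainOnOr`.

```javascript
// Assumed helper: returns a float in the range [-1, 1)
function getRandomFloat() {
  return Math.random() * 2 - 1;
}

// A tiny perceptron that takes its starting parameters as arguments
function makePerceptron(startWeights, startBias) {
  let weights = [...startWeights];
  let bias = startBias;
  return {
    predict(x) {
      const sum = x.reduce((acc, xi, i) => acc + weights[i] * xi, bias);
      return sum > 0 ? 1 : 0;
    },
    // The perceptron learning rule
    update(x, delta) {
      weights = weights.map((w, i) => w + delta * x[i]);
      bias += delta;
    },
    params() {
      return [...weights, bias];
    },
  };
}

const orData = [
  { x: [0, 0], actual: 0 },
  { x: [0, 1], actual: 1 },
  { x: [1, 0], actual: 1 },
  { x: [1, 1], actual: 1 },
];

function trainOnOr(model) {
  for (let epoch = 0; epoch < 10; epoch++) {
    for (const { x, actual } of orData) {
      const predicted = model.predict(x);
      if (predicted !== actual) model.update(x, actual - predicted);
    }
  }
  return model.params();
}

// Same starting point, same data: the two end up identical
const twinA = trainOnOr(makePerceptron([0, 0], 0));
const twinB = trainOnOr(makePerceptron([0, 0], 0));
console.log(twinA, twinB);

// Random starting points: each one lands on its own parameters
const randomStart = () =>
  makePerceptron([getRandomFloat(), getRandomFloat()], getRandomFloat());
console.log(trainOnOr(randomStart()), trainOnOr(randomStart()));
```

The randomly initialised pair will almost certainly finish with different weights, which is exactly the variety a hidden layer needs.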

A sketch of our model structure might look a bit like this:

function perceptron(dimensions) {
  // Code from previous chapter, making 
  // sure we initialise with random weights and bias
  let weights = Array.from({ length: dimensions }, getRandomFloat)
  let bias = getRandomFloat();

  return {
    predict(x) {
     // ...
    }
    //...
  }
}

const input = [true, false]
const dimensions = input.length

// The hidden layer
const hidden1 = perceptron(dimensions)
const hidden2 = perceptron(dimensions)
// The outer layer
const mlp = perceptron(dimensions)
 
const prediction = mlp.predict([
  hidden1.predict(input),
  hidden2.predict(input),
])

We now have the skeleton of the MLP architecture. However, we now need to understand why it was so challenging to train.

Activation functions revisited

In part two, we discussed activation functions: functions that take the output of a perceptron and make it useful in some way.

The original perceptron used the step function to convert the dot product into a binary output:

//...Inside our perceptron from part 2
    predict(x) {
      let dotProduct = 0;
      for (let i = 0; i < dimensions; i++) {
        dotProduct += weights[i] * x[i];
      }
      // Step activation
      return (dotProduct + bias) > 0 ? 1 : 0;
    },
//...

You’ll recall that the activation function, $\sigma$, wraps the straight line formula, $y=mx+c$, to give us a description of the model:

$\hat{y} = \sigma\left(w \cdot x + b\right)$

Activation functions also serve another important purpose when you start to layer linear functions: they stop them collapsing.

Collapsing linear functions

Let’s look at what actually happens when you try to compose linear functions like this.

Use the component below to add as many linear $y=mx+c$ layers as you like. The output of each is fed into the next, as with our MLP.

[Interactive: a stack of linear layers, each of the form y = mx + c. Example result with 2 layers: y = 1.00x + 3.50, still just a straight line.]

If you compose linear functions, they collapse into a single linear function. However many you add, you arrive back at $y=mx+c$ .

By the same token, with our perceptron in its current form, we can stack as many as we like, but we will never get a curve.
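
The algebra behind this is short: feeding $y = m_1x + c_1$ into $y = m_2x + c_2$ gives $m_2(m_1x + c_1) + c_2 = (m_2m_1)x + (m_2c_1 + c_2)$, which is just another straight line. As a quick sketch in code (the `{ m, c }` layer shape and the function names are assumptions for this example):

```javascript
// Each layer is { m, c }, representing y = m * x + c

// Run x through every layer in turn
function applyLayers(layers, x) {
  return layers.reduce((value, { m, c }) => m * value + c, x);
}

// Collapse a stack of layers into one equivalent { m, c }
function collapse(layers) {
  return layers.reduce(
    (acc, { m, c }) => ({ m: m * acc.m, c: m * acc.c + c }),
    { m: 1, c: 0 }
  );
}

const layers = [{ m: 2, c: 1 }, { m: 0.5, c: 3 }];
const single = collapse(layers);

console.log(single); // { m: 1, c: 3.5 }
console.log(applyLayers(layers, 4), single.m * 4 + single.c); // 7.5 7.5
```

However many layers go into `collapse`, a single slope and a single intercept come out.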

Now, see what happens when you use an activation function like the step function to wrap the linear equation:

[Interactive: a stack of step-wrapped lines, each of the form step(mx + c). Example result with 1 step: up to 2 discrete levels in the output.]

We are now able to compose linear functions in a way that doesn’t collapse, and we have a way to model curves.

We can now, in principle, model any data if we have enough perceptrons — our decision boundary will shift and contort as needed to approximate the boundary.
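
As a small illustration, here is a sketch that adds together three step-wrapped lines. Each one switches on at a different point, so the sum is a staircase rather than a single straight line. The thresholds are arbitrary values I picked for the example.

```javascript
function step(x) {
  return x > 0 ? 1 : 0;
}

// Three step units, each wrapping its own line (x - 1, x - 2, x - 3)
function staircase(x) {
  return step(x - 1) + step(x - 2) + step(x - 3);
}

const levels = [0, 1.5, 2.5, 4].map(staircase);
console.log(levels); // [ 0, 1, 2, 3 ]
```

With more units and finer spacing, the stairs get smaller and the staircase follows a curve more closely.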

So, we can see that the architecture of the perceptron supports layers, and that a multi-layer perceptron has everything it needs to model non-linearity. This was understood at the time.

What wasn’t understood, however, was how to train it.

Training the MLP

The perceptron was trained using the following learning rule:

function perceptron(dimensions) {
  let weights = ...
  let bias = ...
  // ...
  return {
      predict(x) { 
        // ...
      },
      // The perceptron learning rule:
      update(x, delta) {
        weights = weights.map(( w, i ) => 
          w + delta * x[i]
        )
        bias += delta;
      },
  }
}

// To train:
const predicted = model.predict(x);

if (predicted !== actual) {
  const delta = actual - predicted; // +1 or -1
  model.update(x, delta);
}

If the output is wrong, we nudge the weights in the direction of the error — up if we undershot, down if we overshot.
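
For reference, here is a complete, runnable sketch of that rule training a single perceptron on AND, which is linearly separable. I’ve used zero initialisation so the run is reproducible; that choice and the `andData` array are assumptions of this example.

```javascript
function perceptron(dimensions) {
  // Zero initialisation keeps this single-perceptron sketch deterministic
  let weights = new Array(dimensions).fill(0);
  let bias = 0;

  return {
    predict(x) {
      let dotProduct = 0;
      for (let i = 0; i < dimensions; i++) {
        dotProduct += weights[i] * x[i];
      }
      return dotProduct + bias > 0 ? 1 : 0;
    },
    // The perceptron learning rule
    update(x, delta) {
      weights = weights.map((w, i) => w + delta * x[i]);
      bias += delta;
    },
  };
}

const andData = [
  { x: [0, 0], actual: 0 },
  { x: [0, 1], actual: 0 },
  { x: [1, 0], actual: 0 },
  { x: [1, 1], actual: 1 },
];

const model = perceptron(2);

for (let epoch = 0; epoch < 20; epoch++) {
  for (const { x, actual } of andData) {
    const predicted = model.predict(x);
    if (predicted !== actual) {
      model.update(x, actual - predicted); // delta is +1 or -1
    }
  }
}

const learned = andData.map(({ x }) => model.predict(x));
console.log(learned); // [ 0, 0, 0, 1 ]
```

This works because there is one perceptron and one error signal. The trouble starts when we stack them.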

One issue is that this is a very blunt tool, and it doesn’t scale. Do we nudge every weight and bias in the network up or down at once?

The second issue is that we only have one error signal, which is the accuracy of the final prediction. But there is an inner game of pass-the-parcel we cannot see, with the outputs of the hidden perceptrons being the inputs for the outer one.

We have no idea how accurate the inner layer was, so have no error signal for those perceptrons to train them with.

const { input, expected } = data

const hidden1 = perceptron(2)
const hidden2 = perceptron(2)
const outer = perceptron(2)

const predicted = outer.predict([
  // Invisible calculations:
  hidden1.predict(input), 
  hidden2.predict(input),
])

// We can figure out the error for the outer
// layer because we have access to its output:
const error = predicted - expected
// ...But what is the error for hidden1 and hidden2?

It’s therefore hard to understand which weights / biases need adjusting and by how much. This is called the credit assignment problem.

It’s exacerbated by the step activation — even if we know that, for example, weight 1 in the second hidden perceptron is too low and needs nudging up, we don’t know if that small increase is going to have enough of an effect to result in the step function changing its output. Or, it might have the unintended effect of nudging a step function later in the calculation from 0 to 1 and breaking everything.

Solving credit assignment with backpropagation

Let’s park the step activation for the moment, and solve the credit assignment problem first.

As discussed, this breaks down into two parts:

We don’t have an error signal for each perceptron in the calculation — their output is the input for the outer layer, and we can’t see it. We don’t know how wrong they were, so we can’t improve them.
Within each perceptron, we need to figure out how each weight and the bias contributed to the error.

You may have already guessed this, but we have a tool we can use for the second part: the chain rule.

The chain rule gives us the ability to take each perceptron in the MLP and, if we know its inputs and outputs, evaluate how its weights and bias affected things.

// Assuming we know the input as well as the error,
// we can understand how each weight contributed.
function update(input, error) {
  // For y = weight * x + bias, nudging a weight
  // by h changes the output by h*x. This means: 
  //   ∂weight₁ = input₁ * error
  //   ∂weight₂ = input₂ * error
  // ...etc.
  const gradWs = input.map(x => x * error)
  // The ∂bias is just the error: nudging the bias by h
  // changes the output by h
  const gradB = error

  // Apply the gradients to the weights and bias:
  weights = weights.map((w, i) => w - gradWs[i])
  bias = bias - gradB
}
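
If you want to convince yourself that these formulas are right, one option is to compare them against gradients measured by nudging each parameter. The sketch below assumes a loss of half the squared error, so that its derivative is exactly `error` with no stray factor of 2. The input values, `loss` and `h` are all made up for the example.

```javascript
const input = [0.5, -2];
const weights = [0.3, 0.8];
const bias = 0.1;
const expected = 1;

// Loss as a function of the parameters: half the squared error (an assumption)
function loss(ws, b) {
  const output = ws.reduce((acc, w, i) => acc + w * input[i], b);
  const error = output - expected;
  return 0.5 * error * error;
}

// Analytic gradients, using the same formulas as update()
const output = weights.reduce((acc, w, i) => acc + w * input[i], bias);
const error = output - expected;
const gradWs = input.map(x => x * error);
const gradB = error;

// Numerical gradients: nudge each parameter by h and measure the change in loss
const h = 1e-6;
const numericWs = weights.map((_, i) => {
  const up = weights.map((w, j) => (j === i ? w + h : w));
  const down = weights.map((w, j) => (j === i ? w - h : w));
  return (loss(up, bias) - loss(down, bias)) / (2 * h);
});
const numericB = (loss(weights, bias + h) - loss(weights, bias - h)) / (2 * h);

console.log({ gradWs, numericWs });
console.log({ gradB, numericB });
```

The two sets of numbers should agree to several decimal places. Any remaining difference is floating-point noise.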

This gets interesting when there are a number of $w \cdot x + b$ operations feeding into each other, with the output of one being the input for the next. The error for each one then has to be calculated from the input gradients of the one it feeds into.

This means we need to ‘follow the blame backwards’. We trace the error back through the MLP, one perceptron at a time, working out how each weight contributed.

This is called backpropagation.

Our update method needs to calculate how the input affected the error, and return those gradients so that the previous perceptron can use them as its error argument:

function update(input, error) {
  // Getting the gradients for the inputs (the errors for
  // the PREVIOUS perceptron):
  //   ∂input₁ = weight₁ * error
  //   ∂input₂ = weight₂ * error...
  const gradInputs = weights.map(w => w * error)

  const gradWs = input.map(x => x * error)
  const gradB = error

  weights = weights.map((w, i) => w - gradWs[i])
  bias = bias - gradB

  // Return the input errors so they can be passed upstream
  return gradInputs
}

We can return this update along with the prediction, enabling us to cache the inputs. Then, we can work our way backwards, updating each perceptron in turn.

// Move the 'update' function _into_ the predict function, so it 
// can access the closure. Then return it along with the prediction
// as a callback.
function predict(input) {
  // Calculate prediction using σ(w⋅x+b)...
  return {
    prediction,
    update (error) {
      /* 
      As above, but now we get access to the input for a given 
      prediction via the closure.
      */
    }
  }
}

All we need to do now to backpropagate is to loop back over each perceptron in the network and call update using the input gradients returned by the previous one.

const { input, expected } = data

const hidden1 = perceptron(2)
const hidden2 = perceptron(2)
const output = perceptron(2)

const h1Output = hidden1.predict(input)
const h2Output = hidden2.predict(input)

const { prediction, update } = output.predict([
  h1Output.prediction,
  h2Output.prediction,
])

const error = prediction - expected

// Update outer layer, grab errors for hidden layer.
// `update` is the callback returned by output.predict above.
const [
  hidden1Error,
  hidden2Error
] = update(error)

// hidden layer updates
h1Output.update(hidden1Error)
h2Output.update(hidden2Error)

We can now step back through the calculation, and — like a detective following footprints — figure out how each part contributed to the final output.
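
Here is a runnable sketch of one full backpropagation step, still without an activation function. The starting weights are fixed so the numbers are reproducible. `linearUnit`, `forward` and the `learningRate` argument are assumptions of this sketch rather than the article’s final code.

```javascript
// A linear unit (no activation yet) with fixed starting parameters
function linearUnit(initialWeights, initialBias) {
  let weights = [...initialWeights];
  let bias = initialBias;

  return {
    predict(input) {
      const prediction = input.reduce((acc, x, i) => acc + weights[i] * x, bias);
      return {
        prediction,
        update(error, learningRate) {
          // Errors for the units that produced `input`,
          // computed before the weights change
          const gradInputs = weights.map(w => w * error);
          weights = weights.map((w, i) => w - learningRate * input[i] * error);
          bias -= learningRate * error;
          return gradInputs;
        },
      };
    },
  };
}

const hidden = [linearUnit([0.5, -0.5], 0), linearUnit([0.25, 0.75], 0)];
const output = linearUnit([1, 1], 0);

function forward(input) {
  const hiddenOutputs = hidden.map(unit => unit.predict(input));
  const out = output.predict(hiddenOutputs.map(h => h.prediction));
  return { hiddenOutputs, out };
}

const input = [1, 0];
const expected = 1;

const before = forward(input);
const errorBefore = before.out.prediction - expected;

// Backpropagate: output unit first, then each hidden unit
const hiddenErrors = before.out.update(errorBefore, 0.1);
hiddenErrors.forEach((e, i) => before.hiddenOutputs[i].update(e, 0.1));

const errorAfter = forward(input).out.prediction - expected;
console.log(errorBefore, errorAfter);
```

After a single step the error for this sample is closer to zero, because every unit has been nudged in the direction its own share of the blame suggested.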

This solves the credit assignment problem — on to fixing the step activation.

Solving the step activation

We’ve used backpropagation to solve the credit assignment problem, but in order to do that, we ignored the activation function. We need to add that back in.

However, the step function gives us a real headache when we try to use gradients. Recall the graph of the step function — the lines that make it up are either flat or vertical.

Step: returns 1 if positive, else 0

When it comes to figuring out the gradients (which are all about how the output changes as tinier and tinier nudges, h, are applied to the input), we see that when the input is below zero, h has no effect:

let x = -0.001
step(x) // 0

x = -0.0001
step(x) // 0

x = -0.00001
step(x) // 0

x = -0.000001
step(x) // 0

The rate of change — the derivative — is 0.

Similarly, when x > 0, you remain on 1 with no rate of change.

Something bizarre happens when x === 0, though: we get the vertical line in the chart, meaning the rate of change explodes:

// At the boundary:
let x = 0
const h = 0.0000001
step(x + h) // 1

// Rate of change (change in output) / (change in input)
const gradient = 1 / h // 10,000,000 !

As the nudge tends to zero, the rate of change becomes impossible to calculate. Even without the mathematics, you can see, visually, from the graph that there is no gradient in a step.

We need a different activation function.

ReLU

There are many activation functions that have been used since the perceptron, and I’m going to skip straight to one that is widely used in modern models, the Rectified Linear Unit (ReLU):

function relu(x) {
  if(x > 0) {
    return x
  }
  return 0
}

It’s very simple, and its output looks like this:

ReLU: returns x if positive, else 0

It’s a tiny change from the step function:

 if(x > 0) {
-  return 1
+  return x
 }
 return 0

…however, it allows us to compute a gradient:

const h = 0.001
let x = 2
// When x > 0, output tracks input:
relu(x)     // 2
relu(x + h) // 2.001
// Gradient = 0.001 / 0.001 = 1

x = -2
// When x ≤ 0, output is stuck at 0:
relu(x)     // 0
relu(x + h) // input is -1.999, outputs 0
// Gradient = 0

It is also better at curves: when you use ReLU to wrap linear functions, you get smoother ‘kinks’ than with the step activation:

[Interactive: a stack of ReLU-wrapped lines, each of the form ReLU(mx + c). Example result with 1 ReLU: up to 1 kink in the output.]
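
To see how kinks combine, here is a sketch that adds three ReLU-wrapped lines using output weights of 1, -2 and 1. The result is a small ‘hat’ shape that rises, falls and then flattens out. The offsets and weights are values I chose for the example.

```javascript
function relu(x) {
  return x > 0 ? x : 0;
}

// Three ReLU units combined with output weights 1, -2 and 1
function hat(x) {
  return relu(x) - 2 * relu(x - 1) + relu(x - 2);
}

const shape = [-1, 0, 0.5, 1, 1.5, 2, 3].map(hat);
console.log(shape); // [ 0, 0, 0.5, 1, 0.5, 0, 0 ]
```

Each ReLU contributes one kink, at x = 0, x = 1 and x = 2. Place enough of these bumps side by side and you can trace out a curve.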

Perceptrons vs Neurons

We have been using the term ‘perceptron’ because that is the historical name for the model we built — which has a step activation function. The more general term for a linear function with a non-linear activation is a neuron, so technically, now we are shifting away from the step function, we are no longer dealing with perceptrons, but neurons.

However, the term ‘Multi-layer Perceptron’ (MLP) was the one used to first describe this approach, and the model you get when you compose neurons in this way is still referred to as an MLP when it is discussed in the historical context (as we are doing).

The MLP is used as a building block in many modern models, but there its layers usually go by other names, such as linear layer, dense layer and fully-connected layer. The take-home is that these terms all describe the same idea.

For this article, we will now switch to the term neuron instead of perceptron (since, strictly speaking, that’s what it just became when we dropped the step function), but retain the term MLP since we are operating in the historical context.

Let’s replace the step function with ReLU, and rename our perceptron accordingly:

- function perceptron(dimensions) {
+ function neuron(dimensions) {
  let weights = Array.from({length: dimensions}, () => getRandomFloat())
  let bias = getRandomFloat()

+  // ReLU activation
+  function activate (x) {
+    return x > 0 ? x : 0
+  }

  return {
    predict(input) {
      let dotProduct = 0;
      for (let i = 0; i < input.length; i++) {
        dotProduct += weights[i] * input[i];
      }
      const weightedSum = dotProduct + bias
-     const prediction = weightedSum > 0 ? 1: 0
+     const prediction = activate(weightedSum)

We also need to factor in the ReLU when calculating the gradients in the update function.

update(error) {
  // Need to check if ReLU was active
  const reluGrad = weightedSum > 0 ? 1 : 0

  // Apply the reluGrad to the gradients we had before
  const gradInput = weights.map(w => w * error * reluGrad)
  const gradWs = input.map(x => x * error * reluGrad)
  const gradB = error * reluGrad
  // ...
}
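
Putting the fragments together, a complete neuron might look something like the sketch below. Treat it as one possible assembly rather than the definitive version: `getRandomFloat` is not shown in the article, so its body here is an assumption, and the `init` parameter is a hypothetical addition that lets us try the neuron with fixed starting values.

```javascript
// Assumed helper: returns a float in the range [-1, 1)
function getRandomFloat() {
  return Math.random() * 2 - 1;
}

// `init` is a hypothetical extra parameter; it defaults to random initialisation
function neuron(dimensions, init = getRandomFloat) {
  let weights = Array.from({ length: dimensions }, () => init());
  let bias = init();

  // ReLU activation
  function activate(x) {
    return x > 0 ? x : 0;
  }

  return {
    predict(input) {
      let dotProduct = 0;
      for (let i = 0; i < input.length; i++) {
        dotProduct += weights[i] * input[i];
      }
      const weightedSum = dotProduct + bias;
      const prediction = activate(weightedSum);

      return {
        prediction,
        // `input` and `weightedSum` are available via the closure
        update(error, learningRate) {
          const reluGrad = weightedSum > 0 ? 1 : 0;
          const gradInputs = weights.map(w => w * error * reluGrad);
          const gradWs = input.map(x => x * error * reluGrad);
          const gradB = error * reluGrad;

          weights = weights.map((w, i) => w - learningRate * gradWs[i]);
          bias -= learningRate * gradB;

          return gradInputs;
        },
      };
    },
  };
}

// Usage with fixed starting values of 0.5 for both weights and the bias
const n = neuron(2, () => 0.5);
const { prediction, update } = n.predict([1, 1]);
const gradInputs = update(0.5, 0.1);
console.log(prediction, gradInputs, n.predict([1, 1]).prediction);
```

With those starting values the first prediction is 1.5. After one update with an error of 0.5 and a learning rate of 0.1, the same input gives a slightly lower prediction.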

One thing to note: now our neurons all use ReLU, including the output neuron. For binary classification, like XOR, we actually want to output 0 or 1, but ReLU will output any non-negative number (0, or a positive value like 47.3 or 0.02).

During training, we can work with these raw values—the gradients still flow correctly. But when we want to interpret the final predictions as classifications, we apply a threshold: outputs > 0 become class 1, otherwise class 0:

// Final predictions
const { prediction, update } = output.predict(
  // ...
)
// Convert to a binary classification
const classification = prediction > 0 ? 1 : 0

In practice, you’ll often want different layers to use different activation functions. For example, you may want probability outputs (between 0 and 1) or other specialized behaviors. Our neurons currently only support one calculation. In the next chapter, we’ll look at how to build this flexibility into our architecture.

Now we have everything we need to solve the XOR problem.

/**
* Using an MLP to solve the XOR problem
*/
import {data} from './data.js'
import {neuron} from './neuron.js'

// Our MLP: two hidden neurons and
// an output neuron
const hiddenLayer = [
 neuron(2),
 neuron(2),
]
const output = neuron(2);

// Training loop
function train(
  hyperParams = {
    learningRate: 0.1,
    epochs: 1000,
  }
) {
  const { epochs, learningRate } = hyperParams;

  for (let i = 0; i < epochs; i++) {
    let totalLoss = 0;

    for (let { x, y } of data) {
      const hiddenLayerOutput = hiddenLayer.map(n => n.predict(x))

      const { prediction, update } = output.predict([
        ...hiddenLayerOutput.map(( {prediction} ) => prediction)
      ]);

      const error = prediction - y;
      // Loss: use squared error to keep the number positive
      totalLoss += error * error;

      if (error) {
        const hiddenLayerErrors = update(error, learningRate);
        hiddenLayerErrors.forEach(
          (e, i) => hiddenLayerOutput[i].update(e, learningRate)
        )
      }
    }

    if (i % 50 === 0) {
      console.log('epoch ' + i + ', loss: ' + totalLoss.toFixed(4));
    }
  }
}

train();

// Test final predictions
console.log("Final predictions:");
for (let { x, y } of data) {
  const hiddenLayerOutput = hiddenLayer.map(n => n.predict(x))
  const out = output.predict([
    ...hiddenLayerOutput.map(( {prediction} ) => prediction)
  ]);

  // Apply threshold for binary classification
  const classification = out.prediction > 0 ? 1 : 0;

  console.log(
     'x: ' + x.join(',') +
     ', expected: ' + y +
     ', predicted: ' + classification +
     ' (raw: ' + out.prediction.toFixed(3) + ')'
   );
}
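
The listing imports `data` from './data.js' without showing it. Judging by the `{ x, y }` destructuring in the training loop, it could look something like the sketch below. This is my guess at its shape, not the original file.

```javascript
// Hypothetical contents of data.js: the four XOR samples.
// In the real module this array would be exported (export const data = ...).
const data = [
  { x: [0, 0], y: 0 },
  { x: [0, 1], y: 1 },
  { x: [1, 0], y: 1 },
  { x: [1, 1], y: 0 },
];

console.log(data.length); // 4
```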

Notes on the code

Number of neurons

You’ll see that the code runs, but does not always solve the XOR correctly.

This is because there are only two neurons. If one neuron ends up in a position where w1*x1 + w2*x2 + bias < 0 for all inputs, then it cannot activate — it gets ‘stuck’. This is called a ‘dead’ neuron: it cannot be trained out of this position.

We can mitigate this by initialising the bias with a positive number, which makes it more likely that wx + b > 0, meaning the neuron can activate and learn.
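
As a quick sketch of what ‘dead’ means in practice, the snippet below counts how many of the four XOR inputs give a positive weighted sum for a given set of parameters. The parameter values are hand-picked to be unlucky, and `activeInputs` is a hypothetical helper.

```javascript
const xorInputs = [[0, 0], [0, 1], [1, 0], [1, 1]];

// How many inputs give a positive weighted sum (i.e. a ReLU gradient of 1)?
function activeInputs(weights, bias) {
  return xorInputs.filter(
    input => input.reduce((acc, x, i) => acc + weights[i] * x, bias) > 0
  ).length;
}

// Negative weights and a negative bias: never active, so every gradient is 0
console.log(activeInputs([-1, -1], -0.5)); // 0

// Same weights with a positive bias: active for [0, 0], so it can still learn
console.log(activeInputs([-1, -1], 0.5)); // 1
```

A count of 0 means `reluGrad` is 0 for every sample, so no update can ever move the neuron.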

Try adding a third hidden neuron to introduce some redundancy — this way, if one gets stuck, we still have enough neurons to create a decision boundary that works.

  // Line 9
  const hiddenLayer = [
    neuron(2),
    neuron(2),
+   neuron(2),
  ]
- const output = neuron(2);
+ const output = neuron(3);

Experiment a bit; you’ll see that we sometimes get stuck even with 3 neurons. However, increasing to 4 neurons usually means we have a really robust network.

The lesson here is that redundancy helps: more neurons means more lines to contribute to the curved decision boundary, and we can still recover if some neurons get stuck.

Hyperparameters

I’ve also added hyperparameters, as we covered in part 1, including number of epochs and a learning rate to control step size. Remember that the learning rate is the size of the ‘steps we take down the hill’:

too large means we will overshoot the valley and start climbing up the other side
too small means we may never reach the valley, so training fails to converge.

As you add neurons, you usually need to reduce the learning rate. With more weights and biases to update simultaneously, each update has greater potential to interfere with others.

In basic terms, you can think of this as many small changes (i.e. more nudges across more neurons) accumulating into a big change, so reducing the size of the nudge mitigates this.
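
To get a feel for step size on its own, here is a toy sketch that runs gradient descent on a single-parameter loss, $w^2$, whose gradient is $2w$. It is a stand-in for the real network, and the three learning rates are values I picked to show each behaviour.

```javascript
// Minimise loss(w) = w * w, whose gradient is 2 * w
function descend(learningRate, steps = 50) {
  let w = 1;
  for (let i = 0; i < steps; i++) {
    w -= learningRate * 2 * w;
  }
  return w;
}

console.log(descend(1.1));   // too large: overshoots further on every step
console.log(descend(0.1));   // settles very close to 0
console.log(descend(0.001)); // too small: has barely moved from 1
```

The valley is at w = 0. Only the middle learning rate gets there within 50 steps.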

Gradient calculation differences

In Part 1, we processed all samples together as a batch: we normalised them using the mean, and used Mean Squared Error to calculate loss across the entire batch. This had implications:

When training, we accumulated gradients across all the samples in the batch and divided by the size (e.g. gradW *= 2 / inputs.length).
When making final predictions, we had to normalise the inputs, since the model had been trained on normalised data. This meant baking the mean into the model using model.setMean().

Here, we’re running predict on one input at a time, and updating weights after each sample (this is known as online learning or stochastic gradient descent).

This means:

No averaging across samples (we update after each one)
The factor of 2 from squared error could be included, but since it’s a constant that applies to all weights, it’s effectively absorbed into the learning rate
Learning rates may need adjustment compared to Part 1 due to the different gradient scales.

Both approaches are valid. Batch processing might be better for larger datasets where averaging helps handle outliers; online learning works well here for simplicity. A lot of ML is empirical, so understanding different approaches helps you adapt to different scenarios.
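
The difference is easiest to see side by side. The sketch below trains a one-weight model, y = w * x, for a single epoch in each style. The dataset, learning rate and function names are assumptions made up for this comparison.

```javascript
const samples = [
  { x: 1, y: 2 },
  { x: 2, y: 4 },
  { x: 3, y: 6 },
];
const learningRate = 0.05;

// Batch: accumulate the gradient over every sample, average, update once
function batchEpoch(w) {
  let gradW = 0;
  for (const { x, y } of samples) {
    gradW += (w * x - y) * x;
  }
  gradW *= 2 / samples.length;
  return w - learningRate * gradW;
}

// Online: update straight after each sample (factor of 2 left out)
function onlineEpoch(w) {
  for (const { x, y } of samples) {
    const error = w * x - y;
    w -= learningRate * error * x;
  }
  return w;
}

console.log(batchEpoch(0), onlineEpoch(0));
```

Both start from w = 0 and both move towards the true value of 2, but by different amounts, which is why a learning rate tuned for one style may need adjusting for the other.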

Visualising convergence on non-linear data

Below is a visualisation of our model training on a synthetic coffee dataset. The ‘balanced’ cups are clustered in a target zone, while points further out are under- or over-extracted. Have a go at adjusting hidden neurons and learning rate, and watch how the boundary converges (or oscillates).

[Interactive: training an MLP on synthetic coffee data. Toy dataset: ‘balanced’ cups sit near the center; farther points represent under/over extraction. Legend: decision boundary, balanced, under/over. Controls: hidden neurons (shown at 8) and learning rate (shown at 0.025). Readouts: epoch, loss and accuracy.]

Summary

The MLP is a key building block of many modern AI architectures, and with good reason. It brings together some powerful tools:

linear combinations of inputs, and the ability to create decision boundaries using lines
activation functions, which avoid linear collapse and allow the lines to be composed into curves
backpropagation, so that individual neurons in the network can be corrected and converge towards a correct output

If we add sufficient neurons to our network, we can approximate extremely complex boundaries and, theoretically, this means we can model almost any data. This result is known as the universal approximation theorem.

Ingredient	What you get
Stacking without non-linearity	Useless — collapses to single layer
Non-linearity without backprop	Works in theory, untrainable in practice
Backprop without non-linearity	Just a linear model with extra steps
All three together	Universal approximation

However, there are practical limits to the depth of layers like this. As you add more neurons, you:

need a smaller learning rate
increase training time and compute (and expense!)

These trade-offs become very real when, for example, you need to classify an image containing hundreds or thousands of pixels. Each pixel is an input, but not every pixel carries useful information about the image.

Maybe we need layers that can filter out information, and that don’t use the $y=mx+c$ equation for their calculations. Maybe we want to swap in a different activation function here or there, or remove it altogether.

Right now, however, the calculations are baked into the neurons. In order to enable the flexibility we will need, we will have to abstract this away. This is how modern ML libraries like TensorFlow and PyTorch work.

In the next chapter, we will build an autograd engine from scratch that will allow us to do exactly this.