Part 5: The Convolutional Neural Network

Sliding windows applied to neural nets

In the previous chapter, we built out an example autograd engine that will allow us to separate the concerns of the architecture (how the neurons are laid out to solve different problems) and the computation (dealing with the maths).

Let’s look at a new kind of architecture that illustrates the use case: the Convolutional Neural Network (CNN).

The limitations of the MLP

While very powerful, the MLP (also known as a ‘dense’, ‘fully connected’ or ‘linear’ layer) has some limitations.

First, all the neurons in each layer are connected with all the neurons in the next. This means the number of weights grows with the product of the layer sizes, so large inputs need a lot of time and compute to train.

As an example, consider our T vs L classifier from chapter 2 - we used an array of 16 pixels for our input:

const T = [
 1,1,1,1,
 0,1,0,0,
 0,1,0,0,
 0,1,0,0,
]

That’s 16 inputs. What if we needed to scale up to an image that was a more realistic size, like 200 pixels square? Suddenly that’s 200 * 200 = 40,000 inputs. And in real life, images are multi-dimensional: they have layers corresponding to the colour channels: red, green and blue. So a full colour image at 200 px square would require 200 * 200 * 3 = 120,000 inputs. You can see how this gets intractable very quickly.
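
To make the scaling concrete, here is a rough sketch that counts the inputs for each image size and the parameters a first dense layer would need. The helper names and the layer width of 100 neurons are assumptions chosen purely for illustration; they are not part of the code from earlier chapters.

```javascript
// Hypothetical helpers, not part of the chapter's engine.

// Number of input values for an image of the given dimensions.
function inputCount(width, height, channels = 1) {
  return width * height * channels
}

// A dense layer connects every input to every neuron,
// and each neuron has one bias.
function denseParameterCount(inputs, neurons) {
  return inputs * neurons + neurons
}

const toyInputs = inputCount(4, 4)           // 16
const greyInputs = inputCount(200, 200)      // 40000
const colourInputs = inputCount(200, 200, 3) // 120000

// Assumed layer width of 100 neurons, purely for comparison.
console.log(denseParameterCount(toyInputs, 100))    // 1700
console.log(denseParameterCount(colourInputs, 100)) // 12000100
```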

The second issue is that MLPs are not spatially aware. If we imagine a classifier that needs to identify whether a picture contains a cat, the MLP could easily learn that a given area of the image contains the head of a cat. However, this knowledge is not transferable: it wouldn’t necessarily be able to detect the head of a cat somewhere else in the image.

The property we are missing is known as ‘translational invariance’: the ability to recognise a feature regardless of where it appears. For some forms of data (e.g. tabular data), location is everything, and a feature that is not in a given place is not a feature. For other forms of data, like images or sound waves, a feature can be valid anywhere in the input.

We need an architecture that can take a feature like fur or ears, and spot it anywhere in the image.

Convolution

Convolution is a mathematical operation that combines two lists of numbers: one is slid across the other and, at each position, the overlapping values are multiplied together and summed. The result is a new list that describes a sort of superimposed version of the inputs.

It is used a lot in calculations for probability and for image processing.

In programming, you will run into this idea whenever you need to use a ‘sliding window’ — we move a fixed-size window across data, one step at a time, and do something at each position.

// Use a sliding window to calculate the average of each 'patch' of 
// values in the input array.
function movingAverage(values, windowSize) {
  const result = []
  for (let i = 0; i <= values.length - windowSize; i++) {
    const patch = values.slice(i, i + windowSize)
    const sum = patch.reduce((acc, val) => acc + val, 0)
    const average = sum / windowSize
    result.push(average)
  }
  return result
}

const signal = [2, 4, 6, 8, 10, 12, 14]
const smoothed = movingAverage(signal, 3) // [4, 6, 8, 10, 12]

You can imagine applying any sort of useful mathematical operation to a stream of numbers in this way.

In the component below, you can toggle between three kernels (windows) containing pre-set numbers (weights) that are applied across the input to produce an output:

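[Interactive component: a 3-weight kernel slides across an input signal to produce an output signal. Example at position 30: 0.33 × -0.55 + 0.33 × -0.76 + 0.33 × -0.96 = -0.75. With equal weights, the kernel averages the 3 values in the window, smoothing the signal.]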
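
The same idea can be sketched in code by generalising movingAverage so that the window carries its own weights. This is only an illustrative sketch: the function name slideKernel and the ‘difference’ kernel are assumptions, not the component’s actual implementation. Like most deep learning code, it does not flip the kernel before sliding it.

```javascript
// Slide a kernel of weights across a 1D signal. At each position,
// multiply each value in the window by the matching weight and sum.
function slideKernel(values, kernel) {
  const result = []
  for (let i = 0; i <= values.length - kernel.length; i++) {
    let total = 0
    for (let k = 0; k < kernel.length; k++) {
      total += values[i + k] * kernel[k]
    }
    result.push(total)
  }
  return result
}

const ramp = [2, 4, 6, 8, 10, 12, 14]

// Equal weights: the same moving average as before
// (up to floating point rounding).
const averaged = slideKernel(ramp, [1 / 3, 1 / 3, 1 / 3])

// Hypothetical "difference" kernel: right neighbour minus left neighbour.
// On a steadily rising signal every output is the same.
const differences = slideKernel(ramp, [-1, 0, 1]) // [4, 4, 4, 4, 4]
```

Changing the weights changes what the window measures, which is exactly the part a CNN learns.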

You may be starting to sense that we could apply this in some way to image classification, and you’d be right: What if we could train the kernel?

A CNN is simply this: machine learning applied to the idea of sliding a fixed-size kernel across an input, with the kernel’s weights learned during training. If the kernel learns to respond to a cat’s face, then it can detect the cat anywhere in the input.

Additionally, training a smaller number of weights in a kernel is going to be far, far more efficient than trying to train weights for every single item in the input.

T and L revisited

Let’s expand our problem set from chapter 2, and make the T and L images larger — to 8px by 8px. This will allow us to experiment with translational invariance (plenty of room for T and L characters to appear anywhere) and efficiency (64 pixels input flowing through the model).

We will keep to ‘greyscale’ images, i.e. without the red, green and blue dimensions. This will still give us plenty to play with, but allow us to keep the concepts clear.

Here is a sample training set for our model. Notice how each letter appears at a different position in the grid — this is exactly the kind of variation our model needs to handle:

[Figure: Training samples, showing the same letters in different positions. Four T samples and four L samples.]

CNN architecture

A CNN is composed of several different layers, each performing a task. Different models will use different ‘recipes’ of these layers depending on the task, but these layers will usually be present in some form.

Before we build our CNN, let’s have a quick overview of the layers, their roles and their common names so we know what we are building.

Convolutional layer: Conv2d

As discussed, the key idea here is to use a sliding window (the kernel) that moves across the input, evaluating a patch at a time. This bakes in the assumption that patterns in the input are translation invariant - i.e. a feature learned once can appear anywhere in the input. This drastically reduces the number of parameters compared to a fully connected layer.

In practice, 3x3 is the most common kernel size. It is small enough to keep the parameter count low while still covering each pixel’s immediate neighbours, and it is a good default for most situations.

RGB vs Greyscale images

A greyscale image is simply an array of pixels, where each pixel has a value corresponding to the darkness of the colour.

Colour images have 3 values per pixel (R, G, B) instead. So for a colour image, our 3x3 kernel would have shape:

    R channel      G channel      B channel
    ┌─────┐        ┌─────┐        ┌─────┐
    │■ ■ ■│        │■ ■ ■│        │■ ■ ■│
    │■ ■ ■│   +    │■ ■ ■│   +    │■ ■ ■│   = 27 input values
    │■ ■ ■│        │■ ■ ■│        │■ ■ ■│
    └─────┘        └─────┘        └─────┘

So each 3x3 kernel has 27 weights, plus 1 bias term = 28 parameters to learn.

const rgbKernel3x3 = [
 // Red
 [[1,2,3],[4,5,6],[7,8,9]],
 // Green
 [[1,2,3],[4,5,6],[7,8,9]],
 // Blue
 [[1,2,3],[4,5,6],[7,8,9]],
]

A greyscale image has a single channel, so a 3x3 kernel would have 9 + 1 = 10 parameters:

const greyscaleKernel3x3 = [
 [[1,2,3],[4,5,6],[7,8,9]]
]
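
The two parameter counts above follow the same rule, which can be written as a one-line helper. The name kernelParameterCount is an assumption for illustration only.

```javascript
// Hypothetical helper: learnable parameters in a single kernel.
// One weight per kernel cell per input channel, plus one shared bias.
function kernelParameterCount(kernelSize, inChannels) {
  return kernelSize * kernelSize * inChannels + 1
}

const rgbParams = kernelParameterCount(3, 3)       // 28
const greyscaleParams = kernelParameterCount(3, 1) // 10
```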

Let’s keep to greyscale for our toy model, to keep things simple.

Kernel output

The kernel will produce a single output value, which is the dot product of the values within the kernel and the corresponding input values, plus the bias.

With training, a kernel’s weights adapt to emphasise certain features in the input, for example ‘edges’ (areas around ‘on’ pixels).

The interactive tool below shows how this works:

[Interactive component: a 3×3 kernel slides across an 8×8 input to produce a 6×6 output.]

Input (8×8):
0 0 0 0 0 0 0 0
0 1 1 1 1 1 0 0
0 0 0 1 0 0 0 0
0 0 0 1 0 0 0 0
0 0 0 1 0 0 0 0
0 0 0 1 0 0 0 0
0 0 0 1 0 0 0 0
0 0 0 0 0 0 0 0

Kernel (3×3):
+1 +1 +1
+1 +1 +1
+1 +1 +1

Position (0,0), 1 of 36:
0×+1 + 0×+1 + 0×+1 + 0×+1 + 1×+1 + 1×+1 + 0×+1 + 0×+1 + 0×+1 = 2

Output (6×6):
2 4 4 4 2 1
2 5 5 5 2 1
0 3 3 3 0 0
0 3 3 3 0 0
0 3 3 3 0 0
0 2 2 2 0 0

This kernel counts filled neighbours, producing a blurred version of the input.
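
The blur example above can be reproduced with plain numbers. This is a minimal sketch that assumes stride 1, no padding and no bias; it uses ordinary numbers rather than the autograd nodes we will use later, and the name convolve2d is hypothetical.

```javascript
// Plain-number version of a single-kernel convolution
// (stride 1, no padding, no bias). Illustrative only.
function convolve2d(input, kernel) {
  const k = kernel.length
  const outRows = input.length - k + 1
  const outCols = input[0].length - k + 1
  const output = []
  for (let row = 0; row < outRows; row++) {
    const outRow = []
    for (let col = 0; col < outCols; col++) {
      // Dot product of the kernel with the patch at (row, col)
      let total = 0
      for (let kr = 0; kr < k; kr++) {
        for (let kc = 0; kc < k; kc++) {
          total += input[row + kr][col + kc] * kernel[kr][kc]
        }
      }
      outRow.push(total)
    }
    output.push(outRow)
  }
  return output
}

const letterT = [
  [0, 0, 0, 0, 0, 0, 0, 0],
  [0, 1, 1, 1, 1, 1, 0, 0],
  [0, 0, 0, 1, 0, 0, 0, 0],
  [0, 0, 0, 1, 0, 0, 0, 0],
  [0, 0, 0, 1, 0, 0, 0, 0],
  [0, 0, 0, 1, 0, 0, 0, 0],
  [0, 0, 0, 1, 0, 0, 0, 0],
  [0, 0, 0, 0, 0, 0, 0, 0],
]
const onesKernel = [
  [1, 1, 1],
  [1, 1, 1],
  [1, 1, 1],
]

const blurred = convolve2d(letterT, onesKernel)
console.log(blurred[0]) // [2, 4, 4, 4, 2, 1]
console.log(blurred[1]) // [2, 5, 5, 5, 2, 1]
```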

Composing multiple kernels

Multiple such kernels can be used to produce multiple output channels, allowing the network to learn different features. For example the first kernel might learn to detect edges, the second to detect textures, etc. These are then built by the network into more complex feature detectors, layer by layer:

[Figure: A left-curve detector is excited by left curves (matching weights, strong response) and inhibited by opposing curves (right curve, same weights, suppressed response).]

These outputs become the input to the next layer, which combines them:

[Figure: a Layer 1 left-curve detector and a Layer 1 right-curve detector feed into a Layer 2 invariant curve detector.]

Boundaries, stride and padding

Since we are sliding the kernel across the input, we can’t place it where the window would extend past the edge. Without padding, this means the output is always smaller than the input, and edge pixels get less attention from the kernel.

Stride controls how many steps the kernel takes between positions. A stride of 1 visits every position; a stride of 2 skips every other one, roughly halving the output size in each dimension.

Padding adds a border of zeros around the input so the kernel can visit edge pixels more evenly. Since the padded values are zero, they don’t affect the dot product.

Try toggling these in the explorer below and watch how the number of valid positions changes:

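[Interactive component: padding and stride explorer, shown with padding 0 and stride 1.]

Input (5×5):
1 1 1 1 1
0 0 1 0 0
0 0 1 0 0
0 0 1 0 0
0 0 1 0 0

Current: kernel at (0,0), position 1 of 9
Output size: (5 - 3 + 2×0) / 1 + 1 = 3, so 3×3 (9 positions)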

For our 8x8 greyscale T / L classifier, we will use a stride of 1 and no padding, so the output size formula simplifies to inputSize - kernelSize + 1. Each 3x3 kernel will output a 6x6 matrix (8 - 3 + 1 = 6).
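
The output size formula is easy to check in code. This is a small sketch with a hypothetical helper name; it rounds down when the stride does not divide evenly, which matches how the pooling layer later in this chapter computes its size.

```javascript
// Hypothetical helper implementing the output size formula:
//   floor((input - kernel + 2 * padding) / stride) + 1
function convOutputSize(inputSize, kernelSize, stride = 1, padding = 0) {
  return Math.floor((inputSize - kernelSize + 2 * padding) / stride) + 1
}

const classifierSize = convOutputSize(8, 3)    // 6: our T / L classifier
const explorerSize = convOutputSize(5, 3)      // 3: a 5x5 input gives 3x3 = 9 positions
const paddedSize = convOutputSize(8, 3, 1, 1)  // 8: padding of 1 preserves the input size
const stridedSize = convOutputSize(8, 3, 2)    // 3: stride 2 roughly halves the output
```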

Max pooling layer: MaxPool2D

After the conv2d layer, the next layer is a Max pooling layer.

The idea here is to reduce the size of the input and force the network to focus on the most important features. It also adds positional invariance, since the exact location of a feature becomes less important after pooling.

This is done by taking the maximum value from a window (poolSize x poolSize) and discarding the rest.

[Figure: max pooling a 4×4 input with 2×2 regions.]

Input (4×4), split into 2×2 regions:
1 3 | 2 4
0 2 | 1 3
----+----
5 6 | 7 8
4 2 | 6 5

Output (2×2):
3 4
6 8
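
The figure’s numbers can be reproduced with a plain-number sketch. This version uses Math.max on ordinary numbers, so it only illustrates the forward calculation; the function name maxPool is an assumption, and the real layer built later in the chapter needs a gradient-aware max.

```javascript
// Plain-number max pooling, for illustration only.
function maxPool(input, poolSize = 2, stride = 2) {
  const outRows = Math.floor((input.length - poolSize) / stride) + 1
  const outCols = Math.floor((input[0].length - poolSize) / stride) + 1
  const output = []
  for (let row = 0; row < outRows; row++) {
    const outRow = []
    for (let col = 0; col < outCols; col++) {
      // Keep only the largest value in this window
      let best = -Infinity
      for (let r = 0; r < poolSize; r++) {
        for (let c = 0; c < poolSize; c++) {
          best = Math.max(best, input[row * stride + r][col * stride + c])
        }
      }
      outRow.push(best)
    }
    output.push(outRow)
  }
  return output
}

const grid4x4 = [
  [1, 3, 2, 4],
  [0, 2, 1, 3],
  [5, 6, 7, 8],
  [4, 2, 6, 5],
]
const pooled = maxPool(grid4x4) // [[3, 4], [6, 8]]
```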

Flatten and Dense layers

A ‘flatten’ layer simply converts a multi-dimensional array to 1D. This means we can take a multidimensional output (from each MaxPool2D layer) and turn it into a flat input suitable for a linear or dense layer.

So if we had two MaxPool2D outputs at 2x2 each, we could flatten them into a single vector of 8 values.

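[Figure: two 2×2 pooled outputs (shape 2 × 2 × 2) flattened into a single list of 8 values.]

Before: [[3, 4], [6, 8]] and [[2, 5], [3, 7]]
After: [3, 4, 6, 8, 2, 5, 3, 7]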
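
In JavaScript the flatten step is close to a one-liner. The sketch below uses the figure’s values; the variable names are illustrative only.

```javascript
const pooledA = [[3, 4], [6, 8]]
const pooledB = [[2, 5], [3, 7]]

// Two channels of 2x2 form a 2 x 2 x 2 structure. flat(2) removes
// two levels of nesting, leaving a single list of 8 values.
const flattened = [pooledA, pooledB].flat(2)
// [3, 4, 6, 8, 2, 5, 3, 7]
```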

We already know the ‘dense’ layer: it’s an MLP. In a CNN, dense layers usually sit at the end of the model, where they learn to combine the features that filter through the other layers.

[Figure: the 8 flattened values (3, 4, 6, 8, 2, 5, 3, 7) feed a dense layer of 4 neurons, which feeds 2 outputs: T and L.]

Building out the conv2d layer

We can use our autograd primitives from the previous chapter to build this model: value, multiply, sum, and so on. However, we will need a couple of extra methods.

A dot product helper will tidy up some of the calculations:

// Calculate a dot product
function dot(a, b) {
   return a.reduce(
     (total, a_i, i) => sum(total, multiply(a_i, b[i])),
     toNode(0)
   );
}

…and another area we can tidy up is with weight initialisation. Up to now, we’ve been using a random number between -1 and 1 for our weights, but in practice there are better ways to do this.

Layers with more inputs tend to produce larger weighted sums and outputs, risking exploding gradients. A more stable approach is to use a method that takes into account the size of the input layer.

We can use Kaiming He initialisation, which scales the random weights to help keep gradients stable. There are many initialisation functions used in the wild; this one is relatively simple and serves our purpose.

Note the terminology: ‘fan in’ refers to the number of inputs feeding into each neuron (or, for a kernel, into each output value).

/**
* He initialization - specifically designed for ReLU activations.
* Scales weights based on the number of input connections to prevent
* vanishing/exploding gradients.
*
* Formula: weight ~ Uniform(-limit, limit) where:
*   limit = sqrt(6 / fan_in)
*
* @param fanIn - Number of input connections 
*               (for conv: inChannels * kernelSize^2)
*/
function heInitialize(fanIn) {
  const limit = Math.sqrt(6 / fanIn);
  return toNode((Math.random() * 2 - 1) * limit);
}
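
To get a feel for the numbers, here is a plain-number sketch of the same sampler without the toNode wrapper, so that it runs on its own. The name heUniformSample is an assumption for illustration.

```javascript
// Plain-number version of the He uniform sampler (no toNode wrapper).
function heUniformSample(fanIn) {
  const limit = Math.sqrt(6 / fanIn)
  return (Math.random() * 2 - 1) * limit
}

// The limit shrinks as the number of inputs grows.
const limit3x3 = Math.sqrt(6 / 9)     // about 0.816 for a 3x3 greyscale kernel
const limit3x3Rgb = Math.sqrt(6 / 27) // about 0.471 for a 3x3 RGB kernel

// Every sample falls within the limit.
const samples = Array.from({ length: 1000 }, () => heUniformSample(9))
const allWithinLimit = samples.every(w => Math.abs(w) <= limit3x3)
console.log(allWithinLimit) // true
```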

This means we can set up our conv2dLayer function with all the internal values we will need.

Regarding parameters, we will need inputSize, kernelSize, stride and padding.

Note that you will normally see inputChannels as a parameter, but since we are working with greyscale images (a single channel), we can ignore it for simplicity.

Similarly, a convolutional layer would usually have multiple kernels, each producing a different feature map as an output. These are called the output channels. Again, we will ignore this for now and just build a single kernel that produces a single output channel.

function conv2dLayer(
  inputSize = [8, 8], 
  kernelSize = 3, 
  stride = 1, 
  padding = 0
) {
  // Create our 3x3 sliding window, initialised with He initialization
  const fanIn = kernelSize * kernelSize 
  const kernel = Array.from({ length: fanIn }, () => heInitialize(fanIn))
  // One bias for now
  const bias = toNode(0)
  // Calculate the output size based on the input size, kernel size, 
  // stride and padding
  const [inputRows, inputCols] = inputSize
  // Output size formula: (input - kernel + 2 * padding) / stride + 1
  // With stride = 1 and padding = 0, this simplifies to:
  const outputRows = inputRows - kernelSize + 1
  const outputCols = inputCols - kernelSize + 1

  return {
   forward(input) {
     // TODO
   },
   parameters() {
     // We will need to return all the parameters (weights and bias) 
     // for training
     return [...kernel, bias]
   }
  }
}

Ready for the forward pass.

The forward pass

The sliding window

Let’s sketch out the code that will form the sliding window.

This can often turn into a soup of nested loops, so we will break it down into helper functions to keep it clean and easy to reason about. Also, this operation is used again in the pooling layer, so it will be good to make it reusable.

Let’s make a getPatch helper that extracts a ‘patch’ of values of a specified size from the input array at a given position. This is the core “sliding window” operation.

function getPatch(input, row, col, size) {
  const patch = []
  for (let pRow = 0; pRow < size; pRow++) {
    for (let pCol = 0; pCol < size; pCol++) {
      patch.push(input[row + pRow][col + pCol])
    }
  }
  return patch
}
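
As a quick check of what getPatch returns, here it is applied to a small grid. The function is repeated so the snippet runs on its own, and the 4x4 grid is just the chapter 2 T laid out in 2D for illustration.

```javascript
// getPatch as defined above, repeated so this snippet runs on its own.
function getPatch(input, row, col, size) {
  const patch = []
  for (let pRow = 0; pRow < size; pRow++) {
    for (let pCol = 0; pCol < size; pCol++) {
      patch.push(input[row + pRow][col + pCol])
    }
  }
  return patch
}

const smallT = [
  [1, 1, 1, 1],
  [0, 1, 0, 0],
  [0, 1, 0, 0],
  [0, 1, 0, 0],
]

// The patch is returned as a flat list, read row by row.
const topLeft = getPatch(smallT, 0, 0, 3)     // [1, 1, 1, 0, 1, 0, 0, 1, 0]
const bottomRight = getPatch(smallT, 1, 1, 3) // [1, 0, 0, 1, 0, 0, 1, 0, 0]
```

Returning a flat list means the patch lines up index for index with the flat kernel array, which is what the dot helper expects.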

Now we can put it all together in the forward pass. We create the output array, then slide the window across the input. For each patch, we calculate the dot product with the kernel and add the bias to the result.

function forward(input) {
  // The feature map (output) is the window of the input that we 
  // are able to slide the kernel over without going out of 
  // bounds. It will be smaller than the input.
  const featureMap = Array.from({ length: outputRows }, () =>
    Array.from({ length: outputCols })
  )

  for (let row = 0; row < outputRows; row++) {
    for (let col = 0; col < outputCols; col++) {
      const patch = getPatch(input, row, col, kernelSize)
      featureMap[row][col] = sum(dot(patch, kernel), bias)
    }
  }

  return featureMap
}

Let’s run it and see what happens. We’ll pick out a couple of positions to inspect — one where the kernel lands on part of the letter, and one where it lands on empty space:

import { dot, heInitialize, toNode, sum, multiply } from './engine.js'
import { getPatch } from './helpers.js'

function conv2dLayer(inputSize = [8, 8], kernelSize = 3) {
  const fanIn = kernelSize * kernelSize
  const kernel = Array.from({ length: fanIn }, () => heInitialize(fanIn))
  const bias = toNode(0)

  const [inputRows, inputCols] = inputSize
  const outputRows = inputRows - kernelSize + 1
  const outputCols = inputCols - kernelSize + 1

  return {
    forward(input) {
      const featureMap = Array.from({ length: outputRows }, () =>
        Array.from({ length: outputCols })
      )
      for (let row = 0; row < outputRows; row++) {
        for (let col = 0; col < outputCols; col++) {
          const patch = getPatch(input, row, col, kernelSize)
          featureMap[row][col] = sum(dot(patch, kernel), bias)
        }
      }
      return featureMap
    },

    parameters() {
      return [...kernel, bias]
    }
  }
}

// --- Run it ---

// A T in the top-left of an 8x8 grid
const T = [
  [0, 0, 0, 0, 0, 0, 0, 0],
  [0, 1, 1, 1, 1, 1, 0, 0],
  [0, 0, 0, 1, 0, 0, 0, 0],
  [0, 0, 0, 1, 0, 0, 0, 0],
  [0, 0, 0, 1, 0, 0, 0, 0],
  [0, 0, 0, 1, 0, 0, 0, 0],
  [0, 0, 0, 0, 0, 0, 0, 0],
  [0, 0, 0, 0, 0, 0, 0, 0],
]

const { forward, parameters } = conv2dLayer([8, 8], 3)
const featureMap = forward(T)

console.log(
  "Feature map: " +
  featureMap.length + " rows x " +
  featureMap[0].length + " cols"
)
console.log(
  "Parameters: " + parameters().length +
  " (9 kernel weights + 1 bias)"
)
console.log("")

// Show what the kernel sees at a few positions
function showPosition(label, row, col) {
  const patch = getPatch(T, row, col, 3)
  console.log(label)
  for (let r = 0; r < 3; r++) {
    console.log("  " + patch.slice(r * 3, r * 3 + 3).join(" "))
  }
  console.log("")
}

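// The patch at (0, 0) overlaps the top bar of the T, so use the
// bottom-right corner for a patch that really is empty.
showPosition("Position (5, 5) - empty corner:", 5, 5)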
showPosition("Position (0, 1) - top bar of T:", 0, 1)
showPosition("Position (1, 2) - bar meets stem:", 1, 2)

Max pooling layer

The max pooling layer works, as discussed above, by downsampling the input. It takes the maximum value from a window of the input, and discards the rest.

It needs to take an input size, a pool size (the size of the window), and a stride just like the convolutional layer, and it will also output an array.

As discussed, the conv2dLayer will output a 6x6 ‘feature map’, and our pooling layer will take that as an input.

We then traverse the 6x6 feature map using a 2×2 pool (window) with stride 2, taking just the maximum value from each pool. This means the output will be 3×3: each non-overlapping 2×2 block collapses to its maximum value.

Again, a helper or two here will simplify things.

A max primitive

We can’t just use Math.max here to extract the maximum value — the values flowing through it are node objects from our gradient engine. We also need gradients to flow back through the pooling layer during training.

Therefore, we will need to build a max primitive that takes two nodes as input, and outputs a node with the maximum value.

The gradient rule is simple: only the winning (largest) value gets the gradient, the rest get zero. This is the same intuition as ReLU.

export function max(a, b) {
  a = toNode(a)
  b = toNode(b)
  const node = toNode(
    Math.max(a.value, b.value), 
    [a, b]
  )
  node.op = 'max'
  node.backward = () => {
    // Gradient flows only to the winner
    if (a.value >= b.value) {
      a.grad += node.grad
    } else {
      b.grad += node.grad
    }
  }
  return node
}
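
To see the gradient rule in isolation, here is a self-contained sketch. It uses a minimal stand-in for the engine’s nodes (makeNode and maxOf are hypothetical names; the real toNode from the previous chapter does more than this), and simply shows which input receives the gradient.

```javascript
// Minimal stand-in for the engine's nodes: just a value and a gradient.
// The real toNode from the previous chapter does more than this.
function makeNode(value) {
  return { value, grad: 0, backward: () => {} }
}

// Same gradient rule as max above, written against the stand-in nodes.
function maxOf(a, b) {
  const node = makeNode(Math.max(a.value, b.value))
  node.backward = () => {
    // Gradient flows only to the winner (the first input wins ties)
    if (a.value >= b.value) {
      a.grad += node.grad
    } else {
      b.grad += node.grad
    }
  }
  return node
}

const small = makeNode(2)
const large = makeNode(5)
const winner = maxOf(small, large)

// Pretend the loss gradient arriving at this node is 1.
winner.grad = 1
winner.backward()

console.log(winner.value) // 5
console.log(small.grad)   // 0
console.log(large.grad)   // 1
```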

With that in place, the pool layer itself is straightforward. It’s another sliding window — we can reuse our getPatch helper — but instead of a dot product we just take the max of the patch:

function maxPool2dLayer(inputSize = [6, 6], poolSize = 2, stride = 2) {
  const [inputRows, inputCols] = inputSize
  const outputRows = Math.floor((inputRows - poolSize) / stride) + 1
  const outputCols = Math.floor((inputCols - poolSize) / stride) + 1

  return {
    forward(input) {
      const pooledMap = Array.from({ length: outputRows }, () =>
        Array.from({ length: outputCols })
      )

      for (let row = 0; row < outputRows; row++) {
        for (let col = 0; col < outputCols; col++) {
          // Top-left corner of this pooling window in the input
          const startRow = row * stride
          const startCol = col * stride
          const patch = getPatch(input, startRow, startCol, poolSize)
          pooledMap[row][col] = patch.reduce(
            (best, val) => max(best, val),
            toNode(-Infinity)
          )
        }
      }

      return pooledMap
    },
    parameters() {
      // Max pooling has no learnable parameters
      return []
    }
  }
}

On to the flatten and dense layers.

Flatten and dense layers

Flattening is a simple operation that takes a multi-dimensional input array and turns it into a 1D array.

We need to do this so we can feed the output of the convolutional and pooling layers into an MLP (dense layer), which needs a 1D vector as input.

In JavaScript, we can just use the built-in flat() array method for this. By default it flattens one level of nesting, which is all a 2D grid needs.

The MLP itself is the same as we have been building:

  /**
   * Multi-layer perceptron (AKA dense, linear or fully connected layer)
   * Note that the `layer` function is also pulled in from our earlier code. 
   */
  export function mlp(inputSize, layerSizes, activations = []) {
    const layers = []
    let prevSize = inputSize
    for (let i = 0; i < layerSizes.length; i++) {
      const size = layerSizes[i]
      const act = activations[i] || relu
      layers.push(layer(prevSize, size, act))
      prevSize = size
    }

    function forward(x) {
      let out = x
      for (let l of layers) {
        out = l.forward(out)
      }
      return out
    }

    function parameters() {
      return layers.flatMap((l) => l.parameters())
    }

    return { layers, forward, parameters }
  }

And with that, we have all the building blocks we need to construct our CNN.

The full architecture

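Our single-kernel CNN wires up as follows: an 8x8 input, a 3x3 convolution with ReLU (giving a 6x6 feature map), a 2x2 max pool (giving 3x3), a flatten step (9 values), and finally a dense network that goes from 9 inputs to 4 neurons to 1 output.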

We also need one final, small helper to convert our flat pixel T and L arrays into the 2D grids that the conv2dLayer expects.

function to2D(flat, size = 8) {
  const grid = []
  for (let r = 0; r < size; r++) {
    grid.push(flat.slice(r * size, (r + 1) * size))
  }
  return grid
}
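
A quick usage sketch, with to2D repeated so the snippet runs on its own. It uses the 16-pixel T from chapter 2 with size 4 rather than our 8x8 data, just to keep the example short.

```javascript
// to2D as defined above, repeated so this snippet runs on its own.
function to2D(flat, size = 8) {
  const grid = []
  for (let r = 0; r < size; r++) {
    grid.push(flat.slice(r * size, (r + 1) * size))
  }
  return grid
}

const flatT = [
  1, 1, 1, 1,
  0, 1, 0, 0,
  0, 1, 0, 0,
  0, 1, 0, 0,
]

const gridT = to2D(flatT, 4)
console.log(gridT[0]) // [1, 1, 1, 1]
console.log(gridT[3]) // [0, 1, 0, 0]
```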

The training loop uses MSE loss (mean squared error), which we covered in earlier chapters — we square the difference between prediction and target, average across all samples, then backpropagate and update.
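
As a reminder of the arithmetic, here is MSE on plain numbers. The predictions are made up, and the target encoding (1 for T, 0 for L) is an assumption for this sketch; the real training loop performs the same calculation with engine nodes so that gradients can flow back.

```javascript
// Plain-number mean squared error, for illustration only.
function meanSquaredError(predictions, targets) {
  const total = predictions.reduce(
    (acc, pred, i) => acc + (pred - targets[i]) ** 2,
    0
  )
  return total / predictions.length
}

// Hypothetical predictions for one T (target 1) and one L (target 0).
const loss = meanSquaredError([0.9, 0.2], [1, 0])
// (0.1^2 + 0.2^2) / 2 = 0.025, up to floating point rounding
```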

Try training the model with 200 epochs. You should see it struggle, misclassifying several samples. Then try bumping epochs to 2000 and running again.

import { toNode, sum, multiply, power, subtract, relu, max, dot, heInitialize, backward, identity } from './engine.js'
import { getPatch, to2D } from './helpers.js'
import { neuron, layer, mlp } from './model.js'
import { data } from './data.js'

// --- CNN layers ---

function conv2dLayer(inputSize = [8, 8], kernelSize = 3) {
  const fanIn = kernelSize * kernelSize
  const kernel = Array.from({ length: fanIn }, () => heInitialize(fanIn))
  const bias = toNode(0)

  const [inputRows, inputCols] = inputSize
  const outputRows = inputRows - kernelSize + 1
  const outputCols = inputCols - kernelSize + 1

  return {
    forward(input) {
      const featureMap = Array.from({ length: outputRows }, () =>
        Array.from({ length: outputCols })
      )
      for (let row = 0; row < outputRows; row++) {
        for (let col = 0; col < outputCols; col++) {
          const patch = getPatch(input, row, col, kernelSize)
          featureMap[row][col] = relu(sum(dot(patch, kernel), bias))
        }
      }
      return featureMap
    },
    parameters: () => [...kernel, bias],
    kernel,
  }
}

function maxPool2dLayer(inputSize = [6, 6], poolSize = 2, stride = 2) {
  const [inputRows, inputCols] = inputSize
  const outputRows = Math.floor((inputRows - poolSize) / stride) + 1
  const outputCols = Math.floor((inputCols - poolSize) / stride) + 1

  return {
    forward(input) {
      const pooledMap = Array.from({ length: outputRows }, () =>
        Array.from({ length: outputCols })
      )
      for (let row = 0; row < outputRows; row++) {
        for (let col = 0; col < outputCols; col++) {
          const patch = getPatch(input, row * stride, col * stride, poolSize)
          pooledMap[row][col] = patch.reduce(
            (best, val) => max(best, val),
            toNode(-Infinity)
          )
        }
      }
      return pooledMap
    },
    parameters: () => [],
  }
}

// --- Wire up the single-kernel CNN ---
// 8x8 -> conv 3x3 + relu -> pool 2x2 -> flatten (9) -> dense 9->4->1

const conv = conv2dLayer([8, 8], 3)
const pool = maxPool2dLayer([6, 6], 2, 2)
const dense = mlp(9, [4, 1], [relu, identity])

function cnnForward(flatInput) {
  const input = to2D(flatInput)
  const convOut = conv.forward(input)
  const poolOut = pool.forward(convOut)
  return dense.forward(poolOut.flat())
}

const params = [...conv.parameters(), ...dense.parameters()]
console.log("Parameters: " + params.length + " (conv: " + conv.parameters().length + ", dense: " + dense.parameters().length + ")")

// --- Train (try changing this to 2000!) ---

const epochs = 200
const lr = 0.01

for (let epoch = 0; epoch < epochs; epoch++) {
  let totalLoss = toNode(0)
  for (const { pixels, y } of data) {
    const pred = cnnForward(pixels)
    // MSE loss: mean of squared errors
    totalLoss = sum(totalLoss, power(subtract(pred, y), 2))
  }
  totalLoss = multiply(totalLoss, 1 / data.length)
  backward(totalLoss)
  params.forEach(p => { p.value -= lr * p.grad })

  if (epoch % 40 === 0) {
    console.log("Epoch " + epoch + ", loss: " + totalLoss.value.toFixed(4))
  }
}

// --- Results ---

console.log("")
let correct = 0
for (const { label, pixels, y } of data) {
  const pred = cnnForward(pixels)
  const cls = pred.value > 0.5 ? "T" : "L"
  if (cls === label) correct++
  console.log(label + " -> " + cls + " (raw: " + pred.value.toFixed(4) + ")")
}
console.log("\nAccuracy: " + correct + "/" + data.length)

// --- Show the learned kernel ---

const kw = conv.kernel.map(k => k.value)
console.log("\nKernel weights:")
for (let r = 0; r < 3; r++) {
  console.log("  " + kw.slice(r * 3, r * 3 + 3).map(w => (w >= 0 ? "+" : "") + w.toFixed(3)).join("  "))
}

// --- Render kernel heatmap (check 'Show preview') ---

import { renderHeatmap } from './viz.js'
renderHeatmap(kw, 3, "Learned kernel (3x3)")

Limitations of a single kernel

You should see that at higher epochs, the model will usually converge. This is because the single kernel eventually learns to do all the feature detection on its own. It is effectively multi-tasking: this produces an answer, but the representation is slightly brittle.

If we look at the feature maps — what the kernel actually ‘sees’ when it looks at a T or an L — from several runs at 8/8 accuracy, you can see this multi-tasking in action.

[Figure: What a single kernel sees. Feature maps from 3 separate training runs (2000 epochs), each showing the activations for a T input and an L input.]

You can see the ‘ghost’ of the letters in the activations. The kernel is responding to something about the shapes, but each run finds a different “something”.

The single kernel is trying to encode multiple features at once, and the dense layer has to untangle whatever it settles on.

This is a key insight: with one kernel, the model can solve the problem, but the learned representation is inconsistent and hard to interpret. We can’t look at it and clearly understand what it learned.

Adding more kernels

The fix is straightforward: give the conv2dLayer multiple kernels, each free to specialise.

The change to our code is small. Instead of a single kernel, we create an array of them:

// Before: single kernel
const kernel = Array.from({ length: fanIn }, () => heInitialize(fanIn))
const bias = toNode(0)

// After: multiple kernels
function conv2dLayer(
  inputSize = [8, 8], 
  kernelSize = 3,
  outChannels = 4 // New parameter: how many kernels (output channels)
) {
  const fanIn = kernelSize * kernelSize
  const kernels = Array.from({ length: outChannels }, () =>
    Array.from({ length: fanIn }, () => heInitialize(fanIn))
  )
  const biases = Array.from({ length: outChannels }, () => toNode(0))
  // ...
}

The forward pass now returns an array of feature maps — one per kernel. Each kernel slides across the same input independently:

forward(input) {
  // Each kernel produces its own feature map
  return kernels.map((kernel, ch) => {
    const featureMap = Array.from({ length: outputRows }, () =>
      Array.from({ length: outputCols })
    )
    for (let row = 0; row < outputRows; row++) {
      for (let col = 0; col < outputCols; col++) {
        const patch = getPatch(input, row, col, kernelSize)
        featureMap[row][col] = relu(sum(dot(patch, kernel), biases[ch]))
      }
    }
    return featureMap
  })
}

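The pooling layer applies the same max-pool to each channel independently. The flatten step also changes: we are now collapsing a 3D structure (channels × rows × cols) into 1D, so we flatten each channel and join the results with flatMap (calling .flat(2) on the whole structure would give the same result):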

// 4 channels × 3×3 pooled = 36 inputs to the dense layer
const flat = poolOut.flatMap(ch => ch.flat())

With 4 kernels, the dense layer now receives 36 inputs (4 × 9) instead of 9, giving it much richer information to classify with.
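
The two ways of flattening can be compared directly. This sketch uses plain numbers in place of engine nodes, and the stand-in data is made up purely so that the ordering is visible.

```javascript
// Stand-in for the pooled output: 4 channels, each a 3x3 grid.
// Plain numbers 0..35 are used here instead of engine nodes.
const pooledChannels = Array.from({ length: 4 }, (_, ch) =>
  Array.from({ length: 3 }, (_, row) =>
    Array.from({ length: 3 }, (_, col) => ch * 9 + row * 3 + col)
  )
)

// Flatten each channel, then concatenate the channels...
const viaFlatMap = pooledChannels.flatMap(ch => ch.flat())
// ...or remove two levels of nesting in one call.
const viaFlat2 = pooledChannels.flat(2)

console.log(viaFlatMap.length) // 36
console.log(viaFlatMap.every((v, i) => v === viaFlat2[i])) // true
```

Either way, the values come out channel by channel and row by row, so the dense layer always sees each feature at the same index.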

The final model

Here is the full multi-kernel model. Notice how much faster it converges — typically reaching 100% accuracy within 200-300 epochs:

import { toNode, sum, multiply, power, subtract, relu, max, dot, heInitialize, backward, identity } from './engine.js'
import { getPatch, to2D } from './helpers.js'
import { neuron, layer, mlp } from './model.js'
import { data } from './data.js'

// --- Multi-kernel conv layer ---

function conv2dLayer(inputSize = [8, 8], kernelSize = 3, outChannels = 4) {
 const fanIn = kernelSize * kernelSize
 const kernels = Array.from({ length: outChannels }, () =>
   Array.from({ length: fanIn }, () => heInitialize(fanIn))
 )
 const biases = Array.from({ length: outChannels }, () => toNode(0))

 const [inputRows, inputCols] = inputSize
 const outputRows = inputRows - kernelSize + 1
 const outputCols = inputCols - kernelSize + 1

 return {
   forward(input) {
     return kernels.map((kernel, ch) => {
       const featureMap = Array.from({ length: outputRows }, () =>
         Array.from({ length: outputCols })
       )
       for (let row = 0; row < outputRows; row++) {
         for (let col = 0; col < outputCols; col++) {
           const patch = getPatch(input, row, col, kernelSize)
           featureMap[row][col] = relu(sum(dot(patch, kernel), biases[ch]))
         }
       }
       return featureMap
     })
   },
   parameters: () => [...kernels.flat(), ...biases],
   kernels,
 }
}

function maxPool2dLayer(inputSize = [6, 6], poolSize = 2, stride = 2) {
 const [inputRows, inputCols] = inputSize
 const outputRows = Math.floor((inputRows - poolSize) / stride) + 1
 const outputCols = Math.floor((inputCols - poolSize) / stride) + 1

 return {
   forward(channels) {
     return channels.map(input => {
       const pooledMap = Array.from({ length: outputRows }, () =>
         Array.from({ length: outputCols })
       )
       for (let row = 0; row < outputRows; row++) {
         for (let col = 0; col < outputCols; col++) {
           const patch = getPatch(input, row * stride, col * stride, poolSize)
           pooledMap[row][col] = patch.reduce(
             (best, val) => max(best, val),
             toNode(-Infinity)
           )
         }
       }
       return pooledMap
     })
   },
   parameters: () => [],
 }
}

// --- Wire up the multi-kernel CNN ---
// 8x8 -> conv 3x3 (4 kernels) + relu -> pool 2x2 -> flatten (36) -> dense 36->8->1

const NUM_KERNELS = 4
const conv = conv2dLayer([8, 8], 3, NUM_KERNELS)
const pool = maxPool2dLayer([6, 6], 2, 2)
const dense = mlp(NUM_KERNELS * 9, [8, 1], [relu, identity])

function cnnForward(flatInput) {
 const input = to2D(flatInput)
 const convOut = conv.forward(input)
 const poolOut = pool.forward(convOut)
 const flat = poolOut.flatMap(ch => ch.flat())
 return dense.forward(flat)
}

const params = [...conv.parameters(), ...dense.parameters()]
console.log("Parameters: " + params.length + " (conv: " + conv.parameters().length + ", dense: " + dense.parameters().length + ")")

// --- Train ---

const epochs = 500
const lr = 0.01

for (let epoch = 0; epoch < epochs; epoch++) {
 let totalLoss = toNode(0)
 for (const { pixels, y } of data) {
   const pred = cnnForward(pixels)
   totalLoss = sum(totalLoss, power(subtract(pred, y), 2))
 }
 totalLoss = multiply(totalLoss, 1 / data.length)
 backward(totalLoss)
 params.forEach(p => { p.value -= lr * p.grad })

 if (epoch % 50 === 0) {
   console.log("Epoch " + epoch + ", loss: " + totalLoss.value.toFixed(6))
 }
}

// --- Results ---

console.log("")
let correct = 0
for (const { label, pixels, y } of data) {
 const pred = cnnForward(pixels)
 const cls = pred.value > 0.5 ? "T" : "L"
 if (cls === label) correct++
 console.log(label + " -> " + cls + " (raw: " + pred.value.toFixed(4) + ")")
}
console.log("\nAccuracy: " + correct + "/" + data.length)

// --- Show the learned kernels ---

console.log("\nLearned kernels:")
conv.kernels.forEach((k, i) => {
 const w = k.map(n => n.value)
 console.log("  Kernel " + i + ": [" + w.map(v => (v >= 0 ? "+" : "") + v.toFixed(3)).join(", ") + "]")
})

// --- Render kernel heatmaps (check 'Show preview') ---

import { renderKernels } from './viz.js'
renderKernels(conv.kernels)
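
The parameter counts printed by the two scripts can be sanity-checked by hand. The sketch below assumes each neuron in the dense layers has one weight per input plus one bias; the layer helper comes from earlier chapters and is not shown here, so treat that as an assumption. The helper names are hypothetical.

```javascript
// Assumption: each neuron in the dense layers has one weight per input
// plus one bias. The layer helper itself comes from earlier chapters.
function denseParams(inputSize, layerSizes) {
  let total = 0
  let prev = inputSize
  for (const size of layerSizes) {
    total += prev * size + size
    prev = size
  }
  return total
}

// Each greyscale kernel has kernelSize^2 weights and 1 bias.
function convParams(kernelSize, outChannels) {
  return outChannels * (kernelSize * kernelSize + 1)
}

// Single kernel: 10 conv + 45 dense
const singleKernelTotal = convParams(3, 1) + denseParams(9, [4, 1]) // 55
// Four kernels: 40 conv + 305 dense
const multiKernelTotal = convParams(3, 4) + denseParams(36, [8, 1]) // 345
```

Under that assumption, most of the extra parameters in the multi-kernel model sit in the dense layer rather than in the kernels themselves.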

What did the model learn?

The component below trains our CNN on the T and L dataset, then lets you draw on an 8×8 grid and classify your drawing in real time.

Obviously, with only 8 sample Ts and Ls in the data, this is not a robust model, but it works surprisingly well. Have a go at drawing your own T or L and see if the model classifies it correctly.

[Interactive component: Draw a T or L and classify it. Click Train model to train a 4-kernel CNN, then draw on the grid to classify.]

The limitations of CNNs

CNNs are a powerful and simple architecture for image data, but they have limitations. They work for images, for example, because the structure is always ‘pixels in a grid’. But what about sequences where the length varies?

Certain kinds of data, such as text, audio and time series, require the model to learn relationships that are not local, across inputs of varying length. A CNN like ours only looks at small local neighbourhoods of a fixed-size input, and has no concept of memory.

For example, consider a sentence like ‘the cat that sat on the mat is hungry’ — the word ‘hungry’ relates to ‘cat’, but they are 7 words apart.

A CNN would need to learn to detect features in both locations and then combine them in the dense layer, which is possible but you can see it would be inefficient and brittle.
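
A small sketch makes the locality problem concrete. It slides a window over the words of the example sentence and checks whether any single window contains both ‘cat’ and ‘hungry’. The function name is hypothetical, and a window over words is only an analogy for a kernel over pixels.

```javascript
const words = 'the cat that sat on the mat is hungry'.split(' ')

// Does any window of the given size contain both words?
function windowSeesBoth(tokens, windowSize, first, second) {
  for (let i = 0; i <= tokens.length - windowSize; i++) {
    const patch = tokens.slice(i, i + windowSize)
    if (patch.includes(first) && patch.includes(second)) return true
  }
  return false
}

console.log(windowSeesBoth(words, 3, 'cat', 'hungry')) // false
console.log(windowSeesBoth(words, 8, 'cat', 'hungry')) // true
```

A window of 3 never sees both words at once; it takes a window of 8, almost the whole sentence, before a single position covers the relationship.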

What we need is a model that can learn sequences.

In the next chapter, we will look at a sequential architecture, and we will continue to examine how models learn and find ways to peek inside and understand their learned representations.