The ancestor of today’s machine learning models was invented in the late 1950s by Frank Rosenblatt, and was named ‘The Perceptron’.
It’s possible to build a perceptron very simply, and doing so is a wonderfully manual, tactile way to develop an understanding of how machine learning works at its core. Today’s most advanced Large Language Models are essentially built out of millions of the modern equivalent of a perceptron.

The perceptron, while brilliant, had a critical flaw. When this was recognised, it led to the first ‘AI Winter’ in the mid 1970s. We will discuss how this was overcome in the next chapter.
Let’s build out a perceptron ourselves so we can understand the roots of everything that was to follow.
Classification
The problem that the Perceptron aimed to solve was slightly different from the one our linear model solved: how to classify things.
For example, how can we train a model to tell if an image is of a cat or a dog? Or to analyse handwriting? We will do this more rigorously later, when we tackle convolutional neural networks and diffusion, but for now, let’s consider a simplified problem.
Let’s imagine we need to train a model to tell the difference between the character L and the character T.
Let’s use 4x4 ‘images’ of these characters, for example:
…and so on.
This is no longer a line of best fit problem, per se: it’s a classification
problem. We want to know: what class is the image - an L or a T?
Each of the 16 pixels can be ‘on’ or ‘off’. So at its heart, this problem is
about deciding whether each pixel being ‘on’ or ‘off’ makes it more or less likely
to be a T or an L.
We can use straight lines to do this, but rather than plotting a line of best
fit between continuous data points as with our linear model, the lines will form
a boundary between the two classes. If a sample falls on one side of the line,
it’s more likely to be a T; on the other, an L.
This line is called the decision boundary.
The Decision Boundary
We can’t easily draw a chart with more than two dimensions ($x$ and $y$), and we clearly need 4x4 = 16 dimensions to capture the on/off state of all 16 pixels.
However, we can illustrate the idea of the decision boundary by taking a 2D sample of the data. We can take a row for $x$ and a column for $y$, and then plot their values for a single T and a single L.
Notice:
- the top row (all ‘on’ in the T, only one ‘on’ in the L)
- the left column (all ‘on’ in the L, only one ‘on’ in the T)
If we add up the values for the row and the column (using ‘on’ = 1, ‘off’ = 0) for the two samples, we get:
| | Top row ($x$) | Left column ($y$) |
|---|---|---|
| T | 1+1+1+1 = 4 | 1+0+0+0 = 1 |
| L | 1+0+0+0 = 1 | 1+1+1+1 = 4 |
Now we can plot the chart - see below.
You’ll see the values for the T cluster at $(4, 1)$, and the values for the L cluster in the opposite corner, at $(1, 4)$.
We can then cleanly separate the T and the L with a straight line to form
our decision boundary.
Now we know that if a new sample’s top row sums to around 4, it’s more likely to be
a T, and if its left column sums to 4, it’s more likely to be an L.
We can go further than this and say that if the value for $x$ (the top row) is greater than 1, it’s likely to be a T, and if the value for $y$ (the left column) is greater than 1, it’s likely to be an L.
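To make the 2D slice concrete, here is a minimal sketch that computes the two sums and picks a side of a boundary. The helper names (`topRowSum`, `leftColumnSum`, `classify2D`) are hypothetical, the L layout is an assumed one that matches the sums in the table, and the diagonal $y = x$ is just one of many lines that would separate the two points.

```javascript
// Two 4x4 samples, flattened row by row ('on' = 1, 'off' = 0).
// The L layout is an assumption chosen to match the sums in the table.
const T = [
  1, 1, 1, 1,
  0, 1, 0, 0,
  0, 1, 0, 0,
  0, 1, 0, 0,
]
const L = [
  1, 0, 0, 0,
  1, 0, 0, 0,
  1, 0, 0, 0,
  1, 1, 1, 1,
]

const SIZE = 4

// x: how many pixels are 'on' in the top row
const topRowSum = (pixels) =>
  pixels.slice(0, SIZE).reduce((sum, p) => sum + p, 0)

// y: how many pixels are 'on' in the left column
const leftColumnSum = (pixels) =>
  pixels.filter((_, i) => i % SIZE === 0).reduce((sum, p) => sum + p, 0)

// Hypothetical decision boundary: the diagonal line y = x.
// Points below the line (x > y) are treated as T, the rest as L.
const classify2D = (pixels) =>
  topRowSum(pixels) > leftColumnSum(pixels) ? 'T' : 'L'

console.log(topRowSum(T), leftColumnSum(T), classify2D(T)) // 4 1 T
console.log(topRowSum(L), leftColumnSum(L), classify2D(L)) // 1 4 L
```

Only two of the sixteen dimensions are used here, so this is an illustration of the boundary idea rather than a usable classifier.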
This example shows a 2-dimensional ‘slice’ of the full 16-dimensional dataset, illustrating what the decision boundary does. Now it’s just a question of scaling up the dimensions. Conceptually, you can understand:
- In 2D, the decision boundary is a line
- In 3D it becomes a plane (like a 2D sheet through the $x$, $y$ and $z$ space)
- In higher dimensions, it becomes a ‘hyperplane’, which is impossible to visualise.
However, the point is: the boundary is straight and it cuts between the feature clusters. The maths for a straight line will work.
So far, so good. We can still use the straight-line equation (written with a weight and a bias as $y = wx + b$) for the decision boundary. This will generate a number that represents a score: how likely the input was to belong to one class or the other. However, we have a couple of adjustments that we need to make to get the equation working for our use-case.
Multiple inputs
Our model needs to capture whether each of the 16 pixels is on or off in a given sample. This is clearly not something we can model with a single input value for $x$ as with our previous model: we need an input for each pixel.
That means the input, $x$, for our model is now an array. The output (prediction), $y$, will be a binary value (0 or 1) indicating whether the input is of class T or class L:
// Our 'T' from above, expressed as an array
const input = [
  1, 1, 1, 1,
  0, 1, 0, 0,
  0, 1, 0, 0,
  0, 1, 0, 0,
]
// x is a flat array of 16 values
// the y returned is 0 or 1
perceptron.predict(input) // e.g. 1 for T
How do we do this?
If $x$ is now an array of values, then we can scale our $y = wx + b$ (or y = weight * x + bias) by adding a weight for each value of $x$.
Since we have 16 values for $x$, we need 16 weights:
let w = [/* ...16 weights */]
let b = 0.5 // Bias
// Our input
let x = [/* ...16 values */]
// Calculate y = wx + b
let wx = (w[0] * x[0]) + (w[1] * x[1]) + (w[2] * x[2]) // ...and so on for all 16
let y = wx + b
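Written out in standard notation, the calculation above is the following sum. This is only a restatement: the index $i$ is introduced here for illustration, and it counts from 1 where the code counts from 0.

```latex
y = \sum_{i=1}^{16} w_i x_i + b
  = w_1 x_1 + w_2 x_2 + \dots + w_{16} x_{16} + b
```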
Let’s sketch this out in code:
function perceptron(dimensions) {
  let weights = Array(dimensions).fill(0)
  let bias = 0
  return {
    // x is an array of 16 pixels
    predict(x) {
      let dotProduct = 0
      for (let i = 0; i < dimensions; i++) {
        // multiply each x pixel by its weight
        // and add it to the sum
        dotProduct += weights[i] * x[i]
      }
      // add the bias to get the weighted sum
      let y = dotProduct + bias
      return y
    }
  }
}
const t = [
  1,1,1,1,
  0,1,0,0,
  0,1,0,0,
  0,1,0,0
]
// create a model with one weight per pixel
const model = perceptron(16)
const prediction = model.predict(t) // outputs weighted sum
So now,
- The predict method will take an input array of the 16 pixels; we will map each pixel to a corresponding weight,
- multiply each pixel by its weight ($w_i x_i$ for each one),
- add all these products together to get the dot product
- add the bias at the end to get the weighted sum
The symbols and terminology muddy the waters a bit, but hopefully the underlying method here is not too hard to grasp.
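A worked example with concrete numbers may help here. The weights below are made up purely for illustration (+1 across the top row, -1 down the rest of the left column); they are not trained values, and `weightedSum` is a hypothetical helper name.

```javascript
// Weighted sum: multiply each input by its weight, add the products
// together (the dot product), then add the bias.
function weightedSum(weights, x, bias) {
  const dotProduct = weights.reduce((sum, w, i) => sum + w * x[i], 0)
  return dotProduct + bias
}

// Hypothetical, hand-picked weights (not the result of training)
const weights = [
   1, 1, 1, 1,
  -1, 0, 0, 0,
  -1, 0, 0, 0,
  -1, 0, 0, 0,
]
const bias = 0.5

const t = [
  1, 1, 1, 1,
  0, 1, 0, 0,
  0, 1, 0, 0,
  0, 1, 0, 0,
]

// Top row: 4 pixels 'on', each with weight 1  -> 4
// Left column below the top row: all 'off'    -> 0
// Plus the bias of 0.5
console.log(weightedSum(weights, t, bias)) // 4.5
```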
Now we are only missing one piece: how to convert our weighted sum into binary classification.
Thresholds
Our model is for classifying whether an input is a T or an L, which is a binary output (i.e. one or the other). But as-is, our weighted sum will be a number like 4.7 or -2.5, which won’t work.
So we need to somehow normalise our output so that it is binary.
The way we do this is to add a threshold:
let dotProduct = ...
let weightedSum = dotProduct + bias // some number, like 4.2
return weightedSum > threshold ? 1 : 0
So what is threshold?
In ML, this is handled through some sort of activation function that takes the raw weighted sum and turns it into something useful (we will come back to this in the next chapter). In the perceptron, the activation function was a simple step function:
/**
 * The step function turns a number into a binary output
 */
function step(x) {
  // In the step function, the threshold can just be 0
  return x > 0 ? 1 : 0
}
So if the weighted sum is negative (or exactly zero), whatever its size, it will output 0.
If it’s positive, it will output 1.
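Putting the step function together with the weighted sum gives a predict method that returns 0 or 1. The following is a minimal sketch that combines the two pieces shown so far; the variable names are illustrative.

```javascript
// Step activation: anything above the threshold of 0 becomes 1
function step(x) {
  return x > 0 ? 1 : 0
}

function perceptron(dimensions) {
  let weights = Array(dimensions).fill(0)
  let bias = 0
  return {
    predict(x) {
      let weightedSum = bias
      for (let i = 0; i < dimensions; i++) {
        weightedSum += weights[i] * x[i]
      }
      // convert the raw score into a binary class
      return step(weightedSum)
    },
  }
}

console.log(step(4.7))  // 1
console.log(step(-2.5)) // 0
console.log(step(0))    // 0 (zero is not greater than the threshold)

const t = [
  1, 1, 1, 1,
  0, 1, 0, 0,
  0, 1, 0, 0,
  0, 1, 0, 0,
]
const model = perceptron(16)
console.log(model.predict(t)) // 0
```

The last line shows why training is needed: with all weights at zero, the weighted sum is 0 for every input, so an untrained model outputs 0 for a T as well.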
This simple binary output means we don’t even need gradient descent to train this model (in fact, backpropagation, the technique that lets gradient descent train multi-layer networks, came later). This is because the binary output is either correct or incorrect, so we can nudge the weights directly.
So, whenever a prediction is wrong, we increase or decrease the weight of every pixel that was ‘on’ in that input, depending on whether the output should have been higher or lower. This changes each pixel’s effect on the output.
If a pixel has a large bearing on the outcome (e.g. if the top right pixel is ‘on’, it’s almost certainly a T), then it can end up with a more significant weight.
// In the training loop
const prediction = model.predict(x)
// Learning without gradient descent:
if (prediction !== actual) {
  if (actual === 1) {
    // should have output 1, so we need
    // to make the weighted sum bigger
    // for this specific input pattern
    weights = weights.map((w, i) => w + x[i])
  } else {
    // should have been 0, make the
    // weighted sum smaller for this
    // specific pattern
    weights = weights.map((w, i) => w - x[i])
  }
}
If the end output should be higher, we want pixels that are “on” (1) to have more positive influence, so we increase their weights. If the output should be lower, we decrease them. Pixels that are “off” (0) add or subtract nothing, so their weights are left untouched.
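The same rule can be written as a small pure function, which makes it easy to trace by hand. This is a sketch: `updateWeights` is a hypothetical name, and a 4-pixel input is used instead of 16 to keep the trace short.

```javascript
// Apply the update rule once. `prediction` is what the model output,
// `actual` is the correct label (1 or 0).
function updateWeights(weights, x, prediction, actual) {
  // correct prediction: leave the weights alone
  if (prediction === actual) return weights
  // wrong: push towards 1 by adding the input, towards 0 by subtracting it
  const direction = actual === 1 ? 1 : -1
  return weights.map((w, i) => w + direction * x[i])
}

const x = [1, 0, 1, 0] // tiny illustrative input
const start = [0, 0, 0, 0]

// Model said 0, should have been 1: weights of 'on' pixels go up
const afterMiss = updateWeights(start, x, 0, 1)
console.log(afterMiss) // [1, 0, 1, 0]

// Model said 1 and was right: nothing changes
const afterHit = updateWeights(afterMiss, x, 1, 1)
console.log(afterHit) // [1, 0, 1, 0]

// Model said 1, should have been 0: weights of 'on' pixels go down
const afterFalseAlarm = updateWeights(afterHit, x, 1, 0)
console.log(afterFalseAlarm) // [0, 0, 0, 0]
```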
Manual training
Training Rosenblatt’s original hardware perceptron was a very manual process, and dialling the weights in yourself is a great way to build an intuition for how the model works.
Try the manual perceptron trainer below. You’re doing the same update rule as the code, just by hand.
The goal is to get the perceptron to output 1 for T and 0 for L. Here are the steps to the algorithm:
- Load a T pattern (or create one) and adjust weights until the output is 1
- Next, load an L pattern (or create one) - does it still work? Again, adjust weights if needed until the output is 0.
- Keep iterating until both patterns are classified correctly
You will notice a couple of things:
- You need very few ‘training loops’ to build a pretty robust model
- You will likely end up with positive weights in the top row (where Ts tend to have pixels) and negative weights in the bottom row and left column (where Ls usually have pixels).
Implementing the Perceptron
We’ve got everything we need to build out a perceptron now, and we have also developed a grip on the maths behind the model, which will stand us in good stead as we move on to more advanced ideas.
Any model is only as good as its data, so let’s start there. We will use 10 samples of each class. We will include samples with ‘noise’ (random pixels) and missing pixels, which will help our model to generalise well.
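As a rough guide to how the pieces fit together, here is a minimal end-to-end sketch. It is an illustration under stated assumptions rather than a definitive implementation: it uses six hand-written samples instead of ten per class, the sample layouts are hypothetical, the bias is left at 0 because the update rule shown earlier only adjusts the weights, and the `train` and `getWeights` names are invented for this example.

```javascript
const step = (x) => (x > 0 ? 1 : 0)

function perceptron(dimensions) {
  let weights = Array(dimensions).fill(0)
  const bias = 0 // assumption: never updated, as in the update rule above

  const predict = (x) =>
    step(weights.reduce((sum, w, i) => sum + w * x[i], 0) + bias)

  // Runs up to `epochs` passes over the data and returns the
  // number of wrong predictions made in each pass.
  function train(samples, epochs) {
    const errorsPerEpoch = []
    for (let epoch = 0; epoch < epochs; epoch++) {
      let errors = 0
      for (const { x, label } of samples) {
        if (predict(x) === label) continue
        errors++
        const direction = label === 1 ? 1 : -1
        weights = weights.map((w, i) => w + direction * x[i])
      }
      errorsPerEpoch.push(errors)
      if (errors === 0) break // every sample classified correctly
    }
    return errorsPerEpoch
  }

  return { predict, train, getWeights: () => [...weights] }
}

// Hypothetical training data: 1 = T, 0 = L
const T = 1
const L = 0
const samples = [
  { label: T, x: [1,1,1,1, 0,1,0,0, 0,1,0,0, 0,1,0,0] },
  { label: L, x: [1,0,0,0, 1,0,0,0, 1,0,0,0, 1,1,1,1] },
  { label: T, x: [1,1,1,1, 0,0,1,0, 0,0,1,0, 0,0,1,0] }, // stem shifted right
  { label: L, x: [1,0,0,0, 1,0,0,0, 1,0,0,0, 1,1,1,0] }, // missing pixel
  { label: T, x: [1,1,1,1, 0,1,0,0, 0,1,0,0, 0,1,0,1] }, // noise pixel
  { label: L, x: [1,0,1,0, 1,0,0,0, 1,0,0,0, 1,1,1,1] }, // noise pixel
]

const model = perceptron(16)
const errorsPerEpoch = model.train(samples, 10)
console.log(errorsPerEpoch) // [2, 0]
console.log(samples.every(({ x, label }) => model.predict(x) === label)) // true
```

On this small dataset the model makes two mistakes in the first pass and none in the second, so training stops early.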
Notes on the code
As we move on to more complex models, it’s worth reinforcing some ideas here.
No gradient descent
The perceptron predates backpropagation and uses a simpler update rule — if wrong, nudge weights directly by adding or subtracting the input values. You can probably sense that this won’t scale well.
Simpler hyperparameters
No learning rate (implicitly 1, since we nudge the weights by 1 or -1), no normalisation (inputs are already 0/1), no MSE loss (just right or wrong). We only need epochs. You saw in the previous chapter that for any level of complexity, hyperparameters are key.
Converges fast
Because the problem is linearly separable and the update rule is direct, we typically hit zero errors within a handful of epochs.
Initialisation is deterministic
With this code, the weights would be identical after every run, because the initialisation is deterministic (all zeros) and the data is presented in the same order each time.
However, there are infinitely many valid decision boundaries for linearly separable data like ours. You might have found with the manual trainer above that there were multiple valid solutions.
Below are several training runs using this code and data, with the data shuffled before each run. You could also generate more examples by initialising the weights to random floats like we did with our linear regression model.
You can see that the model finds a solution, not the solution. As we build more complex models, we can use this principle of non-deterministic initialisation to make them more robust.
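The effect of sample order can be sketched without any randomness by training once on a dataset and once on the same dataset reversed. Reversing is used here as a deterministic stand-in for shuffling; the six samples and the `trainWeights` helper are hypothetical.

```javascript
const step = (x) => (x > 0 ? 1 : 0)

const predict = (weights, x) =>
  step(weights.reduce((sum, w, i) => sum + w * x[i], 0))

// Train from all-zero weights and return the final weights
function trainWeights(samples, epochs = 10) {
  let weights = Array(samples[0].x.length).fill(0)
  for (let epoch = 0; epoch < epochs; epoch++) {
    let errors = 0
    for (const { x, label } of samples) {
      if (predict(weights, x) === label) continue
      errors++
      const direction = label === 1 ? 1 : -1
      weights = weights.map((w, i) => w + direction * x[i])
    }
    if (errors === 0) break
  }
  return weights
}

// Hypothetical data: 1 = T, 0 = L
const samples = [
  { label: 1, x: [1,1,1,1, 0,1,0,0, 0,1,0,0, 0,1,0,0] },
  { label: 0, x: [1,0,0,0, 1,0,0,0, 1,0,0,0, 1,1,1,1] },
  { label: 1, x: [1,1,1,1, 0,0,1,0, 0,0,1,0, 0,0,1,0] },
  { label: 0, x: [1,0,0,0, 1,0,0,0, 1,0,0,0, 1,1,1,0] },
  { label: 1, x: [1,1,1,1, 0,1,0,0, 0,1,0,0, 0,1,0,1] },
  { label: 0, x: [1,0,1,0, 1,0,0,0, 1,0,0,0, 1,1,1,1] },
]

const forward = trainWeights(samples)
const reversed = trainWeights([...samples].reverse())

const classifiesAll = (weights) =>
  samples.every(({ x, label }) => predict(weights, x) === label)

// Both runs separate the data...
console.log(classifiesAll(forward), classifiesAll(reversed)) // true true
// ...but they disagree about the bottom-right pixel
console.log(forward[15], reversed[15]) // -1 1
```

Both sets of weights are valid decision boundaries for this data; which one you get depends only on the order in which the mistakes happened to occur.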
Summary
We’ve built a perceptron from scratch and learned the core mechanics that underpin modern ML:
- Weighted sums: multiplying inputs by weights and adding a bias ($y = wx + b$)
- Activation functions: converting raw outputs into something useful (here, a step function)
- Decision boundaries: straight lines (or hyperplanes) that separate classes
The perceptron works well for problems where a straight line can cleanly divide the classes — problems that are linearly separable.
But what happens when a straight line isn’t enough? In the next chapter, we’ll hit the perceptron’s fatal limitation and see how stacking perceptrons into layers finally cracked the problem.