META: The code block editor wasn't very friendly and ate up all of my tabs. I'm working on better formatting, and this'll probably end up being a post on my own blog later on, which will hopefully also have things like syntax highlighting.
Fixed the code blocks for you. The trick was to press CTRL+Shift+V
to make sure that your browser doesn't try to do any fancy formatting to your code. Sorry for the inconvenience.
Oh, wow! I didn't realize that could have been tripping things up. Thank you for the formatting help!
Introduction
This post is an attempt to explain how to write a neural network in Python using numpy. I am obviously not the first person to do this. Almost all of the code is here adapted from Michael Nielsen's fantastic online book Neural Networks and Deep Learning . Victor Zhou also has a great tutorial in Python . Why am I trying to do the same? Partially, it's for my own benefit, cataloging my code so I can refer back to it later in a form more captivating than a mere docstring. Also partially, I think I can share a few intuitions which make the backpropagation equations a lot easier to derive.
Okay, so here's a typical picture of a neural network:
If you're new to all this: A neural network is a function that takes in an input vector (or matrix) and outputs another vector (or matrix). The input starts at the leftmost vertical layer of nodes and then gets transformed, via a series of operations, to the rightmost vertical layer of nodes. Each layer is a linear combination of the layer before it, followed by an activation function, which is applied to each node. In other words, a neural net is parameterized by a series of weight matrices W1,W2,.., a series of bias vectors, b1,b2,..., and an activation function a (typically a nonlinear function like tanh(x) which is applied element-wise).
The typical picture, while good for representing the general idea of a neural net, does not do a good job of showing the different operations being performed. I prefer representing a neural net as a computational graph, like below:
Here, it's clearer to see how each node is a function of the step before it. A normal three-layer neural network is given by the following composition of functions:
f0=X=input
f1=W1⋅f0+b1
f2=a(f1)
f3=W2⋅f2+b2
f4=a(f3)=^Y=predicted output
This recursive definition will make it easy to derive the backpropagation algorithm, which we'll use to train our network. It also allows us to easily unroll the function, if we want to see what's going on in on line, by substituting until we get back to the input:
f4=a(W2⋅(a(W1⋅f0+b1)+b2))
And of course, if our neural network has more than three layers, we just add more recursively defined functions.
Forward Pass
The process of taking an input and going through the process of matrix multiplications, vector additions, and activation functions to get the output is referred to as the **forward pass**. To get started, let's write a neural net class that can perform a forward pass, given the dimensions for each layer and an activation function:
A quick explanation: our class takes in an array of layer sizes and creates appropriate weight matrices with random values from 0 to 1. EX:
[20, 50, 10]
would result in weight matrices of dimensions 20×50 and 50×10. For theforward_pass
function, we can see from the computational graph that each matrix multiplication (and vector addition) is followed by an application of the activation function, so we can simply recurse until we go through all of our weight matrices.Loss Function
So far, I haven't explained how the neural net is supposed to actually work. Say we have some data and their associated target values (EX: different measurements of divers and how long they can hold their breath). Using the above code, even if we get the dimensions of the input/output right, our forward pass is going to give us garbage results.
This is because we randomly initialized our weights and biases. We don't want *any* set of weights and biases, but a "good" set of weights and biases. In doing so, we now need to define what we mean by "good". At the very least, it seems that a good set of weights and biases should lead to predicted values which are close to the associated target values, for most of the data we have.
This is where loss functions come in. They take in as input our predicted value and the true value and output a measure of just how far apart the two values are. There are many functions we could choose to measure the distance between Y and ^Y. For ease of explanation, we'll go with L2 norm of their difference, i.e. the sum of the squares of their differences.
Now, after taking a forward pass, we can use our loss function to tell us just how far away our predicted value is from the true value. It's easy to add this change to our class:
We can easily integrate this into our model by adding it to our computational graph:
Now we've added one more step:
f0=X=input
f1=W1⋅f0+b1
f2=a(f1)
f3=W2⋅f2+b2
f4=a(f3)=^Y=predicted output$
f5=∥f4−Y∥2
Now that we have our loss function defined, we can begin the work of actually optimizing our network because we have an answer to the question, "Optimizing with respect to what?"
Recall that our network is parameterized by a set of weights and biases. (There's also the activation function, but that's more of a fixed thing we can't really fine-tune.)
Backpropagation
Backpropagation allows us to figure out how much each weight and bias is responsible for the loss function. We do this by taking partial derivatives of the loss function with respect to each weight matrix and bias vector. Given that a neural net is just a big composite function, we'll be using the Chain Rule a lot. This is where the recursive notation shines. It's much easier to have a placeholder like $f_3$ than a big clump of nested parentheses.
The reason we are taking partial derivatives at all is because they'll allow us to perform iterative optimization, e.g. gradient descent, on our network, which is how the "training" happens.
We'll start with the biases b1,b2,... first:
First, let's find ∂f5∂b2
From above, we've already defined f6 to be the loss function applied after a forward pass, so that's why we're taking the partial derivative of f5 with respect to b2. Note below that $a'$ is the derivative of the activation function.
∂f5∂b2=∂f5∂f4∂f4∂b2=2(^Y−Y)∂f4∂b2
∂f4∂b2=∂f4∂f3∂f3∂b2=a′(f3)∂f3∂b2
∂f3∂b2=1
Thus, ∂f5∂b2=2(^Y−Y)a′(f3).
Next, let's find ∂f5∂b1
(Below, I've omitted the intermediary step of showing ∂g∂x=∂g∂f∂f∂x.)
∂f5∂b1=2(^Y−Y)∂f4∂b1
∂f4∂b1=a′(f3)∂f3∂b1
∂f3∂b1=W2⋅∂f2∂b1
∂f2∂b1=a′(f1)∂f1∂b1
∂f1∂b1=1
Thus ∂f5∂b1=2(^Y−Y)a′(f3)⋅W2⋅a′(f1).
Before we go any further, there are a two useful things to notice:
1. The loss function's derivative (in this case, 2(^Y−Y)) will always be the first term in the partial derivative of the loss with respect to any weight or bias.
2. The partial derivatives of the bias vectors is recursively defined. ∂L∂bn−1=∂L∂bn⋅Wn⋅a′(zn−1) where zc is defined to be the result of Wc⋅f2c−2+bc. In other words, zcis the result of multiplying the previous layer by the cth weight matrix and adding the cth bias vector. We let L represent the general loss function, applied after an arbitrary number of layers.
Let's do the weight matrices W1,W2,... next:
First, let's find ∂f5∂W2
∂f5∂W2=2(^Y−Y)∂f4∂W2
∂f4∂W2=a′(f3)∂f3∂W2
∂f3∂W2=f2=a(f1)
Thus, ∂f5∂W2=2(^Y−Y)a′(f3)f2=∂f5∂b1a(f1).
Now we find ∂f5∂W1
∂f5∂W1=2(^Y−Y)∂f4∂W1
∂f4∂W1=a′(f3)∂f3∂W1
∂f3∂W1=W2⋅∂f2∂W1
∂f2∂W1=a′(f1)∂f1∂W1
∂f1∂W1=f0=X
Thus ∂f5∂W1=2(^Y−Y)a′(f3)⋅W2⋅a′(f1)f0=∂f5∂b1f0
Here, in both partial derivatives, we see something useful: The partial derivative of the loss function with respect to a weight matrix can be calculated in part using the partial derivative of the loss function with respect to the bias vector in the same layer. The extra term we need is the activation function applied element-wise to the layer before it.
In other words: $\frac{\partial \text{L}}{\partial W_n} = \frac{\partial \text{L}}{\partial b_n} a(z_{n-1})$.
Thus, as long as we store both the results of $z_c$ and $a(z_c)$ during a forward pass operation, we'll have most of the information we need to calculate the partial derivatives.
We're now ready to write the code:
To start with we perform a forward pass. Along the way, we store the results in
activations
andz
. One small caveat: we start withx
inactivations
as well because our recursive definition bottoms out at the input value, so we need for the gradients at the first layer.Now, we go backwards and recursively calculate our gradients:
deltas
is a list holding $\frac{\partial L}{\partial b_c}$ values. The first case handles the derivative of the loss function. We pass it inactivations[-1]
which represents the output of our neural net (as it's the activation of the last layer) and multiply it byactiv.deriv
, the derivative of the activation function (which we assume we've defined elsewhere).Otherwise, we follow the recursive formula from earlier and multiply the previous
delta
value by the next weight matrix and we multiply it by a′of the next znvalue. To get the ∂L∂bc value, we simply take the current value of delta (and sum up if our input was a matrix rather than a vector). To get the ∂L∂Wc value, we follow the recursive formula and perform one more matrix multiplication (we indexactivations`by [-2-i] because we added `x
as an extra value when starting out).And we're done! We've now calculated the partial derivatives for all the weights and biases. Next time, we'll dive into different optimization methods and go over how to put these gradients to use.