# 15 – Neural Networks

Neural networks are quickly becoming omnipresent across many tasks, including image recognition, text summarization and synthesis, and speech recognition. There are a few reasons why they’ve become so popular in recent years: first, we have a lot more data these days than before, which means learning algorithms have more to learn from; second, computers are more powerful now, and hardware support in the form of GPUs makes neural networks significantly faster to train; finally, an “activation function” called ReLU (rectified linear unit) made neural networks significantly better at a lot of tasks.

Because of the depth of the field and the pace at which research in “deep learning”, as it is called, is progressing, it is practically impossible to discuss all of neural networks concisely in anything less than a (rather fat) book. We will certainly not go down every rabbit hole we see; rather, we will look at the foundations of neural networks–what the building blocks are, how learning takes place, and a few other questions that might arise. We will link to external content to help you learn, so that you’re not left dangling, not knowing where to go next. Even so, fair warning: this will be a long post. Let’s start by talking about neural networks.

## Introduction

At a very high level, neural networks are a black box. They take in some inputs, do some magic, and then give you outputs that are extraordinarily accurate.

This is a high-level, don’t-care-about-any-details, diagram of neural networks. You chuck some data at it to train it. Later, you can give it inputs and expect highly accurate outputs. Those circles that you see are typically called nodes or neurons. The idea is to simulate what goes on in the human brain. Arrows in diagrams like this represent weighted connections. Let’s now open that box in the middle.

Inside, you simply have more neurons! These neurons are organized in layers, with connections running from neurons in one layer to neurons in the next. Images like the one above give the impression that every neuron is necessarily connected to every other. This need not be the case. In fact, you could also have connections jump over layers–these are called skip-connections. However, beyond simply mentioning them, we will not discuss them further.

Onwards, then. We have a layer that collects inputs. These input layer nodes do something, and pass the results on to so-called hidden layers. Those hidden layers in turn do something, and pass the results forward, until you get outputs. In theory, you could customize what every neuron in every layer does; in practice, that becomes cumbersome, and we only do such customizations layer-wise, meaning that all neurons in one layer will perform the same operation.

What does each neuron do, then? As you can see in the figure above, each neuron receives several inputs over weighted connections. We specifically say weighted connections because the inputs are not treated equally. A neuron first adds up all the inputs it receives, but as a weighted sum. After this summation, the neuron ends up with a single number. It then computes a function of that number, and the result is the neuron’s final output–the one that it broadcasts to whatever it happens to be connected to in later layers. So, loosely, a neuron does this:

$f(x_1, x_2,\ldots, x_n) = g(w_1 x_1 + w_2 x_2 + \ldots + w_n x_n)$

Those $w$ terms are called the weights. Shortly, we will see that it is those weights that we learn using gradient descent. The function $g$ is called the activation function. This name is partly historical: in the earlier days of neural networks, this function gave an output of 1 if the weighted sum was higher than a set threshold, and gave output 0 otherwise, and the neuron was “activated” if the weighted sum was above that threshold.
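To make this concrete, here is a sketch of what a single neuron computes, assuming the sigmoid as the activation function $g$ (the function names here are ours, not standard):

```python
import numpy as np

def sigmoid(z):
    # a classic activation function: squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(w, x):
    # weighted sum of the inputs, followed by the activation g
    z = np.dot(w, x)  # w_1*x_1 + w_2*x_2 + ... + w_n*x_n
    return sigmoid(z)

# with w = [1, -1] and x = [2, 2], the weighted sum is 1*2 + (-1)*2 = 0,
# and sigmoid(0) = 0.5
print(neuron_output(np.array([1.0, -1.0]), np.array([2.0, 2.0])))  # → 0.5
```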

So, given inputs, the input layer neurons forward those inputs to the first hidden layer, whose neurons compute their weighted sums, compute the activations (the results of the activation function), and pass these to the second hidden layer. The neurons in that layer in turn do the same thing, and so on until you get the outputs. So far so good. This process is called forward propagation, and it is the first step of gradient descent when training a neural network. Recall how in algorithms like linear regression, you had to compute the output of the model, then compute the gradients, then update the weights. We do exactly the same thing here, except that computing the model outputs is a more elaborate process.

We will not use the notation above, though, because it very quickly becomes confusing which weights and which inputs we’re talking about. Here’s the notation we will use instead. $g(\cdot)$ will still represent the activation function; but since the activation used can differ at each layer, we will be explicit about that and write $g^{[l]}(\cdot)$ to denote the activation at layer $l$. We won’t actually care about the outputs of individual neurons; rather, we will look at the outputs of an entire layer (which will, of course, be a vector). We will represent the weighted sums computed by the neurons at a layer by $z^{[l]}$, and the outputs (the activations) by $a^{[l]}$. The inputs will simply be denoted by the vector $x$, and to simplify things, we let $a^{[0]} = x$. At each layer, the weights form a matrix, where row $i$ holds the weights that neuron $i$ of that layer applies to the outputs of the previous layer. We will represent this matrix by $W^{[l]}$. We will denote the number of layers by $L$, and the number of neurons at layer $l$ by $n^{[l]}$.

What do we mean by “number of layers”? In the diagram above, how many distinct layers of neurons do you see? Four, right? But of course it wouldn’t be that easy! The number of layers there is three, because the layer of inputs isn’t counted as an actual layer (those neurons aren’t really doing anything).

Alright, so let’s put that notation to use. At a given layer $l$, the computations performed are:

\begin{aligned} z^{[l]} &= W^{[l]}a^{[l-1]} + b^{[l]} \\ a^{[l]} &= g^{[l]}(z^{[l]}) \end{aligned}

The dimensions here work out: $W^{[l]}$ has dimensions $(n^{[l]}, n^{[l-1]})$, while $a^{[l]}$, $z^{[l]}$, and $b^{[l]}$ all have dimensions $(n^{[l]}, 1)$.

Back to forward propagation now. We first compute the weighted sums, now represented as a matrix multiplication, and then compute the activations from those weighted sums. But what are those $b$ terms? Recall that in linear regression, we added an $x_0=1$ term so that we could include a constant, which allowed us to write $\hat{y} = w^T x$. In neural networks, we don’t add that extra input; instead, we represent the constant term explicitly–we call it the bias. Each neuron in a layer has its own bias, so $b^{[l]}$ is a vector of dimensions $(n^{[l]}, 1)$, which is added element-wise to the result of the multiplication $W^{[l]}a^{[l-1]}$. Now, given all this information, let’s recap what happens at each layer once again, so it really sticks in your head:

• Each neuron computes a weighted sum of its inputs.
• To this weighted sum, it adds a bias.
• It finally computes an activation function of the sum computed above.

The above summarizes what forward propagation in a neural network does. You do this for all layers until you get the outputs.
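The steps above translate almost directly into code. The following is a minimal sketch of forward propagation, assuming sigmoid activations at every layer and small random initial weights (the function and variable names are ours):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(layer_sizes, seed=0):
    # layer_sizes = [n^[0], n^[1], ..., n^[L]]: inputs, hidden layers, outputs
    rng = np.random.default_rng(seed)
    params = []
    for n_prev, n_curr in zip(layer_sizes[:-1], layer_sizes[1:]):
        W = rng.standard_normal((n_curr, n_prev)) * 0.01  # (n^[l], n^[l-1])
        b = np.zeros((n_curr, 1))                         # (n^[l], 1)
        params.append((W, b))
    return params

def forward(x, params):
    a = x  # a^[0] = x
    for W, b in params:
        z = W @ a + b   # weighted sum plus bias
        a = sigmoid(z)  # activation
    return a

params = init_params([3, 4, 2])  # 3 inputs, one hidden layer of 4, 2 outputs
y_hat = forward(np.ones((3, 1)), params)
print(y_hat.shape)  # → (2, 1)
```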

Now that we have the outputs, we need to perform the next step in gradient descent: compute the gradients. We do this via backpropagation. Let’s cover this now.

## Backpropagation

Let’s start off by being clear about what we want to achieve here. What are we computing the gradients of, exactly? Well, we can’t change the inputs. We certainly cannot change the activation function that is used. All that’s left, then, are the weights and the biases. Note how these are key to the output produced by the network (these, and the activation, of course, but more on that later). These are what we compute the gradients with respect to, and then update in gradient descent. Therefore, the parameters of a neural network are the weights and the biases.

We can’t directly compute the gradients with respect to the weights in the first few layers. Because of the way that we computed the outputs, left to right, we have to compute the gradients in the reverse order–from right to left–and that’s what gives this step its name. Let’s now discuss how we compute the gradients. Essentially, we simply use the chain rule. Here’s how we compute the gradients with respect to the weights in the last layer:

$\frac{\partial L}{\partial W^{[L]}} = \frac{\partial L}{\partial a^{[L]}}\frac{\partial a^{[L]}}{\partial z^{[L]}}\frac{\partial z^{[L]}}{\partial W^{[L]}}$

Similarly, we can continue and compute the gradients with respect to the penultimate layer:

$\frac{\partial L}{\partial W^{[L-1]}} = \frac{\partial L}{\partial a^{[L]}}\frac{\partial a^{[L]}}{\partial z^{[L]}}\frac{\partial z^{[L]}}{\partial a^{[L-1]}}\frac{\partial a^{[L-1]}}{\partial z^{[L-1]}}\frac{\partial z^{[L-1]}}{\partial W^{[L-1]}}$

We simply continue this way until we hit the first layer. At first glance, this seems very complicated. As you’ll soon see, though, all of these derivatives are very simple terms, and it becomes quite easy to calculate the gradients that we need for gradient descent. To see that they are indeed simple, let’s consider the problem of binary classification. We will use the binary cross-entropy loss as before. For now, we’ll assume that every activation function is the sigmoid function–a function we’ve encountered before while discussing logistic regression. We’ll compute all the terms in the equation above, and you should be able to carry on the calculations for more layers if you want to.

\begin{aligned} \frac{\partial L}{\partial a^{[L]}} &= \frac{\partial }{\partial a^{[L]}} \left(-y \log a^{[L]}-(1-y)\log (1-a^{[L]}) \right ) \\ &= -\frac{y}{a^{[L]}}+\frac{1-y}{1-a^{[L]}} \\ &= \frac{-y + ya^{[L]}+a^{[L]}-ya^{[L]}}{a^{[L]}(1-a^{[L]})} \\ &= \frac{a^{[L]}-y}{a^{[L]}(1-a^{[L]})} \end{aligned}

That was the first term. Let’s compute the second term.

\begin{aligned} \frac{\partial a^{[L]}}{\partial z^{[L]}} &= \frac{\partial}{\partial z^{[L]}}\frac{1}{1+\exp(-z^{[L]})} \\ &= a^{[L]}(1-a^{[L]}) \end{aligned}

The second step shouldn’t be new to you: we did this already when discussing logistic regression. Notice how this cancels out the denominator from the previous term. Our next term is also easy to compute:
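If you want to convince yourself of the identity $\frac{\partial a^{[L]}}{\partial z^{[L]}} = a^{[L]}(1-a^{[L]})$, a quick finite-difference check works (a sketch; the variable names are ours):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 0.7
a = sigmoid(z)
analytic = a * (1 - a)  # the claimed derivative: a(1 - a)

# central finite difference: (f(z + eps) - f(z - eps)) / (2 * eps)
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
print(abs(analytic - numeric))  # agrees to high precision
```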

\begin{aligned} \frac{\partial z^{[L]}}{\partial a^{[L-1]}} &= \frac{\partial}{\partial a^{[L-1]}} \left( W^{[L]}a^{[L-1]}+b^{[L]} \right ) \\ &= W^{[L]} \end{aligned}

The next term is the same as for the last layer, so we won’t repeat that calculation: all you need to do is change the layer numbers in the superscripts. And now the final piece:

\begin{aligned} \frac{\partial z^{[L-1]}}{\partial W^{[L-1]}} &= \frac{\partial}{\partial W^{[L-1]}} \left( W^{[L-1]}a^{[L-2]}+b^{[L-1]} \right ) \\ &= a^{[L-2]T} \end{aligned}

That transpose comes from a matrix calculus rule. The derivatives of $z^{[l]}$ with respect to the biases are simply 1, so we haven’t written out those calculations. To really let the concept sink in, let’s multiply the terms that we derived above and actually write out the gradients of the loss with respect to our parameters.

$\frac{\partial L}{\partial W^{[L]}} = (a^{[L]}-y)a^{[L-1]T}$

The first part here is a combination of the first two terms. For the last part, we needed $\frac{\partial z^{[L]}}{\partial W^{[L]}}$. We did calculate this, but for the previous layer; the result is the same here, with the superscripts changed to reflect the correct layer. If you’re not fully convinced, write out the chain rule expression for this gradient and compute each term.

Moving on, we have

$\frac{\partial L}{\partial b^{[L]}} = (a^{[L]}-y)$

and

$\frac{\partial L}{\partial W^{[L-1]}} = (a^{[L]}-y)W^{[L]}a^{[L-1]}(1-a^{[L-1]})a^{[L-2]T}$

Okay. That was the last of it. Let’s quickly do a sanity check to make sure that this does make sense. By sanity check, we simply mean ensuring that all the dimensions agree so that the matrix multiplications work. Both the outputs and the targets have dimensions $(n^{[L]}, 1)$, which is thus also the dimension of $a^{[L]} - y$. The dimension of $W^{[L]}$ is $(n^{[L]}, n^{[L-1]})$. So this doesn’t work quite so well. It turns out that the right form for this is:

$\frac{\partial L}{\partial W^{[L-1]}} = \left( W^{[L]T} (a^{[L]}-y) \right) \times \left( a^{[L-1]}(1-a^{[L-1]}) \right) a^{[L-2]T}$

Let’s work out the dimensions here. $W^{[L]T}$ has dimensions $(n^{[L-1]}, n^{[L]})$. $(a^{[L]}-y)$ has dimensions $(n^{[L]}, 1)$, and thus the first part (before the $\times$ symbol) has dimensions $(n^{[L-1]}, 1)$. Now, $a^{[L-1]}(1-a^{[L-1]})$ has dimensions $(n^{[L-1]}, 1)$, and the multiplication proceeds smoothly. Note that the $\times$ symbol denotes element-wise multiplication, not matrix multiplication.
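The same sanity check is easy to run in code. Here is a sketch with arbitrary layer sizes (the `*` operator plays the role of the element-wise $\times$ above):

```python
import numpy as np

n_L, n_Lm1, n_Lm2 = 1, 4, 3  # n^[L], n^[L-1], n^[L-2], chosen arbitrarily

a_L   = np.random.rand(n_L, 1)     # outputs a^[L]
y     = np.random.rand(n_L, 1)     # targets
W_L   = np.random.rand(n_L, n_Lm1)
a_Lm1 = np.random.rand(n_Lm1, 1)
a_Lm2 = np.random.rand(n_Lm2, 1)

dz_L   = a_L - y                                  # (n^[L], 1)
dz_Lm1 = (W_L.T @ dz_L) * (a_Lm1 * (1 - a_Lm1))   # element-wise product
dW_Lm1 = dz_Lm1 @ a_Lm2.T                         # gradient w.r.t. W^[L-1]
print(dW_Lm1.shape)  # → (4, 3), matching W^[L-1]'s (n^[L-1], n^[L-2])
```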

More generally, we have

\begin{aligned} \frac{\partial L}{\partial z^{[l]}} &= \frac{\partial L}{\partial a^{[l]}}\times g^{[l]\prime}(z^{[l]}) \\ \frac{\partial L}{\partial W^{[l]}} &= \frac{\partial L}{\partial z^{[l]}}a^{[l-1]T} \\ \frac{\partial L}{\partial a^{[l-1]}} &= W^{[l]T}\frac{\partial L}{\partial z^{[l]}} \\ \frac{\partial L}{\partial b^{[l]}} &= \frac{\partial L}{\partial z^{[l]}} \end{aligned}
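These four equations translate almost line for line into code. Below is a sketch of backpropagation for a single training example, instantiating $g^{[l]}$ as the sigmoid everywhere and using binary cross-entropy, as in our derivation; the function names are ours, and as a check we compare one gradient entry against a finite difference:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, params):
    # returns the full list of activations a^[0], ..., a^[L]
    activations = [x]
    for W, b in params:
        activations.append(sigmoid(W @ activations[-1] + b))
    return activations

def backprop(x, y, params):
    activations = forward(x, params)
    # for sigmoid + binary cross-entropy, dL/da * g'(z) simplifies to a - y
    dz = activations[-1] - y
    grads = []
    for l in range(len(params) - 1, -1, -1):
        W, b = params[l]
        a_prev = activations[l]               # a^[l-1] in the math
        dW = dz @ a_prev.T                    # dL/dW = dL/dz a^[l-1]T
        db = dz.copy()                        # dL/db = dL/dz
        if l > 0:
            da_prev = W.T @ dz                # dL/da^[l-1] = W^T dL/dz
            dz = da_prev * a_prev * (1 - a_prev)  # times g'(z^[l-1])
        grads.append((dW, db))
    return grads[::-1]

def loss(x, y, params):
    a_L = forward(x, params)[-1]
    return float(-(y * np.log(a_L) + (1 - y) * np.log(1 - a_L)).sum())

rng = np.random.default_rng(0)
params = [(rng.standard_normal((4, 3)), np.zeros((4, 1))),
          (rng.standard_normal((1, 4)), np.zeros((1, 1)))]
x, y = rng.standard_normal((3, 1)), np.array([[1.0]])

grads = backprop(x, y, params)

# finite-difference check on a single weight of the first layer
eps = 1e-6
W0 = params[0][0]
W0[0, 0] += eps; lp = loss(x, y, params)
W0[0, 0] -= 2 * eps; lm = loss(x, y, params)
W0[0, 0] += eps  # restore the original weight
numeric = (lp - lm) / (2 * eps)
print(abs(numeric - grads[0][0][0, 0]))  # should be tiny
```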

If we had multiple training examples and vectorized these computations over the whole batch, the equations would be pretty much the same, with two changes:

\begin{aligned} \frac{\partial L}{\partial W^{[l]}} &= \frac{1}{m} \frac{\partial L}{\partial z^{[l]}}a^{[l-1]T} \\ \frac{\partial L}{\partial b^{[l]}} &= \frac{1}{m} \sum\limits_i\frac{\partial L}{\partial z^{[l]}_i} \end{aligned}
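With $m$ examples stacked as the columns of a matrix, those two averaging changes are all that is needed. A sketch for a two-layer network, again assuming sigmoid activations (the variable names are ours):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

m = 5                                # number of training examples
rng = np.random.default_rng(1)
X = rng.standard_normal((3, m))      # each column is one example
Y = rng.integers(0, 2, (1, m)).astype(float)

W1, b1 = rng.standard_normal((4, 3)), np.zeros((4, 1))
W2, b2 = rng.standard_normal((1, 4)), np.zeros((1, 1))

# forward pass: the bias vectors broadcast across the m columns
A1 = sigmoid(W1 @ X + b1)
A2 = sigmoid(W2 @ A1 + b2)

# backward pass, averaged over the batch
dZ2 = A2 - Y
dW2 = (dZ2 @ A1.T) / m                       # (1/m) dL/dz a^T
db2 = dZ2.sum(axis=1, keepdims=True) / m     # (1/m) sum over examples
dZ1 = (W2.T @ dZ2) * A1 * (1 - A1)
dW1 = (dZ1 @ X.T) / m
db1 = dZ1.sum(axis=1, keepdims=True) / m
print(dW1.shape, db1.shape)  # → (4, 3) (4, 1)
```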

Andrew Ng discusses backpropagation in much more detail in his Coursera course.