So we’ve discussed what neural networks are and how forward prop and backprop work. The only thing left now is the choice of activation function. We’ve already discussed the sigmoid activation. In practice, modern neural networks don’t use sigmoid in all the layers; typically, only the last layer uses sigmoid activations, and only when the problem is a binary classification one. Similarly, when our problem is multi-class classification, we use the softmax activation function, and the gradients are similar to what we derived when discussing softmax regression.

So what about the middle layers? Remember how in the beginning, we said that one of the reasons that deep neural networks (deep simply means that there are many layers) are popular is the ReLU activation function? Let’s look at it. The ReLU (Rectified Linear Unit) is defined as

$$\text{ReLU}(z) = \max(0, z)$$

So it’s essentially a linear layer, except that the negative side has been “rectified”: forced to zero. That one change makes this function non-linear, and makes neural networks work much better. And if you’re wondering why, so is everyone else. At this point, we’re still working out how and why neural networks work as well as they do, and it’s an active research area. In my opinion, this recent paper does a good job of explaining the success of ReLU. Essentially, it says that as you build deeper networks using ReLU activations, you can get many more piece-wise linear regions. So suppose you had a complex function. This could be a decision boundary in classification problems, or the function you want to regress over (in regression problems). Any continuous function can be approximated arbitrarily well using many piece-wise linear pieces. So the deeper you build your network, the more piece-wise linear regions you get, and the better you can approximate your function.
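To make this concrete, here’s a minimal sketch (the tiny function `f` and its weights are made up for illustration) of how summing just a few ReLU units already produces several piece-wise linear regions:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# A tiny hand-picked 1-D "network": a weighted sum of three ReLU units.
# Each unit contributes one kink, so the output has four linear pieces.
def f(x):
    return relu(x) + relu(x - 1.0) - 2.0 * relu(x - 2.0)

# Slopes on the four regions: 0 (x < 0), 1 (0 < x < 1),
# 2 (1 < x < 2), and 0 again (x > 2)
f(-1.0), f(0.5), f(1.5), f(3.0)  # -> (0.0, 0.5, 2.0, 3.0)
```

With more units and more layers, the number of such regions grows quickly, which is the intuition behind the argument above.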

So why couldn’t we have just used a simple linear activation function, $g(z) = z$? Let’s see what happens after two layers with this activation:

$$a^{[2]} = W^{[2]}\left(W^{[1]}x + b^{[1]}\right) + b^{[2]} = W^{[2]}W^{[1]}x + W^{[2]}b^{[1]} + b^{[2]}$$

The first term is simply some new matrix, say $W' = W^{[2]}W^{[1]}$, times $x$. And the second term, $W^{[2]}b^{[1]} + b^{[2]}$, is a vector whose dimensions equal those of $b^{[2]}$. So even with two layers, we’ve ended up with the equivalent of what you’d do with just one layer.
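Here’s a quick numerical check of this collapse (a sketch with random weights; the layer shapes are chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two layers with the identity "activation": a2 = W2 @ (W1 @ x + b1) + b2
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal(2)
x = rng.standard_normal(3)

two_layers = W2 @ (W1 @ x + b1) + b2

# The equivalent single layer: one new matrix and one new bias vector
W_new = W2 @ W1        # shape (2, 3)
b_new = W2 @ b1 + b2   # same shape as b2
one_layer = W_new @ x + b_new

np.allclose(two_layers, one_layer)  # -> True
```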

Now you might notice that the ReLU activation isn’t differentiable at 0, which is a problem since we need to compute its gradient during backprop. During implementation, we arbitrarily set the gradient at 0 to either 0 or 1, and it doesn’t really matter which you choose: in practice, the input is essentially never exactly 0.
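In code, the choice at 0 often falls out of how you write the comparison. A sketch (`relu_grad` is a hypothetical helper name, not a standard API):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # 1 where z > 0, 0 where z < 0; at z == 0 the comparison
    # z > 0 is False, so we've implicitly chosen 0 there
    return (z > 0).astype(float)

z = np.array([-2.0, 0.0, 3.0])
relu(z)       # -> [0., 0., 3.]
relu_grad(z)  # -> [0., 0., 1.]
```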

There are other activation functions as well:

- The hyperbolic tangent, $\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$, is sometimes used instead of sigmoid. It has the same S-shape, but its outputs lie in $(-1, 1)$ rather than $(0, 1)$.

- A recently proposed activation by Dr. Snehanshu Saha is the Saha-Bora activation function (SBAF): $y = \frac{1}{1 + kx^{\alpha}(1-x)^{1-\alpha}}$. This function is flexible because you can change its shape with its two parameters, $k$ and $\alpha$, and under certain conditions, it can approximate the sigmoid and the ReLU activations rather well.
- Another proposed activation is the A-ReLU (approximate ReLU), $f(x) = kx^n$ for $x \ge 0$ and $0$ otherwise. The parameters $k$ and $n$ can be tweaked to approximate ReLU, while making sure that the function is everywhere differentiable.
- Another activation function sometimes used is the Exponential Linear Unit (ELU). It is defined as

$$\text{ELU}(z) = \begin{cases} z & z > 0 \\ \alpha\left(e^z - 1\right) & z \le 0 \end{cases}$$
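As a sketch, the ELU is straightforward to write down in numpy (`alpha` is its parameter; 1.0 is a commonly used default):

```python
import numpy as np

def elu(z, alpha=1.0):
    # z on the positive side, alpha * (e^z - 1) on the negative side;
    # unlike ReLU, the negative side is smooth and saturates at -alpha
    return np.where(z > 0, z, alpha * (np.exp(z) - 1.0))

elu(np.array([-10.0, 0.0, 2.0]))  # -> approximately [-0.99995, 0., 2.]
```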

## Vanishing gradient problem

Let’s look at the sigmoid activation function now. Specifically, let’s look at its derivative:

$$\sigma'(z) = \sigma(z)\left(1 - \sigma(z)\right)$$

What happens when $z$ gets large? Then $\sigma(z) \to 1$, and therefore $1 - \sigma(z) \to 0$, and we have $\sigma'(z) \to 0$. Similarly, when $z$ is very small (that is, high in magnitude, but negative), then $\sigma(z) \to 0$, so again $\sigma'(z) \to 0$. This is a problem. But there’s another problem. What’s the maximum value of the gradient? A little calculus helps here:

$$\frac{d}{dz}\sigma'(z) = \sigma'(z)\left(1 - 2\sigma(z)\right) = 0 \implies \sigma(z) = \frac{1}{2} \implies z = 0$$

You should compute the second derivative of $\sigma'$ to convince yourself that this is a maximum. Therefore, the maximum value of the gradient is $\sigma'(0) = \frac{1}{2}\left(1 - \frac{1}{2}\right) = \frac{1}{4}$. So as we perform backprop, layer by layer, these gradients will get multiplied and thus get smaller and smaller. The deeper the network, the smaller the gradients, and the bigger this problem. This problem is called the **vanishing gradients problem**. This is precisely why networks using only the sigmoid activation have trouble learning well.
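You can see the effect numerically (a sketch; the 10-layer figure is just for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

sigmoid_grad(0.0)  # -> 0.25, the maximum possible value
sigmoid_grad(5.0)  # -> ~0.0066, already tiny away from 0

# Even in the best case, backprop through 10 sigmoid layers
# multiplies in a factor of at most 0.25 ** 10:
0.25 ** 10         # -> ~9.5e-07
```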

ReLU technically doesn’t suffer from this problem, since its gradients are either 0 or 1. But it has its own curious problem, called the **dying ReLU problem**. In this situation, the inputs to a particular neuron are such that its output is always 0, and so the gradient is also 0 and this neuron never really learns (since during the weight update, the gradient is 0). This is rare, but when it happens, it can render a network less effective.

To make sure this never happens, an activation function called the **leaky ReLU** was proposed. Rather than force the output to 0 on the negative side, we allow a small amount to “leak”: when $z < 0$, we output $g(z) = 0.01z$, and we could use any small constant instead of 0.01. Now how do we choose that constant? We could certainly try some values and see what seems to work best. Alternatively, we could learn that constant using gradient descent! That’s exactly what Kaiming He et al. proposed in 2015, in their stunning paper, “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”. Beyond mentioning that this is how PReLU (Parametric ReLU) works, however, we won’t discuss it further. Like we said before, we can’t go down every rabbit hole.
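A sketch of the leaky ReLU (PReLU has the same form, except that the slope is a learned parameter rather than a fixed constant):

```python
import numpy as np

def leaky_relu(z, slope=0.01):
    # slope * z on the negative side instead of 0, so the gradient
    # there is `slope` rather than 0 and the neuron can't "die"
    return np.where(z > 0, z, slope * z)

leaky_relu(np.array([-3.0, 0.0, 2.0]))  # -> [-0.03, 0., 2.]
```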

## Wrapping up

Now we’ve seen a couple of other activation functions and when they’re used. Next, we’ll look at some other aspects of training and working with neural networks.