Neural networks are powerful algorithms for tasks such as recognizing our fluffy companions (Bistro), playing piano (Google), and absolutely crushing human beings at chess (AlphaZero). How cool is that?

However, one of the difficulties of using neural networks is the sheer number of choices you have to make. What model? Convolutional? LSTM? RNN? How many nodes? How many layers? What activation function? Too many choices is a real thing.

This post will analyze some of the most popular and powerful activation functions out there, and there are a lot. But first, the most important question:

What is an Activation Function?

Activation functions change the outputs coming out of each layer of a neural network. For example, in the image below, we apply something called a sigmoid function $(y = \frac{1}{1+e^{-x}})$ to the node's input, which results in a different output. But why?

Sigmoid Graph

Activation functions can regulate the outputs of nodes and add a level of complexity that neural networks without activation functions cannot achieve. If you prefer comparing neural networks to neurons, then activation functions are similar to how a neuron doesn't just spit out the sum of the electrical signals it receives, but strengthens or weakens the output signal based on the strength of the output synaptic connection.

Notation

  • $x$ refers to the input of a node (equal to $\sum_i w_i x_i + b$)
  • $f(x)$ refers to the output of the node after applying an activation function
  • $f'(x)$ is the derivative of the function

Linear

$$f(x) = cx$$$$f'(x) = c$$

Linear Graph

This function scales the input by $c$. If $c = 1$, the output equals the input and the function becomes the identity function.

  • Pros
    • Simple
    • No limit on what the output can be
    • That’s pretty much it
  • Cons
    • Requires choosing the parameter $c$ by hand
    • Lacks complexity
      • This means that we would not be able to apply the neural network to more sophisticated tasks
      • Furthermore, if you really sit down and think about it, a three-layer network has the same expressive power as a one-layer network. Since every layer is linear, the final output is just a linear function of the original input (see the sketch after this list).
    • Derivative is constant
      • Because the derivative is always $c$, the correction signal is the same no matter what the input is, crippling learning
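
To see the "linear layers collapse" argument concretely, here is a minimal NumPy sketch (the matrices and sizes are made up for illustration): composing three linear layers gives exactly the same result as a single linear layer whose weight matrix is the product of the three.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)                                     # toy input vector
W1, W2, W3 = (rng.normal(size=(4, 4)) for _ in range(3))   # toy weight matrices

three_layer = W3 @ (W2 @ (W1 @ x))          # "three-layer" purely linear network
one_layer = (W3 @ W2 @ W1) @ x              # a single equivalent linear layer

print(np.allclose(three_layer, one_layer))  # True
```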

Threshold (Binary Step)

\[ f(x)= \begin{cases} 0 &\text{$x<0$}\\[2ex] 1 &\text{$x\geq 0$} \end{cases} \]

Threshold Graph

This activation function is best used for classification, for example deciding whether a picture shows a cat (so fluffy!) or a bird. However, it should only be used at the output nodes of a neural network, not in the hidden layers.
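
As a quick illustration (a minimal NumPy sketch with made-up scores), the step function simply thresholds whatever value an output node produces into a hard 0/1 decision:

```python
import numpy as np

def binary_step(x):
    # 0 for negative inputs, 1 for zero or positive inputs
    return np.where(x < 0, 0, 1)

scores = np.array([-2.3, -0.1, 0.0, 3.5])   # hypothetical output-node scores
print(binary_step(scores))                   # [0 0 1 1]
```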

  • Pros
    • Best for black and white situations (“yes” or “no”, 1 or 0)
  • Cons
    • Real life isn't that black and white; most situations are not binary.
    • Can only be used at the end of a network, since its derivative is 0 everywhere (and undefined at 0), so no gradient can flow back through it during training.
    • Really only suitable for perceptrons and single-layer networks

Sigmoid (Logistic)

$$f(x) = \frac{1}{1+e^{-x}}$$$$f'(x) = \frac{1}{1+e^{-x}} \left(1-\frac{1}{1+e^{-x}}\right)$$

Sigmoid Graph

This S-shaped function has proven to work well for two- and three-layer neural networks, particularly on classification problems. Notice the hill-shaped derivative of the function, which pushes the network to "move down the hill" toward either side, giving it sharper distinctions when classifying.

The function also "squashes" the input, limiting the output to between 0 and 1 (similar to the binary step function, which is also bounded by 0 and 1, granted it only ever outputs exactly 0 or 1).

  • Pros
    • Amazing for classification problems
    • Nonlinear
      • This nonlinearity gives the network more complexity and allows us to use it for more difficult tasks
    • Used in deep learning models (though only at the output layer, and for good reason; see the cons below)
  • Cons
    • Saturation
      • A neuron is saturated when it mostly outputs values at either extreme of the activation function (in this case 0 or 1).
      • For example, if the neuron's input is very negative, the output after the activation function is close to 0, so the gradient is almost nothing. The neuron is saturated and stops learning.
      • This cripples training and information capacity, which is why the next activation function, tanh, is often preferred
    • Vanishing Gradient Problem
      • If you graph $f'(x)$, you can see that its value is always between 0 and 0.25. Now imagine an n-layer neural network: if you use the sigmoid function in every layer, the gradient gets smaller and smaller as the signal is backpropagated.
      • Why? Backpropagation uses the chain rule, which multiplies n of these small fractions together to compute the gradients, shrinking them exponentially until they are nearly 0. The first layers then receive almost no gradient, which paralyzes the network's learning (a rough numerical illustration follows this list).
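
Here is a rough numerical illustration (a toy sketch, not a real backpropagation implementation): the sigmoid derivative never exceeds 0.25, so even in the best case the chain-rule product across n sigmoid layers shrinks exponentially with depth.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

x = 0.0  # the derivative peaks here at 0.25, so this is the best case
for n_layers in [1, 5, 10, 20]:
    # product of n per-layer derivatives (weights ignored for simplicity)
    print(n_layers, sigmoid_grad(x) ** n_layers)
# 1 0.25   5 ~9.8e-04   10 ~9.5e-07   20 ~9.1e-13
```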

Tanh

$$f(x) = \frac{2}{1+e^{-2x}} - 1 = \tanh(x)$$$$f'(x) = 1 - \tanh^2(x)$$

Tanh Graph

If you think there are similarities between tanh and sigmoid, you are absolutely right! The tanh function is just a scaled and shifted sigmoid: $\tanh(x) = 2\sigma(2x) - 1$, where $\sigma$ is the sigmoid function. Instead of ranging from 0 to 1, the outputs range from -1 to 1. It is mainly used in LSTM networks.

  • Pros
    • Zero-centered outputs
      • Why? Take the same example as before, with a very negative input: the output is near -1 instead of near 0. Because tanh's outputs are centered around 0 (rather than all being positive, as with sigmoid), the activations passed to the next layer are better behaved, which tends to make learning easier.
    • Higher peak derivative (1 instead of sigmoid's 0.25), so backpropagated gradients shrink less per layer and the neuron can distinguish between similar inputs better (see the comparison after this list)
  • Cons
    • Still susceptible to the vanishing gradient problem. RIP
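
For a quick comparison (a minimal NumPy sketch), tanh's derivative peaks at 1.0 while sigmoid's peaks at 0.25, which is why tanh shrinks backpropagated signals less per layer:

```python
import numpy as np

x = np.linspace(-5, 5, 1001)
sigmoid = 1.0 / (1.0 + np.exp(-x))
sigmoid_grad = sigmoid * (1.0 - sigmoid)   # peaks at x = 0
tanh_grad = 1.0 - np.tanh(x) ** 2          # peaks at x = 0

print(sigmoid_grad.max())  # 0.25
print(tanh_grad.max())     # 1.0
```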

ReLU

$$f(x) = \max(0,x)$$\[ f'(x)= \begin{cases} 0 &\text{$x<0$}\\[2ex] 1 &\text{$x\geq 0$} \end{cases} \]

ReLU Graph

ReLU is one of the most popular activation functions out there and is commonly used in deep learning networks for speech recognition and computer vision. The activation function is surprisingly simple: the output is 0 if the input is negative, and the input is returned unchanged if it is positive.

  • Pros
    • Does not saturate for positive inputs, so it is far less susceptible to the vanishing gradient problem
    • One-sided which mimics biological neurons
    • Sparse activation (when only about 50% of the neurons fire) which also mimics how only a fraction of neurons in our brain are active at one time
    • Efficient computation – the formula is trivial to evaluate, which really adds up over days of training
    • Scale invariant – $a\max(0,x) = \max(0,ax)$ for any $a \geq 0$
      • Gives the activation function a kind of universality
  • Cons
    • Wow, that was a ton of pros. Surely there must be strings attached. (Spoiler: there are)
    • Dying ReLU
      • Sounds morbid, it is.
      • The problem is that if a neuron's input is sufficiently negative, its output will always be 0 and so its gradient will always be 0. This is a massive problem because it essentially "kills" the neuron and prevents it from learning (see the sketch after this list).
      • It is so bad that sometimes 40% of a network can die off.
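
Here is a minimal NumPy sketch of a dying ReLU unit (the weights and bias are made up purely to force the effect): once its pre-activation is negative for every input, both the output and the gradient are 0, so gradient descent can never revive it.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # subgradient: 0 for x <= 0, 1 for x > 0
    return (x > 0).astype(float)

rng = np.random.default_rng(0)
inputs = rng.normal(size=(100, 3))       # toy batch of inputs
w = np.array([-1.0, -1.0, -1.0])         # weights after a hypothetical bad update
b = -20.0                                # large negative bias

pre_activation = inputs @ w + b          # negative for every input in the batch
print(relu(pre_activation).max())        # 0.0 -> the unit never fires
print(relu_grad(pre_activation).sum())   # 0.0 -> no gradient, so it never recovers
```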

Leaky ReLU

$$f(x) = \max(ax, x) \text{ where $a$ is a small positive value (e.g. $0.01$)}$$\[ f'(x)= \begin{cases} a &\text{$x<0$}\\[2ex] 1 &\text{$x\geq 0$} \end{cases} \]

Leaky ReLU Graph

This is an attempt to fix the dying ReLU problem: for negative inputs the gradient becomes a small value, $a$, instead of 0. Some report that it helps, but the results are not consistent, and ReLU is still the default choice.
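
A minimal NumPy sketch of Leaky ReLU (using the common but arbitrary choice $a = 0.01$): negative inputs keep a small slope, so the gradient never collapses to exactly 0.

```python
import numpy as np

def leaky_relu(x, a=0.01):
    return np.where(x >= 0, x, a * x)

def leaky_relu_grad(x, a=0.01):
    return np.where(x >= 0, 1.0, a)

x = np.array([-3.0, -0.5, 0.0, 2.0])
print(leaky_relu(x))       # [-0.03  -0.005  0.     2.   ]
print(leaky_relu_grad(x))  # [0.01  0.01  1.    1.  ]
```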

  • Pros
    • Kinda helps with dying ReLU???
  • Cons
    • Results are not consistent

Swish

$$f(x) = x\times \frac{1}{1+e^{-x}}$$$$f'(x) = \frac{1}{1+e^{-x}}+\frac{e^{-x}x}{\left(1+e^{-x}\right)^2}$$

Swish Graph

This newer activation function was published by Google in October 2017. In their experiments, simply replacing ReLU with Swish improved top-1 classification accuracy on ImageNet by 0.9% for Mobile NASNet-A and 0.6% for Inception-ResNet-v2 (both image classification models).

In the paper's benchmark tests, Swish matched or outperformed ReLU and the other ReLU-style units.

Swish ReLU Baseline

However, the advantage of Swish over ReLU is debatable, given the small increase in accuracy and the fact that Swish is more computationally expensive.
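
To make the formulas above concrete, here is a minimal NumPy sketch of Swish and its derivative (the sample inputs are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x):
    return x * sigmoid(x)

def swish_grad(x):
    s = sigmoid(x)
    return s + x * s * (1.0 - s)   # sigma(x) + x * sigma'(x)

x = np.array([-2.0, 0.0, 2.0])
print(swish(x))       # approx. [-0.238  0.     1.762]
print(swish_grad(x))  # approx. [-0.091  0.5    1.091]
```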

  • Pros
    • No dying ReLU
    • Increase in accuracy over ReLU
    • Outperformed ReLU at every batch size tested
  • Cons
    • Slightly more computationally expensive
    • Being new, it has much less of a track record than ReLU, and more issues may surface over time

Conclusion

This was just a brief overview of some of the most well-known activation functions, each of which serves a different purpose depending on the nature of the network. As with tanh and Swish, new activation functions keep being discovered and replacing older ones, getting us closer and closer to the Master Algorithm!