What is the Swish activation function?

What are activation functions?

Activation functions play a decisive role in determining each neuron's output within a neural network. Depending on the nature of the data and of the model, we use different activation functions in different layers.

In this Answer, we discuss the Swish activation function.

The Swish activation function

The Swish function, also called the self-gated function, is an activation function proposed by researchers at Google as a possible improvement over widely used activation functions such as ReLU.

We define the Swish activation function as follows:

Swish(x) = x · σ(x) = x / (1 + e^(-x)),

where σ denotes the sigmoid function. More generally, Swish(x) = x · σ(βx), where β is either a fixed constant or a trainable parameter; β = 1 gives the form above.

Swish's self-gating is inspired by the sigmoid gating used in Long Short-Term Memory (LSTM) neural networks, which are used extensively in sequence prediction problems.
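As a quick sanity check of the definition above, the following minimal NumPy sketch evaluates Swish at a few points (the helper names here are illustrative, not part of any particular library):

    import numpy as np

    def sigmoid(x):
        # Sigmoid: 1 / (1 + e^(-x))
        return 1.0 / (1.0 + np.exp(-x))

    def swish(x, beta=1.0):
        # Swish(x) = x * sigmoid(beta * x); beta = 1 gives the common form
        return x * sigmoid(beta * x)

    # Swish is roughly linear for large positive inputs and close to
    # (but not exactly) zero for large negative inputs.
    xs = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
    print(swish(xs))  # approx. [-0.0335, -0.2689, 0., 0.7311, 4.9665]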

According to the claims made in the Google paper that proposed it, the Swish activation function performs better than ReLU in deeper neural networks trained on datasets with complex, interlinked patterns.

Furthermore, the paper reports that simply replacing ReLUs with Swish units improves top-1 classification accuracy on ImageNet by 0.9% for Mobile NASNet-A and by 0.6% for Inception-ResNet-v2. This shows experimentally that, even with hyperparameters tuned for ReLU, the networks achieve higher accuracy after the simple switch to the Swish activation function.

ReLU vs. Swish

Most activation functions, including ReLU, are monotonic. Swish is not: over a range of negative inputs, its output decreases as x increases, forming a slight dip in the curve before rising back toward zero. This non-monotonic bump is one of Swish's distinguishing features, and the sketch below illustrates it.
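The sketch below, reusing the illustrative NumPy helpers from the previous snippet, tabulates ReLU and Swish over negative inputs to make the dip visible:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def swish(x):
        return x * sigmoid(x)

    def relu(x):
        return np.maximum(0.0, x)

    # Over [-3, 0], ReLU is identically zero, while Swish first decreases
    # to a minimum of about -0.278 near x = -1.28 and then rises back to 0.
    for x in np.linspace(-3.0, 0.0, 7):
        print(f"x = {x:5.2f}  relu = {relu(x):.4f}  swish = {swish(x):.4f}")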

The Swish function's gradient

We compute the gradient of the Swish function by applying the product rule to x · σ(x):

Swish'(x) = σ(x) + x · σ(x) · (1 - σ(x)) = Swish(x) + σ(x) · (1 - Swish(x)).

The gradient is smooth: it approaches 0 for large negative inputs, approaches 1 for large positive inputs, and is slightly negative over the region where Swish itself decreases.
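As a sketch, again using the illustrative NumPy helpers from earlier, the analytic gradient can be checked against a central finite-difference approximation:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def swish(x):
        return x * sigmoid(x)

    def swish_grad(x):
        # d/dx [x * sigmoid(x)] = sigmoid(x) + x * sigmoid(x) * (1 - sigmoid(x))
        s = sigmoid(x)
        return s + x * s * (1.0 - s)

    eps = 1e-6
    for x in [-2.0, -0.5, 0.0, 1.0, 3.0]:
        numeric = (swish(x + eps) - swish(x - eps)) / (2.0 * eps)
        print(f"x = {x:4.1f}  analytic = {swish_grad(x):.6f}  numeric = {numeric:.6f}")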

Advantages of the Swish function

Here are some of the advantages of using the Swish function:

  • The function keeps the neurons' activations well scaled, which leads to quicker convergence and learning of the neural network.
  • Compared to ReLU, it works better in deep neural networks and in architectures that rely on LSTM-style gating (see the sketch after this list).
  • In deeper neural networks, the gradients computed during backpropagation can become very small, which leads to the vanishing gradient problem with the sigmoid and ReLU activation functions. Swish works around the vanishing gradient problem and therefore still allows learning from small gradient updates.
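To illustrate the drop-in replacement mentioned above, here is a minimal sketch assuming PyTorch is available; nn.SiLU is PyTorch's built-in Swish with β = 1 (also called SiLU), and the layer sizes here are arbitrary:

    import torch
    import torch.nn as nn

    # Two otherwise identical multilayer perceptrons; the only difference
    # is the activation placed between the linear layers.
    relu_net = nn.Sequential(
        nn.Linear(16, 32), nn.ReLU(),
        nn.Linear(32, 32), nn.ReLU(),
        nn.Linear(32, 1),
    )

    swish_net = nn.Sequential(
        nn.Linear(16, 32), nn.SiLU(),  # SiLU == Swish with beta = 1
        nn.Linear(32, 32), nn.SiLU(),
        nn.Linear(32, 1),
    )

    x = torch.randn(4, 16)  # a batch of 4 random 16-dimensional inputs
    print(relu_net(x).shape, swish_net(x).shape)  # both: torch.Size([4, 1])

Swapping only the activation leaves the rest of the training setup unchanged, which is what makes the ImageNet comparison above a like-for-like experiment.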

Limitations of the Swish function

Here are some of the limitations of the Swish function:

  • The function is more computationally expensive than ReLU because it involves evaluating a sigmoid, and this cost adds up in deeper layers with large parameter dimensions.
  • Because the function is non-monotonic for negative inputs close to zero, increasing a neuron's weighted input in that region can actually decrease its output, which makes the unit's behavior less intuitive there.
