Activation functions play a decisive role in shaping each neuron's output within a neural network. Depending on the nature of the data and the model, different activation functions are chosen for different layers.
In this Answer, we discuss the Swish activation function.
The Swish function, also known as the self-gated function, is an activation function proposed by researchers at Google. It is built on the sigmoid function and was proposed as a possible improvement over existing activation functions such as ReLU.
We define the Swish activation function as follows:
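For reference, a standard way to write it (with $\sigma$ denoting the sigmoid function and $\beta$ a constant or trainable parameter, commonly set to 1, following the formulation in the original paper) is:

```latex
f(x) \;=\; x \cdot \sigma(\beta x) \;=\; \frac{x}{1 + e^{-\beta x}}
```

When $\beta = 1$, this reduces to $f(x) = x \cdot \sigma(x)$, which is also known as the SiLU (sigmoid-weighted linear unit).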
Swish activation is used in Long Short-Term Memory (LSTM) neural networks, which are used extensively in sequence prediction and likelihood estimation problems.
According to the claims made by Google in the paper that proposed it, the Swish activation function performs better than ReLU in deeper neural networks trained on datasets with complex, interlinked patterns.
Furthermore, the paper reports that replacing ReLUs with Swish units improves top-1 classification accuracy on ImageNet by 0.9% for Mobile NASNet-A and 0.6% for Inception-ResNet-v2. This provides experimental evidence that, even with hyperparameters tuned for ReLU, the network achieves higher accuracy after a simple switch to the Swish activation function.
Most activation functions are monotonic. Swish, by contrast, is not: over a range of negative inputs, the output first decreases and then rises back toward zero, producing a small dip (a slight concavity) in the curve. This non-monotonicity is one of Swish's distinguishing features compared to ReLU.
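As a quick numerical illustration (a minimal sketch using NumPy, with $\beta$ fixed at 1 and a hypothetical `swish` helper), the outputs for increasingly negative inputs first fall and then climb back toward zero:

```python
import numpy as np

def swish(x, beta=1.0):
    # Swish: x * sigmoid(beta * x)
    return x / (1.0 + np.exp(-beta * x))

# For increasingly negative inputs, the output first decreases and then
# rises back toward 0 -- the function is not monotonic.
xs = np.array([-0.5, -1.0, -1.5, -2.0, -4.0, -6.0])
print(np.round(swish(xs), 4))
# approx. [-0.1888 -0.2689 -0.2736 -0.2384 -0.0719 -0.0148]
```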
We compute the gradient of the Swish function as follows:
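For reference, applying the product and chain rules to $f(x) = x \cdot \sigma(\beta x)$ (with $\sigma$ the sigmoid) gives:

```latex
f'(x) \;=\; \sigma(\beta x) + \beta x\, \sigma(\beta x)\bigl(1 - \sigma(\beta x)\bigr)
      \;=\; \beta f(x) + \sigma(\beta x)\bigl(1 - \beta f(x)\bigr)
```

For $\beta = 1$, this simplifies to $f'(x) = f(x) + \sigma(x)\bigl(1 - f(x)\bigr)$.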
The graph of the gradient is shown below:
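Since the plot itself is not reproduced here, the following sketch (assuming NumPy and Matplotlib are available, with $\beta = 1$) regenerates both the Swish curve and its gradient:

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x, beta=1.0):
    return x * sigmoid(beta * x)

def swish_grad(x, beta=1.0):
    # f'(x) = beta * f(x) + sigmoid(beta * x) * (1 - beta * f(x))
    s = sigmoid(beta * x)
    f = x * s
    return beta * f + s * (1.0 - beta * f)

x = np.linspace(-6, 6, 500)
plt.plot(x, swish(x), label="Swish")
plt.plot(x, swish_grad(x), label="Swish gradient")
plt.axhline(0, color="gray", linewidth=0.5)
plt.xlabel("x")
plt.legend()
plt.title("Swish and its gradient")
plt.show()
```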
Here are some of the advantages of using the Swish function:
Here are some of the limitations of the Swish function: