Implementation of Tanh activation function in PyTorch

Activation functions in neural networks decide whether a neuron should be activated, helping the network learn complex patterns during training.

Tanh function

The hyperbolic tangent function (Tanh) is a popular activation function in neural networks and deep learning. It's a scaled and shifted version of the Sigmoid function. Like Sigmoid, it's also S-shaped, but instead of having an output range of 0 to 1, Tanh has an output range of -1 to 1. This means its output is zero-centered, which can lead to better performance in certain neural network architectures by improving convergence during training.

Mathematical formula

The mathematical formula for the Tanh function is as follows:

tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x)) = 2 · sigmoid(2x) − 1

It takes any real value as input and outputs values in the range of -1 to 1. This means that the Tanh function centers the data around 0, which can help in models where negative inputs are as relevant as positive ones. Its shape is very similar to the Sigmoid function's, but stretched vertically to cover the wider output range.
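
As a quick check of these properties, the short sketch below (assuming only a standard PyTorch installation) evaluates torch.tanh on a range of inputs and verifies the scaled-and-shifted Sigmoid identity stated above:

import torch

x = torch.linspace(-5.0, 5.0, steps=11)
y = torch.tanh(x)

# Outputs stay strictly between -1 and 1
print(y.min().item(), y.max().item())

# Tanh as a scaled and shifted Sigmoid: tanh(x) = 2 * sigmoid(2x) - 1
print(torch.allclose(y, 2 * torch.sigmoid(2 * x) - 1))  # True (up to floating-point tolerance)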

Tanh function

The above graph describes the hyperbolic tangent (Tanh) function, mapping inputs between -10 and 10 to outputs from -1 to 1. The curve is S-shaped, transitioning smoothly from -1 to 1, and is symmetrical about the origin, reflecting the zero-centered nature of the Tanh function. This zero-centering is a desirable property in neural networks, as it tends to aid in the convergence of gradient descent during training by evenly distributing the outputs around zero.
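
If you'd like to reproduce a plot like this yourself, the following sketch (assuming matplotlib is installed alongside PyTorch) draws Tanh over the same input range of -10 to 10:

import torch
import matplotlib.pyplot as plt  # assumes matplotlib is available

x = torch.linspace(-10.0, 10.0, steps=200)
y = torch.tanh(x)

plt.plot(x.numpy(), y.numpy())
plt.axhline(0, color="gray", linewidth=0.5)  # horizontal reference line at y = 0
plt.axvline(0, color="gray", linewidth=0.5)  # vertical reference line at x = 0
plt.title("Tanh activation function")
plt.xlabel("Input")
plt.ylabel("Output")
plt.show()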

Implementation of Tanh in a neural network

Let’s see the implementation of the Tanh activation function in the following neural network using PyTorch.

import torch
import torch.nn as nn

# Define the neural network architecture
class SimpleNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)  # Fully connected layer 1
        self.tanh = nn.Tanh()  # Tanh activation function
        self.fc2 = nn.Linear(hidden_size, output_size)  # Fully connected layer 2
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        out = self.fc1(x)  # Apply the first fully connected layer
        out = self.tanh(out)  # Apply the Tanh activation function
        out = self.fc2(out)  # Apply the second fully connected layer
        out = self.sigmoid(out)  # Apply the Sigmoid activation function at the end
        return out

# Define network parameters
input_size = 64  # Number of input features
hidden_size = 128  # Number of neurons in the hidden layer
output_size = 2  # Number of output classes

# Input data
input_data = torch.rand(32, input_size)  # 32 is the batch size
target = torch.randint(0, 2, (32, output_size), dtype=torch.float32)  # Random binary target values

# Create an instance of the SimpleNN model
model = SimpleNN(input_size, hidden_size, output_size)

# Define the loss function (binary cross-entropy loss) and optimizer (Adam)
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# Example training loop
num_epochs = 10  # Define the number of training epochs
for epoch in range(num_epochs):
    # Forward pass
    outputs = model(input_data)
    loss = criterion(outputs, target)  # Compute the loss

    # Backward pass and optimization
    optimizer.zero_grad()  # Clear gradients
    loss.backward()  # Backpropagate to compute gradients
    optimizer.step()  # Update the model parameters

    # Print the loss for each epoch
    print(f'Epoch [{epoch + 1}/{num_epochs}], Loss: {loss.item()}')

Code explanation

  • Line 9: We create an instance of the Tanh activation function using nn.Tanh() and store it as an attribute of the SimpleNN class named self.tanh.

  • Line 15: We apply the Tanh activation function to the output of the first fully connected layer, mapping its values to the range of -1 to 1 (see the short equivalence check below).
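
As a side note, the nn.Tanh() module used here wraps the same elementwise operation that is also exposed as torch.tanh. The small standalone check below (not part of the example above) confirms that the two produce identical results:

import torch
import torch.nn as nn

tanh_module = nn.Tanh()  # module form, convenient as a layer attribute or inside nn.Sequential
x = torch.randn(4, 8)

# The module applies the same elementwise function as torch.tanh
print(torch.allclose(tanh_module(x), torch.tanh(x)))  # True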

For more details on how to build a neural network, take a look at this Educative Answer.

Insights for machine learning engineers:

  • While Tanh is generally superior to the Sigmoid function because it is zero-centered, it still suffers from the vanishing gradient problem for extreme input values, as the short gradient check after this list illustrates. This can make training deep networks with many layers tricky.

  • Even though ReLU and its variants are more popular in deep networks due to their efficiency, Tanh is still prevalent in tasks that benefit from considering both positive and negative inputs distinctly, like in certain NLP tasks or when data is strictly bounded and you need to maintain a zero mean.
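
To make the vanishing-gradient point from the first bullet concrete, here is a minimal autograd sketch (standalone, with a few illustrative input values chosen here) showing how the derivative of Tanh collapses toward zero as the input moves away from the origin:

import torch

# The derivative of tanh(x) is 1 - tanh(x)^2, which approaches 0 for large |x|
for value in [0.0, 2.0, 5.0, 10.0]:
    x = torch.tensor(value, requires_grad=True)
    torch.tanh(x).backward()
    print(f"x = {value:>4}: d(tanh)/dx = {x.grad.item():.6f}")

In a deep stack of Tanh layers, these small derivatives multiply together during backpropagation, which is why the gradient signal can fade before it reaches the earliest layers.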

