Stochastic Gradient Descent (SGD) with momentum is a popular variant of the basic SGD algorithm, which accelerates the convergence toward the minimum of the loss function, especially in directions with persistent gradients.
To understand how different formulations of SGD with momentum are equivalent, let’s first define the basic equations and then delve into their equivalence.
The SGD with momentum algorithm updates the parameters $\theta$ using a velocity term $v$ that accumulates an exponentially decaying sum of past gradients:

Momentum update:
$$v_t = \beta v_{t-1} - \eta \nabla L(\theta_t)$$

Parameter update:
$$\theta_{t+1} = \theta_t + v_t$$

Here, $\beta$ is the momentum coefficient, $\eta$ is the learning rate, and $\nabla L(\theta_t)$ is the gradient of the loss with respect to the current parameters.
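As a quick translation of these two updates into code, here is a minimal sketch of a single step. The gradient value, momentum coefficient, and learning rate below are made-up placeholders chosen purely for illustration:

import numpy as np

# Made-up values purely for illustration
theta = np.array([0.0, 0.0])      # current parameters
velocity = np.zeros_like(theta)   # v, initialized to zero
gradient = np.array([0.4, -0.2])  # placeholder for the loss gradient at theta
beta, eta = 0.9, 0.01             # momentum coefficient and learning rate

# Momentum update: v_t = beta * v_{t-1} - eta * gradient
velocity = beta * velocity - eta * gradient
# Parameter update: theta_{t+1} = theta_t + v_t
theta = theta + velocity
print(theta)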
Different formulations of SGD with momentum might look different but are essentially equivalent in functionality. Let's consider two common formulations and show their equivalence:
Formulation 1:
$$v_t = \beta v_{t-1} - \eta \nabla L(\theta_t), \qquad \theta_{t+1} = \theta_t + v_t$$

Formulation 2:
$$u_t = \beta u_{t-1} + \nabla L(\theta_t), \qquad \theta_{t+1} = \theta_t - \eta u_t$$

To understand how these are equivalent, let’s expand formulation 2:

The update $u_t$ accumulates the raw gradients without the learning rate: $u_t = \nabla L(\theta_t) + \beta \nabla L(\theta_{t-1}) + \beta^2 \nabla L(\theta_{t-2}) + \dots$

The parameter update step multiplies this accumulated velocity by the learning rate: $\theta_{t+1} = \theta_t - \eta u_t$.

Expanding the update step of formulation 2:
$$\theta_{t+1} = \theta_t - \eta \left( \nabla L(\theta_t) + \beta \nabla L(\theta_{t-1}) + \beta^2 \nabla L(\theta_{t-2}) + \dots \right)$$

This is exactly the update produced by formulation 1, where $v_t = -\eta u_t$. It shows that the effect of the learning rate $\eta$ is the same in both formulations: whether it scales each gradient as it enters the velocity (formulation 1) or scales the accumulated velocity at the parameter update (formulation 2), each past gradient ends up multiplied by $\eta$, as long as the learning rate is held constant.
The graph above compares the parameter updates over iterations for the two different formulations of SGD with momentum. In this demonstration:

Formulation 1 uses the equations $v_t = \beta v_{t-1} - \eta \nabla L(\theta_t)$ and $\theta_{t+1} = \theta_t + v_t$.

Formulation 2 uses $u_t = \beta u_{t-1} + \nabla L(\theta_t)$ and $\theta_{t+1} = \theta_t - \eta u_t$.

The graph shows that both formulations result in the same trajectory for the parameter updates over iterations, demonstrating their functional equivalence. The key takeaway is that despite the slight difference in how the learning rate $\eta$ is applied, the updates they produce are identical.
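To verify this equivalence numerically, the following sketch runs both formulations side by side on an arbitrary, made-up gradient sequence (chosen purely for illustration) and checks that they produce the same parameters:

import numpy as np

# Arbitrary, made-up gradient sequence purely for illustration
grads = [np.array([0.5, -1.0]), np.array([0.2, 0.3]), np.array([-0.4, 0.1])]
beta, eta = 0.9, 0.1

# Formulation 1: the learning rate scales each gradient inside the velocity
theta1, v = np.zeros(2), np.zeros(2)
for g in grads:
    v = beta * v - eta * g
    theta1 = theta1 + v

# Formulation 2: the learning rate scales the accumulated velocity instead
theta2, u = np.zeros(2), np.zeros(2)
for g in grads:
    u = beta * u + g
    theta2 = theta2 - eta * u

print(np.allclose(theta1, theta2))  # True: both trace the same trajectory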
Let's understand SGD with momentum with the help of the following code:
import numpy as np

# Sample data
X = np.array([[1, 2], [2, 3], [3, 4], [4, 5]])  # Features
y = np.array([3, 5, 7, 9])  # Labels

# Initialize parameters
theta = np.zeros(X.shape[1])
learning_rate = 0.01
momentum = 0.9
iterations = 1000
velocity = np.zeros_like(theta)

# Stochastic Gradient Descent with Momentum
for epoch in range(iterations):
    for i in range(len(y)):
        # Compute prediction
        prediction = np.dot(X[i], theta)

        # Compute the gradient
        gradient = (prediction - y[i]) * X[i]

        # Update velocity
        velocity = momentum * velocity - learning_rate * gradient

        # Update parameters
        theta += velocity

print("Parameters (theta):", theta)
Lines 4–5: Create a sample dataset of features (X) and labels (y) for running SGD.
Lines 8–12: Initialize the different parameters, including theta, learning_rate, momentum, iterations, and velocity.
Lines 15–27: This segment runs SGD with momentum for iterations epochs. Here, we use formulation 1, where learning_rate is multiplied by the gradient. However, both formulations yield the same results, as discussed.
Line 29: We print the parameters theta obtained after the specified number of iterations.
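As an optional sanity check (not part of the original example): the labels in the sample data satisfy y = x1 + x2 exactly, so the unique least-squares solution is [1, 1], and the theta printed by the script above should land close to it after enough iterations. The short sketch below computes that closed-form solution for comparison:

import numpy as np

# Same toy data as in the script above
X = np.array([[1, 2], [2, 3], [3, 4], [4, 5]], dtype=float)
y = np.array([3, 5, 7, 9], dtype=float)

# Closed-form least-squares solution; the SGD result should be close to this
theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print("Closed-form least squares:", theta_ls)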
Both formulations of SGD with momentum are equivalent in how they affect parameter updates. The choice between them often depends on personal preference or specific implementation details in different libraries. The key idea of momentum is to combine the current gradient direction with the previous update direction.
This approach smooths out the updates and can lead to faster convergence.
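As a rough, hedged illustration of this effect, the sketch below reuses the toy dataset from above with a reduced number of epochs (so the difference is visible) and prints the final mean squared error with and without momentum. With these settings the momentum run should reach a noticeably lower error, though the exact numbers depend on the hyperparameters:

import numpy as np

X = np.array([[1, 2], [2, 3], [3, 4], [4, 5]], dtype=float)
y = np.array([3, 5, 7, 9], dtype=float)

def run_sgd(momentum, lr=0.01, epochs=50):
    """Run SGD on the toy data; momentum=0.0 gives plain SGD."""
    theta = np.zeros(X.shape[1])
    velocity = np.zeros_like(theta)
    for _ in range(epochs):
        for i in range(len(y)):
            gradient = (np.dot(X[i], theta) - y[i]) * X[i]
            velocity = momentum * velocity - lr * gradient
            theta += velocity
    return np.mean((X @ theta - y) ** 2)  # final mean squared error

print("Plain SGD loss:        ", run_sgd(momentum=0.0))
print("SGD with momentum loss:", run_sgd(momentum=0.9))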