Understanding the equivalence of SGD with momentum equations

Stochastic Gradient Descent (SGD) with momentum is a popular variant of the basic SGD algorithm, which accelerates the convergence toward the minimum of the loss function, especially in directions with persistent gradients.

To understand how different formulations of SGD with momentum are equivalent, let’s first define the basic equations and then delve into their equivalency.

Basic equations of SGD with momentum

The SGD with momentum algorithm updates the parameters θ of the model by combining the gradient of the loss function ∇_θ J(θ) with the previous update step. The basic equations are as follows:

  • Momentum update:

    • v_t = γv_{t−1} + η∇_θ J(θ): In this equation, v_t is the current update, γ is the momentum coefficient (usually between 0 and 1), η is the learning rate, and ∇_θ J(θ) is the gradient of the loss function.

  • Parameter update:

    • θ = θ − v_t: This updates the parameters in the direction of the negative gradient, accelerated by the momentum.
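To make these two equations concrete, here is a minimal numeric sketch of a single momentum step; the values for γ, η, the gradient, and the previous update are illustrative, not from the text:

```python
gamma = 0.9    # momentum coefficient (gamma)
eta = 0.1      # learning rate (eta)
grad = 2.0     # hypothetical gradient of J at the current theta
v_prev = 0.5   # previous update v_{t-1}
theta = 1.0    # current parameter value

v = gamma * v_prev + eta * grad  # momentum update: 0.9*0.5 + 0.1*2.0 ≈ 0.65
theta = theta - v                # parameter update: 1.0 - 0.65 ≈ 0.35
print(v, theta)
```

The momentum term keeps a fraction γ of the previous step, so repeated gradients in the same direction accumulate into larger steps.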

Equivalence of different formulations

Different formulations of SGD with momentum might look different but are essentially equivalent in functionality. Let's consider two common formulations and show their equivalence:

  • Formulation 1: v_t = γv_{t−1} + η∇_θ J(θ), followed by θ = θ − v_t. The learning rate scales the gradient inside the momentum update.

  • Formulation 2: v_t = γv_{t−1} + ∇_θ J(θ), followed by θ = θ − ηv_t. The learning rate scales the entire update vector in the parameter step.

To understand how these are equivalent, let’s expand formulation 2:

  1. The update v_t is calculated as the weighted sum of the previous update v_{t−1} and the current gradient.

  2. The parameter update step multiplies v_t by the learning rate η.

Expanding the update step of formulation 2:

θ = θ − ηv_t = θ − η(γv_{t−1} + ∇_θ J(θ)) = θ − ηγv_{t−1} − η∇_θ J(θ)

Substituting u_t = ηv_t turns the formulation 2 recursion into u_t = γu_{t−1} + η∇_θ J(θ), which is exactly the momentum update of formulation 1, so both formulations produce the same sequence of parameter values.

This shows that the effect of the learning rate η and momentum γ on the parameter update is the same in both formulations. The first formulation applies the learning rate directly to the gradient before adding it to the momentum term. In the second formulation, the learning rate is applied to the entire update vector v_t after combining the gradient and the previous momentum.

Comparison between the two formulations

The graph above compares the parameter updates over iterations for the two different formulations of SGD with momentum. In this demonstration:

  • Formulation 1 uses the equation v_t = γv_{t−1} + η∇_θ J(θ) and then updates the parameter with θ = θ − v_t.

  • Formulation 2 uses v_t = γv_{t−1} + ∇_θ J(θ) and updates the parameter with θ = θ − ηv_t.

The graph shows that both formulations result in the same trajectory for the parameter updates over iterations, demonstrating their functional equivalence. The key takeaway is that despite the slight difference in how the learning rate (η) and momentum coefficient (γ) are applied, the overall effect on the parameter update process is the same. This equivalence holds true under the assumption of a constant learning rate and momentum coefficient, and it illustrates how momentum helps in smoothing and accelerating the convergence in gradient-based optimization.
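The trajectory comparison can also be checked numerically. The sketch below is an illustration (not the code behind the original graph): it runs both formulations on a toy one-dimensional objective J(θ) = θ², whose gradient is 2θ, and confirms that the parameter sequences coincide.

```python
import numpy as np

def grad(theta):
    # Gradient of the toy objective J(theta) = theta**2
    return 2.0 * theta

eta, gamma, steps = 0.1, 0.9, 50

theta1, v1, traj1 = 5.0, 0.0, []  # Formulation 1 state
theta2, v2, traj2 = 5.0, 0.0, []  # Formulation 2 state

for _ in range(steps):
    # Formulation 1: v_t = gamma*v_{t-1} + eta*grad, then theta -= v_t
    v1 = gamma * v1 + eta * grad(theta1)
    theta1 -= v1
    traj1.append(theta1)

    # Formulation 2: v_t = gamma*v_{t-1} + grad, then theta -= eta*v_t
    v2 = gamma * v2 + grad(theta2)
    theta2 -= eta * v2
    traj2.append(theta2)

print(np.allclose(traj1, traj2))  # the two trajectories coincide
```

Because v_t in formulation 1 always equals η times v_t in formulation 2, the parameter updates match at every step, up to floating-point rounding.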

Demonstration of SGD with momentum

Let's understand SGD with momentum with the help of the following code:

import numpy as np

# Sample data
X = np.array([[1, 2], [2, 3], [3, 4], [4, 5]]) # Features
y = np.array([3, 5, 7, 9]) # Labels

# Initialize parameters
theta = np.zeros(X.shape[1])
learning_rate = 0.01
momentum = 0.9
iterations = 1000
velocity = np.zeros_like(theta)

# Stochastic Gradient Descent with Momentum
for epoch in range(iterations):
    for i in range(len(y)):
        # Compute prediction
        prediction = np.dot(X[i], theta)

        # Compute the gradient
        gradient = (prediction - y[i]) * X[i]

        # Update velocity
        velocity = momentum * velocity - learning_rate * gradient

        # Update parameters
        theta += velocity

print("Parameters (theta):", theta)

Code explanation

  • Lines 4–5: Create a sample dataset for computing the SGD.

  • Lines 8–12: Initialize the parameters, including theta, learning_rate, momentum, iterations, and velocity.

  • Lines 15–27: This segment runs SGD with momentum for iterations epochs, updating the parameters once per training sample. Here we use formulation 1, where learning_rate is multiplied by the gradient. However, both formulations yield the same results, as discussed.

  • Line 29: We print the parameters theta after all the updates are complete.
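As a quick sanity check (an addition to the original snippet, not part of it), we can verify the learned parameters on the training data. The labels satisfy y = x1 + x2 exactly (e.g., 1 + 2 = 3), so theta should approach [1, 1]:

```python
import numpy as np

# Same data and hyperparameters as the snippet above
X = np.array([[1, 2], [2, 3], [3, 4], [4, 5]])
y = np.array([3, 5, 7, 9])

theta = np.zeros(X.shape[1])
learning_rate, momentum, iterations = 0.01, 0.9, 1000
velocity = np.zeros_like(theta)

for epoch in range(iterations):
    for i in range(len(y)):
        prediction = np.dot(X[i], theta)
        gradient = (prediction - y[i]) * X[i]
        velocity = momentum * velocity - learning_rate * gradient
        theta += velocity

# The labels were generated as y = x1 + x2, so theta should be close to [1, 1]
print(theta)
print(np.dot(X, theta))  # should be close to the labels [3, 5, 7, 9]
```

Because the data is fit exactly by theta = [1, 1], every per-sample gradient vanishes there, so the stochastic updates settle on that point rather than a noise floor around it.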

Conclusion

Both formulations of SGD with momentum are equivalent in how they affect parameter updates. The choice between them often depends on personal preference or specific implementation details in different libraries. The key idea of momentum is to combine the current gradient direction with the previous update direction.

This approach smooths out the updates and can lead to faster convergence.


Copyright ©2025 Educative, Inc. All rights reserved