How to solve math puzzles with gradient descent using PyTorch

Math puzzles are an excellent way to improve problem-solving skills and mental agility. Let’s explore how to solve math puzzles using autograd in PyTorch. Autograd is a powerful PyTorch feature that calculates gradients automatically, which makes it a handy tool for solving mathematical problems.

Autograd

Autograd is an automatic differentiation tool in PyTorch. It works by keeping track of the operations performed on tensors and then applying the chain rule of calculus to compute the gradients of the resulting functions.
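
As a minimal illustration (separate from the puzzle below), the following sketch shows autograd computing the derivative of y = x^2 + 2x at x = 3:

import torch
x = torch.tensor(3.0, requires_grad=True)
y = x**2 + 2*x   # operations on x are recorded by autograd
y.backward()     # apply the chain rule to compute dy/dx
print(x.grad)    # dy/dx = 2x + 2 = 8 at x = 3, so this prints tensor(8.)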

Puzzle

Imagine a network engineer responsible for optimizing the performance of a data transmission system. The engineer needs to analyze the latency and throughput of the network to ensure efficient data transfer. The following equations represent the relationships involved:

A + B = 8

In this equation, A represents the bandwidth (in Mbps) allocated for video streaming, and B represents the bandwidth (in Mbps) allocated for file downloads. The equation states that the sum of the allocated bandwidths should be 8 Mbps, ensuring that the total available bandwidth meets the required specifications for concurrent video streaming and file downloads.

C - D = 1

Here, C represents the latency (in ms) experienced during data transmission, and D represents the maximum tolerable latency (in ms). The equation indicates that the actual latency exceeds the maximum tolerable latency by one millisecond.

A + C = 7

This equation states that the sum of the bandwidth allocated for video streaming and the latency should be equal to 7, ensuring that the network performance meets the desired requirements for video streaming.

B - D = 2

This equation indicates that the allocated bandwidth for file downloads should be 2 Mbps higher than the maximum tolerable latency, allowing for faster data transfer without exceeding the acceptable delay limits.

Solution

We aim to find the values of variables A, B, C, and D that satisfy all four equations simultaneously. We can rewrite our equations as follows:

A + B - 8 = 0
C - D - 1 = 0
A + C - 7 = 0
B - D - 2 = 0

Figure: Flow chart of the solution using gradient descent

To solve these equations, we formulate an optimization problem. We define a loss function L that measures the error between the left-hand side and right-hand side of each equation. The loss function is defined as the sum of the squared errors of all equations. The squared error is used to ensure that both positive and negative errors contribute to the overall loss.

L(A, B, C, D) = (A + B - 8)^2 + (C - D - 1)^2 + (A + C - 7)^2 + (B - D - 2)^2

A, B, C, D = argmin_{A, B, C, D} L(A, B, C, D)

Our objective is to reduce this error to zero, enabling us to determine values for the variables that fulfill the equations. To accomplish this, we employ the gradient descent optimization algorithm, which minimizes the loss function and identifies the values of A, B, C, and D that result in the least amount of error.
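
To make the updates concrete, the partial derivatives of L can be worked out by hand from the loss above (in the code that follows, autograd computes them for us):

∂L/∂A = 2(A + B - 8) + 2(A + C - 7)
∂L/∂B = 2(A + B - 8) + 2(B - D - 2)
∂L/∂C = 2(C - D - 1) + 2(A + C - 7)
∂L/∂D = -2(C - D - 1) - 2(B - D - 2)

Each variable is then updated in the direction of the negative gradient, for example, A ← A - lr · ∂L/∂A, where lr is the learning rate; B, C, and D are updated in the same way.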

Implementation of PyTorch’s gradient descent

Let’s see how to use PyTorch to implement our approach and determine the values of the variables. In the code below, we use a built-in optimizer, stochastic gradient descent (SGD), from the PyTorch library to minimize the loss function.

import torch
import random
random.seed(42)
# random initialization of tensors
A = torch.tensor(random.random(), requires_grad=True)
B = torch.tensor(random.random(), requires_grad=True)
C = torch.tensor(random.random(), requires_grad=True)
D = torch.tensor(random.random(), requires_grad=True)
# learning rate
lr = 0.1
# defining the optimizer
optimizer = torch.optim.SGD([A, B, C, D], lr=lr)
while True:
    optimizer.zero_grad()
    y1 = A + B - 8
    y2 = C - D - 1
    y3 = A + C - 7
    y4 = B - D - 2
    # loss function
    loss = y1 * y1 + y2 * y2 + y3 * y3 + y4 * y4
    # calculation of gradients
    loss.backward()
    # updating the variables
    optimizer.step()
    # setting the threshold
    if loss < 1e-10:
        break
    else:
        print(loss)
print(f"The value of A is {A.item()}")
print(f"The value of B is {B.item()}")
print(f"The value of C is {C.item()}")
print(f"The value of D is {D.item()}")

Note: With this random seed, the bandwidth allocated for video streaming converges to 5.2790 Mbps, and the bandwidth allocated for file downloads to 2.721 Mbps. The latency experienced during data transmission is 1.721 ms, and the maximum tolerable latency is 0.721 ms. Because the fourth equation is a linear combination of the first three, the system has infinitely many solutions; gradient descent settles on one of them, determined by the random initial values.
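
As a quick sanity check, we can plug these values (rounded) back into the four equations; a small sketch:

# values taken from the note above (rounded)
A, B, C, D = 5.279, 2.721, 1.721, 0.721
print(A + B, C - D, A + C, B - D)  # approximately 8, 1, 7, 2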

Code explanation

  • Lines 9–12: We define the learning rate (0.1) and initialize the stochastic gradient descent (SGD) optimizer over the variables A, B, C, and D.

  • Lines 13–20: We use a while loop that runs until the loss falls below a threshold of 1e-10. In each iteration of the loop, the gradients are reset using optimizer.zero_grad(). Then, the residuals y1, y2, y3, and y4 are calculated from the current values of A, B, C, and D; each residual is the difference between the left-hand side and right-hand side of one equation. The loss is calculated as the sum of the squared residuals.

  • Lines 21–24: We compute the gradients using loss.backward(), and the optimizer updates the values of A, B, C, and D using optimizer.step().

  • Lines 25–33: If the loss falls below the threshold, the loop breaks, and the final values of A, B, C, and D are printed. Otherwise, we print the current value of the loss to track its gradual decrease until it reaches the threshold.

Implementation of gradient descent from scratch

The code above remains largely unchanged, except for the implementation of gradient descent, which we’ll now do from scratch instead of using the SGD optimizer provided by PyTorch. The custom gradient descent implementation is shown in the code below:

import torch
import random
random.seed(42)
A = torch.tensor(random.random(), requires_grad=True)
B = torch.tensor(random.random(), requires_grad=True)
C = torch.tensor(random.random(), requires_grad=True)
D = torch.tensor(random.random(), requires_grad=True)
lr = 0.1
while True:
    y1 = A + B - 8
    y2 = C - D - 1
    y3 = A + C - 7
    y4 = B - D - 2
    sq_err = y1*y1 + y2*y2 + y3*y3 + y4*y4
    sq_err.backward()
    # Gradient of the variables
    dA, dB, dC, dD = A.grad.data, B.grad.data, C.grad.data, D.grad.data
    # Updating the variables
    with torch.no_grad():
        A -= lr*dA
        B -= lr*dB
        C -= lr*dC
        D -= lr*dD
    [i.grad.data.zero_() for i in [A, B, C, D]]
    if sq_err < 1e-10:
        break
    else:
        print(sq_err)
print(A, B, C, D)

Code explanation

  • Line 17: We extract the gradients of A, B, C, and D using the grad.data attribute. These gradients were computed in the previous step by calling sq_err.backward().

  • Lines 19–23: The values of A, B, C, and D are updated using gradient descent. We compute the new values by subtracting the product of the learning rate (lr) and the respective gradients (dA, dB, dC, and dD) from the variables. This step is performed within a torch.no_grad() context to ensure that the updates themselves are not tracked for gradient computation.

  • Line 24: After updating the variables, the gradients of A, B, C, and D need to be cleared before the next iteration. We use a list comprehension that iterates over the variables and calls the zero_() method on their gradient tensors to set them to zero; without this reset, gradients from successive backward() calls would accumulate, as the sketch below shows.
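
The following minimal sketch (independent of the puzzle) shows why the reset is needed: PyTorch accumulates gradients across backward() calls.

import torch
x = torch.tensor(2.0, requires_grad=True)
(x * x).backward()
print(x.grad)        # tensor(4.): d(x^2)/dx = 2x = 4 at x = 2
(x * x).backward()   # without resetting, the new gradient is added to the old one
print(x.grad)        # tensor(8.): 4 from the first call plus 4 from the second
x.grad.zero_()       # clearing the gradient restores a clean slate
print(x.grad)        # tensor(0.)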

Pros of gradient descent

  • It can find approximate solutions even in cases where an exact solution is almost impossible.

  • It is a general-purpose optimization algorithm that can be applied to a wide range of machine learning problems.

  • It can be parallelized, which means it can take advantage of multicore processors or distributed computing frameworks to speed up the optimization process. This is particularly useful when dealing with massive datasets or complex models.

Cons of gradient descent

  • Gradient descent is not guaranteed to find the global minimum and may converge to a local minimum instead.

  • The convergence of gradient descent depends on the choice of the learning rate and the initial values of A, B, C, and D. An inappropriate learning rate or starting point may result in slow convergence or getting stuck at local minima.

  • The method relies on numerical optimization, which involves computing gradients and performing iterative updates. This can be computationally expensive for large systems of equations or complex functions.

Conclusion

By applying gradient descent optimization, we successfully minimized the loss function and found optimal values for A, B, C, and D. This ensures efficient data transfer by balancing bandwidth allocation for video streaming and file downloads while meeting latency requirements. The power of gradient descent extends to optimizing models and solving real-world problems in data science, machine learning, and optimization domains.
