Value iteration vs policy iteration

In reinforcement learning, Markov decision processes (MDPs) provide a framework for sequential decision-making problems. The goal in such problems is to find an optimal policy, i.e., a mapping from states to actions that maximizes the cumulative reward over time. Two classic dynamic programming approaches tackle this optimization problem:

  • Value iteration

  • Policy iteration

In this Answer, we will discuss both algorithms and then examine how they differ.
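
Before looking at the two algorithms, it helps to have a concrete MDP in mind. The snippet below defines a tiny, made-up two-state MDP in Python; the states, actions, transition probabilities, and rewards are purely illustrative assumptions (not a standard benchmark). The later code sketches in this Answer reuse this representation.

```python
# A tiny, made-up MDP used only for illustration.
# MDP[(state, action)] is a list of (next_state, probability, reward)
# triples, i.e., T(s, a, s') and R(s, a, s') stored together.

GAMMA = 0.9                      # discount factor γ
STATES = ["low", "high"]         # state space S
ACTIONS = ["wait", "work"]       # action space A

MDP = {
    ("low",  "wait"): [("low",  1.0,  0.0)],
    ("low",  "work"): [("high", 0.8,  5.0), ("low", 0.2, -1.0)],
    ("high", "wait"): [("high", 0.7,  2.0), ("low", 0.3,  0.0)],
    ("high", "work"): [("high", 1.0,  3.0)],
}
```

A policy for this MDP is simply a dictionary that maps each state to an action, for example `{"low": "work", "high": "work"}`.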

Value iteration


Value iteration is a dynamic programming algorithm, i.e., it breaks the problem into subproblems and combines their solutions into the overall solution. The agent interacts with its environment through actions to maximize the long-term reward. Value iteration uses the values of neighboring states to refine the estimate of each state's value. It starts with arbitrary initial estimates and keeps improving them until they converge to the optimal values.

Mathematical intuition

The value iteration equation is defined as:

V(s) = max_a Σ_s' T(s, a, s') * (R(s, a, s') + γ * V(s'))

  • V(s) is the value of state s.

  • max selects the best action, i.e., the action that yields the highest expected value.

  • T(s, a, s') is the probability that the agent moves from state s to state s' by taking action a.

  • R(s, a, s') is the reward the agent receives when it moves from state s to s' by taking action a.

  • γ is the discount factor that determines the significance of long-term rewards compared to short-term rewards.

  • V(s') is the value of the next state s'.
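
To make the update concrete, the short sketch below applies a single Bellman backup to one state of the illustrative two-state MDP defined earlier (repeated here so the snippet runs on its own). All numbers are assumptions chosen for illustration.

```python
# One Bellman optimality backup on the illustrative two-state MDP:
# V(s) = max_a Σ_s' T(s, a, s') * (R(s, a, s') + γ * V(s'))

GAMMA = 0.9
ACTIONS = ["wait", "work"]
MDP = {  # MDP[(s, a)] -> list of (next_state, probability, reward)
    ("low",  "wait"): [("low",  1.0,  0.0)],
    ("low",  "work"): [("high", 0.8,  5.0), ("low", 0.2, -1.0)],
    ("high", "wait"): [("high", 0.7,  2.0), ("low", 0.3,  0.0)],
    ("high", "work"): [("high", 1.0,  3.0)],
}

V = {"low": 0.0, "high": 0.0}    # current value estimates

def backup(state, V):
    """Return max over actions of Σ T(s, a, s') * (R(s, a, s') + γ * V(s'))."""
    return max(
        sum(p * (r + GAMMA * V[s2]) for s2, p, r in MDP[(state, a)])
        for a in ACTIONS
    )

print(backup("low", V))   # 3.8 here: "work" gives 0.8*5.0 + 0.2*(-1.0)
```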

The pseudocode for value iteration is:

Input: states S, actions A, transition probabilities T, rewards R, and
the discount factor γ.
Output: the optimal value function V* and the optimal policy π* that
maximizes the reward.

Initialize the value of every state to 0:
    V(s) = 0, ∀ s ∈ S
Repeat until the value function converges:
    For every state s ∈ S:
        Compute the action values H(s, a) for all actions a ∈ A:
            H(s, a) = Σ(T(s' | s, a) * (R(s, a, s') + γ * V(s'))), for all s' ∈ S
        Update the value function for the current state:
            V(s) = max(H(s, a)), ∀ a ∈ A
After the value function has converged, extract the optimal policy:
    For every state s ∈ S:
        π*(s) = argmax(H(s, a)), ∀ a ∈ A
At the end, we have the optimal policy π* and the optimal value function V*.
Pseudocode for value iteration
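
The pseudocode above maps fairly directly onto Python. The following sketch is one possible implementation, run on the same illustrative two-state MDP; the convergence threshold THETA and the MDP itself are assumptions made for this example, not part of the original pseudocode.

```python
# Value iteration on the illustrative two-state MDP from this Answer.
# MDP[(s, a)] holds (next_state, probability, reward) triples.

GAMMA = 0.9
THETA = 1e-6                     # convergence threshold (an assumption)
STATES = ["low", "high"]
ACTIONS = ["wait", "work"]
MDP = {
    ("low",  "wait"): [("low",  1.0,  0.0)],
    ("low",  "work"): [("high", 0.8,  5.0), ("low", 0.2, -1.0)],
    ("high", "wait"): [("high", 0.7,  2.0), ("low", 0.3,  0.0)],
    ("high", "work"): [("high", 1.0,  3.0)],
}

def q_value(s, a, V):
    """H(s, a) = Σ T(s, a, s') * (R(s, a, s') + γ * V(s'))."""
    return sum(p * (r + GAMMA * V[s2]) for s2, p, r in MDP[(s, a)])

def value_iteration():
    V = {s: 0.0 for s in STATES}               # V(s) = 0, ∀ s ∈ S
    while True:
        delta = 0.0
        for s in STATES:
            best = max(q_value(s, a, V) for a in ACTIONS)
            delta = max(delta, abs(best - V[s]))
            V[s] = best                        # V(s) = max_a H(s, a)
        if delta < THETA:                      # values have converged
            break
    # Extract the greedy (optimal) policy from the converged values.
    policy = {s: max(ACTIONS, key=lambda a: q_value(s, a, V)) for s in STATES}
    return V, policy

V_star, pi_star = value_iteration()
print(V_star)    # approximate optimal value of each state
print(pi_star)   # greedy (optimal) action in each state
```

The final dictionary comprehension corresponds to the π*(s) = argmax(H(s, a)) step in the pseudocode.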

Policy iteration

Policy iteration is an iterative method that alternates between evaluating and improving a policy until an optimal policy is found.

Mathematical intuition

Policy iteration consists of two parts:

  • Policy evaluation

  • Policy improvement

In policy evaluation, we compute V(s) for the current policy π(s), repeating the update below until the values converge to the value function of that policy:

V(s) = Σ_s' T(s, π(s), s') * (R(s, π(s), s') + γ * V(s'))

  • T(s, π(s), s') is the probability of transitioning from state s to state s' when the action is given by π(s).

  • R(s, π(s), s') is the short-term or immediate reward for moving from state s to s', given that the action is prescribed by π(s).

Policy improvement aims to find a better policy after evaluating the current one, thereby increasing the expected cumulative reward:

π'(s) = argmax_a Σ_s' T(s, a, s') * (R(s, a, s') + γ * V(s'))

  • π'(s) is the improved policy at state s.

  • argmax selects the action a with the maximum expected cumulative reward.
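
As a minimal sketch of these two steps, the code below evaluates a fixed policy on the illustrative MDP until the values stop changing and then performs one greedy improvement step. The starting policy ("wait" in every state) and the threshold are arbitrary assumptions.

```python
# Policy evaluation followed by one policy improvement step on the
# illustrative two-state MDP from this Answer.

GAMMA = 0.9
THETA = 1e-6                     # convergence threshold (an assumption)
STATES = ["low", "high"]
ACTIONS = ["wait", "work"]
MDP = {  # MDP[(s, a)] -> list of (next_state, probability, reward)
    ("low",  "wait"): [("low",  1.0,  0.0)],
    ("low",  "work"): [("high", 0.8,  5.0), ("low", 0.2, -1.0)],
    ("high", "wait"): [("high", 0.7,  2.0), ("low", 0.3,  0.0)],
    ("high", "work"): [("high", 1.0,  3.0)],
}

def q_value(s, a, V):
    """Expected return of taking action a in state s and then following V."""
    return sum(p * (r + GAMMA * V[s2]) for s2, p, r in MDP[(s, a)])

def evaluate(policy):
    """Policy evaluation: V(s) = Σ T(s, π(s), s') * (R(s, π(s), s') + γ V(s'))."""
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            new_v = q_value(s, policy[s], V)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < THETA:
            return V

def improve(V):
    """Policy improvement: π'(s) = argmax_a Σ T(s, a, s') * (R + γ V(s'))."""
    return {s: max(ACTIONS, key=lambda a: q_value(s, a, V)) for s in STATES}

pi = {"low": "wait", "high": "wait"}   # an arbitrary starting policy
V_pi = evaluate(pi)                    # value function of the current policy
pi_improved = improve(V_pi)            # greedy policy with respect to V_pi
print(V_pi, pi_improved)
```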

The pseudocode for policy iteration is:

The inputs and outputs of policy iteration are the same as those of
value iteration.

Initialize a random policy π and set V(s) = 0, ∀ s ∈ S
Repeat until the policy converges:
    Policy evaluation:
        Repeat until the value estimates converge:
            For each state s ∈ S:
                Update the value estimate under the current policy:
                    V(s) = Σ(T(s' | s, π(s)) * (R(s, π(s), s') + γ * V(s'))), for all s' ∈ S
    Policy improvement:
        isPolicyStable = true
        For each state s ∈ S:
            previousAction = π(s)
            Compute the action values H(s, a) for all actions a ∈ A:
                H(s, a) = Σ(T(s' | s, a) * (R(s, a, s') + γ * V(s'))), for all s' ∈ S
            Update the policy greedily:
                π(s) = argmax(H(s, a)), ∀ a ∈ A
            if previousAction != π(s):
                isPolicyStable = false
        if isPolicyStable:
            break out of the loop
Return the optimal value function V* and the optimal policy π*
Pseudocode for policy iteration
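
Putting evaluation and improvement into a loop gives a complete policy iteration sketch. The implementation below is one possible rendering of the pseudocode above on the same illustrative MDP; the variable policy_stable mirrors the pseudocode's isPolicyStable flag.

```python
# Policy iteration on the illustrative two-state MDP from this Answer.

GAMMA, THETA = 0.9, 1e-6
STATES = ["low", "high"]
ACTIONS = ["wait", "work"]
MDP = {  # MDP[(s, a)] -> list of (next_state, probability, reward)
    ("low",  "wait"): [("low",  1.0,  0.0)],
    ("low",  "work"): [("high", 0.8,  5.0), ("low", 0.2, -1.0)],
    ("high", "wait"): [("high", 0.7,  2.0), ("low", 0.3,  0.0)],
    ("high", "work"): [("high", 1.0,  3.0)],
}

def q_value(s, a, V):
    """H(s, a) = Σ T(s, a, s') * (R(s, a, s') + γ * V(s'))."""
    return sum(p * (r + GAMMA * V[s2]) for s2, p, r in MDP[(s, a)])

def policy_iteration():
    policy = {s: ACTIONS[0] for s in STATES}   # arbitrary initial policy
    V = {s: 0.0 for s in STATES}
    while True:
        # Policy evaluation: sweep V(s) = H(s, π(s)) until convergence.
        while True:
            delta = 0.0
            for s in STATES:
                new_v = q_value(s, policy[s], V)
                delta = max(delta, abs(new_v - V[s]))
                V[s] = new_v
            if delta < THETA:
                break
        # Policy improvement: act greedily with respect to V.
        policy_stable = True
        for s in STATES:
            previous_action = policy[s]
            policy[s] = max(ACTIONS, key=lambda a: q_value(s, a, V))
            if policy[s] != previous_action:
                policy_stable = False
        if policy_stable:                      # policy no longer changes
            return V, policy

V_star, pi_star = policy_iteration()
print(V_star)    # should closely match the values from value iteration
print(pi_star)
```

For this particular toy MDP, the policy stops changing after only a couple of improvement steps, which is exactly the convergence criterion in the pseudocode.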

Differences

The following table summarizes the differences between value iteration and policy iteration:

| Aspect | Value Iteration | Policy Iteration |
|---|---|---|
| Approach | Directly computes the optimal value function. | Alternates between policy evaluation and policy improvement. |
| Steps | Repeatedly applies the Bellman optimality equation. | First evaluates the current policy and then improves it. |
| Intermediate policies | Does not explicitly generate intermediate policies. | Generates an explicit intermediate policy in every iteration. |
| Convergence criterion | The values converge to their optimal values. | The policy no longer changes between iterations. |
| Computational efficiency | Usually needs more iterations, but each iteration is a single, relatively cheap sweep over the states. | Usually converges in fewer iterations, but each iteration includes a full policy evaluation and is therefore more expensive. |
| Application | Preferred when a full policy evaluation per iteration is too costly, e.g., for large state spaces. | Often faster in practice when policy evaluation is affordable. |

Conclusion

In this Answer, we discussed value iteration and policy iteration, explained the mathematical intuition behind each, and walked through their pseudocode to understand them better.

Note: Follow this link to learn more about reinforcement learning.
