In reinforcement learning, Markov decision processes (MDPs) provide a framework for modeling sequential decision-making problems. The goal in such problems is to find an optimal policy, a mapping from states to actions that maximizes the expected cumulative reward over time. Two classic approaches tackle this reward optimization problem:
Value iteration
Policy iteration
In this Answer, we will discuss the algorithms mentioned above and delve into their differences as well.
Value iteration is an algorithm that computes the optimal value function by repeatedly applying the Bellman optimality update to every state until the values stop changing. The optimal policy is then read off from the converged values.

The value iteration update is defined as:

$$V_{k+1}(s) = \max_{a \in A} \sum_{s' \in S} T(s' \mid s, a)\,\bigl(R(s) + \gamma V_k(s')\bigr)$$

Here:

$V_k(s)$ is the estimated value of state $s$ after $k$ iterations.
$T(s' \mid s, a)$ is the probability of transitioning to state $s'$ when action $a$ is taken in state $s$.
$R(s)$ is the reward received in state $s$.
$\gamma$ is the discount factor.
$S$ and $A$ are the sets of states and actions.
The pseudocode for value iteration is:
Input: the states S, the actions A, the transition probabilities T, the reward function R, and the discount factor γ
Output: the optimal value function V* and the optimal policy π* that maximizes the reward

Initialize the value of every state to zero:
    V(s) = 0, ∀ s ∈ S

Repeat until the value function converges:
    For every state s ∈ S:
        Compute the action values H(s, a) for all actions a ∈ A:
            H(s, a) = Σ(T(s' | s, a) * (R(s) + γ * V(s'))), for all s' ∈ S
        Update the value of the current state:
            V(s) = max(H(s, a)), ∀ a ∈ A

After the value function has converged, extract the optimal policy:
    For every state s ∈ S:
        π*(s) = argmax(H(s, a)), ∀ a ∈ A

Return the optimal value function V* and the optimal policy π*.
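To make the procedure concrete, here is a minimal Python sketch of value iteration. The two-state, two-action MDP below (its states, actions, transition probabilities T, rewards R, and discount factor γ) is a made-up example for illustration only; in practice, you would plug in your own MDP.

```python
# Minimal value-iteration sketch on a hypothetical two-state, two-action MDP.
# All numbers below are illustrative, not taken from the Answer.

GAMMA = 0.9
STATES = ["s0", "s1"]
ACTIONS = ["stay", "move"]

# T[(s, a)] maps each next state s' to the probability T(s' | s, a).
T = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s0": 0.1, "s1": 0.9},
    ("s1", "move"): {"s0": 0.8, "s1": 0.2},
}

# R(s): reward received in state s (state-based reward, as in the pseudocode).
R = {"s0": 0.0, "s1": 1.0}


def value_iteration(theta=1e-6):
    """Return the (approximately) optimal value function and a greedy policy."""
    V = {s: 0.0 for s in STATES}  # V(s) = 0 for all s
    while True:
        delta = 0.0
        for s in STATES:
            # H(s, a) = sum over s' of T(s' | s, a) * (R(s) + gamma * V(s'))
            H = {
                a: sum(p * (R[s] + GAMMA * V[s2]) for s2, p in T[(s, a)].items())
                for a in ACTIONS
            }
            best = max(H.values())
            delta = max(delta, abs(best - V[s]))
            V[s] = best  # V(s) = max_a H(s, a)
        if delta < theta:  # values have (approximately) converged
            break
    # Extract the optimal policy: pi*(s) = argmax_a H(s, a)
    policy = {}
    for s in STATES:
        H = {
            a: sum(p * (R[s] + GAMMA * V[s2]) for s2, p in T[(s, a)].items())
            for a in ACTIONS
        }
        policy[s] = max(H, key=H.get)
    return V, policy


V_star, pi_star = value_iteration()
print(V_star, pi_star)
```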
Policy iteration is an iterative method that alternates between evaluating and improving a policy until an optimal policy is found.
Policy iteration consists of two parts:
Policy evaluation
Policy improvement
In policy evaluation, we compute V(s) for the current policy π until the estimate converges to the value function of that policy.
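In symbols, each sweep of this evaluation applies the Bellman expectation update for the fixed policy π, mirroring the H(s, a) expression used in the pseudocode below:

$$V_{k+1}(s) = \sum_{s' \in S} T\bigl(s' \mid s, \pi(s)\bigr)\,\bigl(R(s) + \gamma V_k(s')\bigr)$$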
Policy improvement aims to find a better policy after evaluating the current one, by acting greedily with respect to the evaluated value function. This increases the expected cumulative reward.
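Concretely, the improved policy picks the greedy action in every state, which corresponds to the argmax step in the pseudocode below:

$$\pi_{\text{new}}(s) = \arg\max_{a \in A} \sum_{s' \in S} T(s' \mid s, a)\,\bigl(R(s) + \gamma V^{\pi}(s')\bigr)$$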
The pseudocode for policy iteration is:
The input and output of policy iteration are the same as those of value iteration.

Declare and initialize a random policy π and a value function V.

Repeat until the policy converges:

    Policy evaluation:
        Repeat until V converges:
            For each state s ∈ S:
                Compute the action value of the action chosen by the policy:
                    H(s, π(s)) = Σ(T(s' | s, π(s)) * (R(s) + γ * V(s'))), for all s' ∈ S
                Update the value function estimate:
                    V(s) = H(s, π(s))

    Policy improvement:
        isPolicyStable = true
        For each state s ∈ S:
            previousAction = π(s)
            Compute the action values H(s, a) for all actions a ∈ A:
                H(s, a) = Σ(T(s' | s, a) * (R(s) + γ * V(s'))), for all s' ∈ S
            Update the policy:
                π(s) = argmax(H(s, a)), ∀ a ∈ A
            if previousAction != π(s):
                isPolicyStable = false
        if isPolicyStable == true:
            break out of the loop

Return the optimal value function V* and the optimal policy π*.
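Below is a matching Python sketch of policy iteration. It reuses the same made-up two-state MDP as the value-iteration sketch above, so the states, actions, transition probabilities, and rewards are again purely illustrative.

```python
# Minimal policy-iteration sketch on the same hypothetical two-state MDP
# used in the value-iteration example; all numbers are illustrative.

GAMMA = 0.9
STATES = ["s0", "s1"]
ACTIONS = ["stay", "move"]
T = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s0": 0.1, "s1": 0.9},
    ("s1", "move"): {"s0": 0.8, "s1": 0.2},
}
R = {"s0": 0.0, "s1": 1.0}


def action_value(s, a, V):
    """H(s, a) = sum over s' of T(s' | s, a) * (R(s) + gamma * V(s'))."""
    return sum(p * (R[s] + GAMMA * V[s2]) for s2, p in T[(s, a)].items())


def policy_iteration(theta=1e-6):
    policy = {s: ACTIONS[0] for s in STATES}  # start from an arbitrary policy
    V = {s: 0.0 for s in STATES}
    while True:
        # Policy evaluation: iterate V(s) = H(s, pi(s)) until convergence.
        while True:
            delta = 0.0
            for s in STATES:
                v_new = action_value(s, policy[s], V)
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # Policy improvement: act greedily with respect to the evaluated V.
        policy_stable = True
        for s in STATES:
            previous_action = policy[s]
            policy[s] = max(ACTIONS, key=lambda a: action_value(s, a, V))
            if policy[s] != previous_action:
                policy_stable = False
        if policy_stable:  # no state changed its action, so the policy is optimal
            return V, policy


V_star, pi_star = policy_iteration()
print(V_star, pi_star)
```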
The difference table between policy iteration and value iteration is given below:
| | Value Iteration | Policy Iteration |
| --- | --- | --- |
| Approach | Directly computes the optimal value function. | Alternates between policy evaluation and policy improvement. |
| Steps | Applies the Bellman optimality update in a single step. | First evaluates the current policy, then improves it. |
| Intermediate Policies | Does not explicitly generate intermediate policies. | Generates an explicit intermediate policy in each iteration. |
| Convergence Criteria | The values converge to their optimal values. | The policy no longer changes between iterations. |
| Computational Efficiency | Typically needs more iterations, but each iteration is a single, cheap sweep over the states. | Typically converges in fewer iterations, but each iteration includes a full policy evaluation. |
| Application | Simpler to implement; suitable when a full policy evaluation is expensive. | Often converges faster in practice when policy evaluation is cheap. |
In this Answer, we discussed value iteration and policy iteration, explained the mathematical intuition behind each, and walked through their pseudocode.
Note: Follow this link to learn more about reinforcement learning.