In reinforcement learning, Markov decision processes (MDPs) provide a framework for modeling sequential decision-making problems. The goal in such problems is to find an optimal policy, a mapping from states to actions that maximizes the expected cumulative reward over time. Two classic approaches tackle this reward optimization problem:
Value iteration
Policy iteration
In this Answer, we will discuss the algorithms mentioned above and delve into their differences as well.
Value iteration is an algorithm that computes the optimal value function by repeatedly applying the Bellman optimality update to every state until the values stop changing. The optimal policy is then read off from the converged values.

The value iteration update is defined as:

$$V_{k+1}(s) = \max_{a \in A} \sum_{s' \in S} T(s' \mid s, a)\,\bigl(R(s) + \gamma V_k(s')\bigr)$$

Here:

$V_k(s)$ is the estimated value of state $s$ after $k$ iterations.
$T(s' \mid s, a)$ is the probability of transitioning to state $s'$ when action $a$ is taken in state $s$.
$R(s)$ is the reward received in state $s$.
$\gamma$ is the discount factor.
$S$ and $A$ are the sets of states and actions.
The pseudocode for value iteration is:
Input: the states S, the actions A, the transition probabilities T, the reward function R, and the discount factor γ
Output: the optimal value function V* and the optimal policy π* that maximizes the reward

Initialize the value of every state to zero:
    V(s) = 0, ∀ s ∈ S

Repeat until the value function converges:
    For every state s ∈ S:
        Compute the action values H(s, a) for all actions a ∈ A:
            H(s, a) = Σ(T(s' | s, a) * (R(s) + γ * V(s'))), for all s' ∈ S
        Update the value of the current state:
            V(s) = max(H(s, a)), ∀ a ∈ A

After the value function has converged, extract the optimal policy:
    For every state s ∈ S:
        π*(s) = argmax(H(s, a)), ∀ a ∈ A

Return the optimal value function V* and the optimal policy π*.
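To make the procedure concrete, here is a minimal Python sketch of value iteration. The two-state, two-action MDP below (its states, actions, transition probabilities T, rewards R, and discount factor γ) is a made-up example for illustration only; in practice, you would plug in your own MDP.

```python
# Minimal value-iteration sketch on a hypothetical two-state, two-action MDP.
# All numbers below are illustrative, not taken from the Answer.

GAMMA = 0.9
STATES = ["s0", "s1"]
ACTIONS = ["stay", "move"]

# T[(s, a)] maps each next state s' to the probability T(s' | s, a).
T = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s0": 0.1, "s1": 0.9},
    ("s1", "move"): {"s0": 0.8, "s1": 0.2},
}

# R(s): reward received in state s (state-based reward, as in the pseudocode).
R = {"s0": 0.0, "s1": 1.0}


def value_iteration(theta=1e-6):
    """Return the (approximately) optimal value function and a greedy policy."""
    V = {s: 0.0 for s in STATES}  # V(s) = 0 for all s
    while True:
        delta = 0.0
        for s in STATES:
            # H(s, a) = sum over s' of T(s' | s, a) * (R(s) + gamma * V(s'))
            H = {
                a: sum(p * (R[s] + GAMMA * V[s2]) for s2, p in T[(s, a)].items())
                for a in ACTIONS
            }
            best = max(H.values())
            delta = max(delta, abs(best - V[s]))
            V[s] = best  # V(s) = max_a H(s, a)
        if delta < theta:  # values have (approximately) converged
            break
    # Extract the optimal policy: pi*(s) = argmax_a H(s, a)
    policy = {}
    for s in STATES:
        H = {
            a: sum(p * (R[s] + GAMMA * V[s2]) for s2, p in T[(s, a)].items())
            for a in ACTIONS
        }
        policy[s] = max(H, key=H.get)
    return V, policy


V_star, pi_star = value_iteration()
print(V_star, pi_star)
```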
Policy iteration is an iterative method that alternates between evaluating and improving a policy until an optimal policy is found.
Policy iteration consists of two parts:
Policy evaluation
Policy improvement
In policy evaluation, we compute V(s) for the current policy π until the estimate converges to the value function of that policy.
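In symbols, each sweep of this evaluation applies the Bellman expectation update for the fixed policy π, mirroring the H(s, a) expression used in the pseudocode below:

$$V_{k+1}(s) = \sum_{s' \in S} T\bigl(s' \mid s, \pi(s)\bigr)\,\bigl(R(s) + \gamma V_k(s')\bigr)$$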
Policy improvement aims to find a better policy after evaluating the current one, by acting greedily with respect to the evaluated value function. This increases the expected cumulative reward.
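Concretely, the improved policy picks the greedy action in every state, which corresponds to the argmax step in the pseudocode below:

$$\pi_{\text{new}}(s) = \arg\max_{a \in A} \sum_{s' \in S} T(s' \mid s, a)\,\bigl(R(s) + \gamma V^{\pi}(s')\bigr)$$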
The pseudocode for policy iteration is:
The input and output of policy iteration are the same as those of value iteration.

Declare and initialize a random policy π and a value function V.

Repeat until the policy converges:

    Policy evaluation:
        Repeat until V converges:
            For each state s ∈ S:
                Compute the action value of the action chosen by the policy:
                    H(s, π(s)) = Σ(T(s' | s, π(s)) * (R(s) + γ * V(s'))), for all s' ∈ S
                Update the value function estimate:
                    V(s) = H(s, π(s))

    Policy improvement:
        isPolicyStable = true
        For each state s ∈ S:
            previousAction = π(s)
            Compute the action values H(s, a) for all actions a ∈ A:
                H(s, a) = Σ(T(s' | s, a) * (R(s) + γ * V(s'))), for all s' ∈ S
            Update the policy:
                π(s) = argmax(H(s, a)), ∀ a ∈ A
            if previousAction != π(s):
                isPolicyStable = false
        if isPolicyStable == true:
            break out of the loop

Return the optimal value function V* and the optimal policy π*.
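Below is a matching Python sketch of policy iteration. It reuses the same made-up two-state MDP as the value-iteration sketch above, so the states, actions, transition probabilities, and rewards are again purely illustrative.

```python
# Minimal policy-iteration sketch on the same hypothetical two-state MDP
# used in the value-iteration example; all numbers are illustrative.

GAMMA = 0.9
STATES = ["s0", "s1"]
ACTIONS = ["stay", "move"]
T = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s0": 0.1, "s1": 0.9},
    ("s1", "move"): {"s0": 0.8, "s1": 0.2},
}
R = {"s0": 0.0, "s1": 1.0}


def action_value(s, a, V):
    """H(s, a) = sum over s' of T(s' | s, a) * (R(s) + gamma * V(s'))."""
    return sum(p * (R[s] + GAMMA * V[s2]) for s2, p in T[(s, a)].items())


def policy_iteration(theta=1e-6):
    policy = {s: ACTIONS[0] for s in STATES}  # start from an arbitrary policy
    V = {s: 0.0 for s in STATES}
    while True:
        # Policy evaluation: iterate V(s) = H(s, pi(s)) until convergence.
        while True:
            delta = 0.0
            for s in STATES:
                v_new = action_value(s, policy[s], V)
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # Policy improvement: act greedily with respect to the evaluated V.
        policy_stable = True
        for s in STATES:
            previous_action = policy[s]
            policy[s] = max(ACTIONS, key=lambda a: action_value(s, a, V))
            if policy[s] != previous_action:
                policy_stable = False
        if policy_stable:  # no state changed its action, so the policy is optimal
            return V, policy


V_star, pi_star = policy_iteration()
print(V_star, pi_star)
```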
The difference table between policy iteration and value iteration is given below:
| | Value Iteration | Policy Iteration |
| --- | --- | --- |
| Approach | Directly computes the optimal value function. | Alternates between policy evaluation and policy improvement. |
| Steps | Applies the Bellman optimality update in a single step. | First evaluates the current policy, then improves it. |
| Intermediate Policies | Does not explicitly generate intermediate policies. | Generates an explicit intermediate policy in each iteration. |
| Convergence Criteria | The values converge to their optimal values. | The policy no longer changes between iterations. |
| Computational Efficiency | Typically needs more iterations, but each iteration is a single, cheap sweep over the states. | Typically converges in fewer iterations, but each iteration includes a full policy evaluation. |
| Application | Simpler to implement; suitable when a full policy evaluation is expensive. | Often converges faster in practice when policy evaluation is cheap. |
In this Answer, we discussed value iteration and policy iteration, explained the mathematical intuition behind each, and walked through their pseudocode.
Note: Follow this link to learn more about reinforcement learning.