1. Introduction
For small or perfectly known environments, we can solve MDPs directly using dynamic programming. In practice, these exact methods are often infeasible for larger environments, but they remain foundational for understanding the mechanics behind more advanced RL.
In this lesson, you’ll learn:
- How Value Iteration and Policy Iteration find an optimal policy.
- The assumptions and computations involved in each method.
- Why these methods don't always scale to real-world tasks, and how that motivates model-free RL and Deep RL.
2. Value Iteration
2a. Objective
Value Iteration focuses on finding the optimal value function $V^*(s)$, which tells us the maximum achievable reward from each state $s$. Once $V^*(s)$ is known, the optimal policy $\pi^*$ follows directly.
2b. The Bellman Optimality Update
Value Iteration uses the Bellman Optimality Equation:
$$V_{k+1}(s) = \max_a \sum_{s'} P(s' \mid s, a)\Big(R(s,a,s') + \gamma V_k(s')\Big).$$
- $V_{k+1}(s)$ : Updated value after the $(k+1)$-th iteration.
- $\max_a$ : We choose the action that yields the highest expected return.
- $\gamma$ : Discount factor balancing immediate vs. future rewards.
We iterate until $V(s)$ converges, then derive the policy:
$$\pi^*(s) = \arg\max_a \sum_{s'} P(s' \mid s,a)\Big(R(s,a,s') + \gamma V(s')\Big).$$
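To make the update concrete, here is a minimal NumPy sketch of Value Iteration on a tiny, made-up two-state MDP; the transition tensor `P`, reward tensor `R`, and the numbers inside them are purely illustrative assumptions, not from any particular environment.

```python
import numpy as np

gamma = 0.9  # discount factor

# P[s, a, s'] = probability of landing in s' after taking action a in state s (assumed known)
P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],   # transitions from state 0 under actions 0 and 1
    [[0.5, 0.5], [0.0, 1.0]],   # transitions from state 1 under actions 0 and 1
])
# R[s, a, s'] = immediate reward for that transition (illustrative values)
R = np.array([
    [[ 1.0, 0.0], [0.0, 2.0]],
    [[-1.0, 1.0], [0.0, 0.0]],
])

def value_iteration(P, R, gamma, tol=1e-8):
    """Repeatedly apply the Bellman optimality update until V stops changing."""
    V = np.zeros(P.shape[0])
    while True:
        # Q[s, a] = sum_{s'} P(s'|s,a) * (R(s,a,s') + gamma * V(s'))
        Q = np.einsum("sat,sat->sa", P, R + gamma * V[None, None, :])
        V_new = Q.max(axis=1)                 # max over actions: Bellman optimality
        if np.max(np.abs(V_new - V)) < tol:   # stop once the values have converged
            break
        V = V_new
    policy = Q.argmax(axis=1)                 # greedy policy w.r.t. the converged values
    return V, policy

V_star, pi_star = value_iteration(P, R, gamma)
print("V*:", V_star, "pi*:", pi_star)
```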
3. Policy Iteration
3a. Objective
Policy Iteration alternates between:
- Policy Evaluation: Compute the value function $V_\pi(s)$ of the current policy $\pi$.
- Policy Improvement: Update $\pi$ by choosing the best action given $V_\pi(s)$.
Formally, during policy evaluation:
$$V_\pi(s) = \sum_{s'} P(s' \mid s,\pi(s))\Big(R(s,\pi(s),s') + \gamma V_\pi(s')\Big).$$
Then, policy improvement updates each state’s action to:
$$\pi_{\text{new}}(s) = \arg\max_a \sum_{s'} P(s'\mid s,a)\Big(R(s,a,s') + \gamma V_\pi(s')\Big).$$
When the policy no longer changes, we have the optimal policy $\pi^*$.
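For comparison, here is a matching sketch of Policy Iteration that reuses the illustrative `P`, `R`, and `gamma` from the Value Iteration example above. Policy evaluation is done exactly by solving the linear system $V_\pi = r_\pi + \gamma P_\pi V_\pi$, mirroring the equation in 3a.

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """Alternate exact policy evaluation and greedy policy improvement."""
    n_states = P.shape[0]
    policy = np.zeros(n_states, dtype=int)            # start from an arbitrary policy
    while True:
        # --- Policy evaluation: solve (I - gamma * P_pi) V = r_pi exactly ---
        idx = np.arange(n_states)
        P_pi = P[idx, policy]                          # (S, S') transitions under pi
        r_pi = np.sum(P_pi * R[idx, policy], axis=1)   # expected one-step reward under pi
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # --- Policy improvement: act greedily with respect to V_pi ---
        Q = np.einsum("sat,sat->sa", P, R + gamma * V[None, None, :])
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):         # policy stable -> optimal
            return V, policy
        policy = new_policy

V_pi, pi_opt = policy_iteration(P, R, gamma)
print("V_pi:", V_pi, "pi*:", pi_opt)
```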
4. Comparing Value Iteration and Policy Iteration
| Aspect | Value Iteration | Policy Iteration |
|---|---|---|
| Approach | Iterates on the value function directly | Alternates between policy evaluation & improvement |
| Convergence | Each sweep is cheap, but many sweeps may be needed | Typically converges in few policy-improvement steps, each requiring a full evaluation |
| Computation | Does not explicitly evaluate a policy at each step | Must solve $V_\pi$ exactly during each evaluation step |
5. Limitations of Exact Solution Methods
- Dependence on a Dynamics Model: Requires the transition function $P(s'\mid s,a)$. In many real-world tasks (like robotics), the exact dynamics are unknown or too complex.
- Scalability Challenges: Both methods scale poorly with large or continuous state/action spaces.
Despite these limitations, understanding Value Iteration and Policy Iteration provides a critical foundation. You'll see these ideas resurface in model-free methods (like Q-Learning) and eventually in deep approaches (like DQN).
6. Moving Beyond Exact Methods: Q-Learning
In practice, we often don't know the environment's dynamics or have an intractable number of states. Q-Learning is a model-free, off-policy algorithm that learns action values $Q(s,a)$ directly from experience without needing $P(s' \mid s,a)$. Deep Q-Learning (DQN) extends Q-Learning with neural networks, making it feasible to handle high-dimensional inputs (e.g., images from Atari games or sensor data in robotics).
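As a preview of what that looks like in code, here is a minimal sketch of the tabular Q-Learning loop. It assumes a Gymnasium-style environment with discrete observation and action spaces; the hyperparameter values are illustrative defaults, not tuned settings.

```python
import numpy as np

def q_learning(env, episodes=5000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-Learning: learn Q(s, a) from interaction, no transition model needed."""
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # Epsilon-greedy exploration
            if np.random.rand() < epsilon:
                a = env.action_space.sample()
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # TD target: bootstrap from the best next action (off-policy max)
            target = r + gamma * (0.0 if terminated else np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```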
6a. Why This Matters
- Scales to Large State Spaces: By using function approximation (deep networks), we can handle tasks like Lunar Lander or even more complex environments.
- Less Manual Engineering: Instead of explicitly coding or learning the transition model, the agent learns it implicitly through repeated interaction.
7. Summary
Value Iteration and Policy Iteration solve small-scale MDPs exactly by exploiting complete knowledge of the environment. However, real-world tasks often demand model-free strategies due to partial knowledge and vast state spaces. The upcoming lessons will guide you through Q-Learning and Deep Q-Learning, a practical bridge between the idealized solutions in this lesson and the real-world problems we'll tackle with PyTorch. Eventually, you'll see how to apply these methods to your own projects and master the Lunar Lander environment.
Hands-On Tip: While these exact methods might not directly solve Lunar Lander (which is continuous or large-discrete in some formulations), coding small grid-world examples is a great way to practice. Once you understand the fundamentals, you can adapt these ideas into PyTorch-powered RL algorithms.
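For instance, a tiny corridor grid world can be wired up in a few lines and fed to the `value_iteration` sketch from Section 2; the layout and reward values below are just one possible toy setup.

```python
import numpy as np

# A 1x4 corridor: states 0..3, actions 0 = left, 1 = right, +1 reward for
# entering the rightmost cell. Assumes value_iteration() from Section 2 is in scope.
n_states, n_actions = 4, 2
P = np.zeros((n_states, n_actions, n_states))
R = np.zeros((n_states, n_actions, n_states))
for s in range(n_states):
    P[s, 0, max(s - 1, 0)] = 1.0               # deterministic move left
    P[s, 1, min(s + 1, n_states - 1)] = 1.0    # deterministic move right
R[:, :, n_states - 1] = 1.0                    # reward for transitioning into the goal cell

V_star, pi_star = value_iteration(P, R, gamma=0.9)
print(pi_star)   # expected: action 1 ("move right") in every state
```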
Next Steps: In the upcoming lessons, we’ll focus on model-free RL and how to incorporate deep neural networks for function approximation. You’ll learn step-by-step how to implement, tune, and test these algorithms in an environment like Lunar Lander—paving the way for more advanced DRL projects.