Exact Solution Methods

Deep Reinforcement Learning

Last updated: January 01, 2025

1. Introduction

For small or perfectly known environments, we can solve MDPs directly using dynamic programming. In practice, these “exact” methods are often infeasible for larger environments, but they remain foundational for understanding the mechanics behind more advanced RL.

In this lesson, you’ll learn:

  • How Value Iteration and Policy Iteration find an optimal policy.

  • The assumptions and computations involved in each method.

  • Why these methods don’t always scale to real-world tasks—and how that motivates model-free RL and Deep RL.

2. Value Iteration

2a. Objective

Value Iteration focuses on finding the optimal value function $V^*(s)$, which gives the maximum expected return achievable from each state $s$. Once $V^*(s)$ is known, the optimal policy $\pi^*$ follows directly.

2b. The Bellman Optimality Update

Value Iteration uses the Bellman Optimality Equation:

$$V_{k+1}(s) = \max_a \sum_{s'} P(s' \mid s, a)\Big(R(s,a,s') + \gamma V_k(s')\Big).$$

  • $V_{k+1}(s)$ : Updated value after the $(k+1)$-th iteration.
  • $\max_a$ : We choose the action that yields the highest expected return.
  • $\gamma$ : Discount factor balancing immediate vs. future rewards.

We iterate until $V(s)$ converges, then derive the policy:

$$\pi^*(s) = \arg\max_a \sum_{s'} P(s' \mid s,a)\Big(R(s,a,s') + \gamma V(s')\Big).$$
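
To make these updates concrete, here is a minimal tabular Value Iteration sketch. The model format is an illustrative assumption: `P[s][a]` is a list of `(prob, next_state, reward)` tuples encoding $P(s' \mid s,a)$ and $R(s,a,s')$, and the function name and parameters are likewise hypothetical.

```python
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.99, tol=1e-8):
    """Tabular Value Iteration (sketch).

    Assumes an explicit model: P[s][a] is a list of
    (prob, next_state, reward) tuples.
    """
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # Bellman optimality backup: best expected return over actions.
            q_values = [
                sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in range(n_actions)
            ]
            best = max(q_values)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:  # values have (numerically) converged
            break

    # Derive the greedy policy from the converged value function.
    policy = np.array([
        int(np.argmax([
            sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
            for a in range(n_actions)
        ]))
        for s in range(n_states)
    ])
    return V, policy
```

Notice that every backup sums over the full transition model, which is exactly the requirement that the model-free methods later in this lesson remove.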

3. Policy Iteration

3a. Objective

Policy Iteration alternates between:

  1. Policy Evaluation: Evaluate the value function $V_\pi(s)$ under the current policy $\pi$.
  2. Policy Improvement: Update $\pi$ by choosing the best action given $V_\pi(s)$.

Formally, during policy evaluation:

$$V_\pi(s) = \sum_{s'} P(s' \mid s,\pi(s))\Big(R(s,\pi(s),s') + \gamma V_\pi(s')\Big).$$

Then, policy improvement updates each state’s action to:

$$\pi_{\text{new}}(s) = \arg\max_a \sum_{s'} P(s'\mid s,a)\Big(R(s,a,s') + \gamma V_\pi(s')\Big).$$

When the policy no longer changes, we have the optimal policy $\pi^*$.
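
A matching Policy Iteration sketch, under the same assumed `P[s][a]` model format, could look like the following; here the evaluation step sweeps iteratively to a tolerance rather than solving the linear system for $V_\pi$ in closed form.

```python
import numpy as np

def policy_iteration(P, n_states, n_actions, gamma=0.99, tol=1e-8):
    """Tabular Policy Iteration (sketch), using the same hypothetical
    P[s][a] -> list of (prob, next_state, reward) model format."""
    policy = np.zeros(n_states, dtype=int)  # arbitrary initial policy
    V = np.zeros(n_states)

    while True:
        # 1. Policy evaluation: sweep until V_pi stops changing.
        while True:
            delta = 0.0
            for s in range(n_states):
                a = int(policy[s])
                v_new = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < tol:
                break

        # 2. Policy improvement: act greedily with respect to V_pi.
        stable = True
        for s in range(n_states):
            q_values = [
                sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in range(n_actions)
            ]
            best_a = int(np.argmax(q_values))
            if best_a != policy[s]:
                stable = False
            policy[s] = best_a

        if stable:  # policy unchanged -> it is optimal
            return V, policy
```

Each improvement step is preceded by a full evaluation sweep to convergence, which is the extra computation highlighted in the comparison below.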

4. Comparing Value Iteration and Policy Iteration

| Aspect | Value Iteration | Policy Iteration |
| --- | --- | --- |
| Approach | Iterates on the value function directly | Alternates between policy evaluation and policy improvement |
| Convergence | Cheap per-sweep updates, but may need many sweeps to converge | Typically converges in few policy-improvement steps |
| Computation | Does not explicitly evaluate a policy at each step | Must solve $V_\pi$ exactly during each evaluation step |

5. Limitations of Exact Solution Methods

  1. Dependence on a Dynamics Model: Requires the transition function $P(s'\mid s,a)$. In many real-world tasks (like robotics), the exact dynamics are unknown or too complex.

  2. Scalability Challenges: Both methods scale poorly with large or continuous state/action spaces.

Despite these limitations, understanding Value Iteration and Policy Iteration provides a critical foundation. You’ll see these ideas resurface in model-free methods (like Q-Learning) and eventually in deep approaches (like DQN).

6. Moving Beyond Exact Methods: Q-Learning

In practice, we often don’t know the environment’s dynamics, or the number of states is intractably large. Q-Learning is a model-free, off-policy algorithm that learns action values $Q(s,a)$ directly from experience, without needing $P(s' \mid s,a)$. Deep Q-Learning (DQN) extends Q-Learning with neural networks, making it feasible to handle high-dimensional inputs (e.g., images from Atari games or sensor data in robotics).
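
As a preview, here is a minimal sketch of the tabular Q-Learning update loop. The environment interface (`reset()` returning a state, `step(a)` returning `(next_state, reward, done)`) is a simplifying assumption for illustration; later lessons replace the table with a neural network.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=5000,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-Learning (sketch). `env` is a hypothetical environment
    exposing reset() -> s and step(a) -> (s_next, reward, done);
    no transition model P(s' | s, a) is required."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection (explore vs. exploit).
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))

            s_next, r, done = env.step(a)

            # Off-policy TD update toward r + gamma * max_a' Q(s', a').
            td_target = r + gamma * np.max(Q[s_next]) * (not done)
            Q[s, a] += alpha * (td_target - Q[s, a])
            s = s_next
    return Q
```

Contrast this with the Bellman backups above: the expectation over $P(s' \mid s,a)$ is replaced by transitions sampled through interaction.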

6a. Why This Matters

  • Scales to Large State Spaces: By using function approximation (deep networks), we can handle tasks like Lunar Lander or even more complex environments.
  • Less Manual Engineering: Instead of explicitly coding or learning the transition model, the agent learns it implicitly through repeated interaction.

7. Summary

Value Iteration and Policy Iteration solve small-scale MDPs exactly by exploiting complete knowledge of the environment. However, real-world tasks often demand model-free strategies due to partial knowledge and vast state spaces. The upcoming lessons will guide you through Q-Learning and Deep Q-Learning—a practical bridge between the idealized solutions in this lesson and the real-world problems we’ll tackle with PyTorch. Eventually, you’ll see how to apply these methods to your own projects and master the Lunar Lander environment.

Hands-On Tip: While these exact methods might not directly solve Lunar Lander (which is continuous or large-discrete in some formulations), coding small grid-world examples is a great way to practice; one way to start is sketched below. Once you understand the fundamentals, you can adapt these ideas into PyTorch-powered RL algorithms.
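
For instance, a toy chain MDP in the same hypothetical `P[s][a]` format used in the Value Iteration sketch above might look like this (the states, actions, and rewards are invented purely for practice):

```python
# A toy 4-state chain: action 1 moves right toward a goal at state 3,
# action 0 moves left toward state 0. Reaching the goal pays reward 1.
n_states, n_actions = 4, 2
P = {
    s: {
        0: [(1.0, max(s - 1, 0), 0.0)],                         # left
        1: [(1.0, min(s + 1, 3), 1.0 if s + 1 == 3 else 0.0)],  # right
    }
    for s in range(n_states)
}
P[3] = {a: [(1.0, 3, 0.0)] for a in range(n_actions)}  # absorbing goal

V, policy = value_iteration(P, n_states, n_actions, gamma=0.9)
print("V*:", np.round(V, 3))  # non-terminal values grow toward the goal
print("pi*:", policy)         # greedy policy moves right in states 0-2
```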

Next Steps: In the upcoming lessons, we’ll focus on model-free RL and how to incorporate deep neural networks for function approximation. You’ll learn step-by-step how to implement, tune, and test these algorithms in an environment like Lunar Lander—paving the way for more advanced DRL projects.
