Exact Solution Methods

Reinforcement Learning

Last updated: December 15, 2024

1. Introduction

In reinforcement learning, the goal is to find an optimal policy that maximizes the expected sum of discounted rewards over time. For small, fully known environments, Exact Solution Methods can solve the Markov Decision Process (MDP) directly using dynamic programming.

Two fundamental methods—Value Iteration and Policy Iteration—help calculate the optimal policy by systematically analyzing the environment's states, actions, and rewards. These methods are foundational, forming the basis for understanding more advanced reinforcement learning techniques.
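
Before diving into the algorithms, it helps to see how a small, fully known MDP can be written down concretely. The sketch below encodes a hypothetical two-state, two-action MDP as NumPy arrays; the transition probabilities, rewards, and discount factor are made up purely for illustration and are reused by the sketches later in this lesson.

```python
import numpy as np

# A hypothetical 2-state, 2-action MDP, used only for illustration.
# P[s, a, s2] is the probability of moving from state s to s2 under action a;
# R[s, a, s2] is the reward received for that transition.
n_states, n_actions = 2, 2

P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],   # transitions out of state 0 for actions 0 and 1
    [[0.5, 0.5], [0.0, 1.0]],   # transitions out of state 1 for actions 0 and 1
])
R = np.zeros((n_states, n_actions, n_states))
R[:, :, 1] = 1.0                # reward of 1 whenever the agent lands in state 1
gamma = 0.9                     # discount factor
```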

2. Value Iteration

Objective: Value Iteration focuses on determining the optimal value function $V^*(s)$, which represents the maximum expected discounted return obtainable from each state $s$. From $V^*(s)$, the optimal policy $\pi^*$ can be derived.

2a. How It Works

Value Iteration uses the Bellman Optimality Equation to iteratively update the value of each state until convergence:
$$
 V_{k+1}(s) = \max_a \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V_k(s') \right] 
$$

When $V(s)$ converges (stops changing significantly), the optimal policy is derived:
$$
 \pi^*(s) = \arg \max_a \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V(s') \right] 
$$
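
Here is a minimal sketch of Value Iteration under the array-based MDP representation introduced above (P[s, a, s'], R[s, a, s'], and a discount factor gamma); the function name and stopping tolerance are illustrative choices, not part of any standard API.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-6):
    """Sketch of Value Iteration for a finite MDP given as arrays
    P[s, a, s2] (transition probabilities) and R[s, a, s2] (rewards)."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)                          # V_0(s) = 0 for all states
    while True:
        # Bellman optimality backup: Q[s, a] = sum_s2 P(s2|s,a) * (R + gamma * V(s2))
        Q = np.einsum("sat,sat->sa", P, R + gamma * V[None, None, :])
        V_new = Q.max(axis=1)                       # V_{k+1}(s) = max_a Q[s, a]
        if np.max(np.abs(V_new - V)) < tol:         # stop once updates are negligible
            return V_new, Q.argmax(axis=1)          # optimal values and greedy policy
        V = V_new
```

Calling value_iteration(P, R, gamma) on the toy MDP above returns both the converged values and the greedy policy from the second equation.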

2b. Key Characteristics

- Model-based: the transition probabilities $P(s' \mid s, a)$ and rewards $R(s, a, s')$ must be known.
- Guaranteed to converge to the optimal value function $V^*$ for finite MDPs when $\gamma < 1$.
- Each iteration sweeps every state and takes a maximum over actions, so the cost per sweep grows with the number of states and actions.

3. Policy Iteration

Objective: Policy Iteration alternates between evaluating a policy and improving it to find the optimal policy $\pi^*$.

3a. How It Works

Policy Iteration has two main steps:
 
Step 1: Policy Evaluation

Compute the value function $V_\pi(s)$ for a given policy $\pi$ until convergence:
$$
 V_\pi(s) = \sum_{s'} P(s' \mid s, \pi(s)) \left[ R(s, \pi(s), s') + \gamma V_\pi(s') \right] 
$$
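
Below is a minimal sketch of iterative policy evaluation for a deterministic policy, again assuming the array-based MDP representation from the introduction; the helper name and stopping tolerance are illustrative.

```python
import numpy as np

def policy_evaluation(P, R, gamma, policy, tol=1e-6):
    """Sketch of iterative policy evaluation for a deterministic policy,
    given as an integer array policy[s] -> action."""
    n_states = P.shape[0]
    V = np.zeros(n_states)
    while True:
        # Apply the Bellman expectation backup for the fixed policy.
        V_new = np.array([
            np.sum(P[s, policy[s]] * (R[s, policy[s]] + gamma * V))
            for s in range(n_states)
        ])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```

For small MDPs, the same fixed point can also be computed exactly by solving the linear system $V_\pi = R_\pi + \gamma P_\pi V_\pi$.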

Step 2: Policy Improvement

Update the policy by selecting actions that maximize the expected return based on $V_\pi(s)$:
$$
 \pi_{\text{new}}(s) = \arg \max_a \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V_\pi(s') \right] 
$$

If the policy does not change during an improvement step ($\pi_{\text{new}} = \pi$), the algorithm terminates. Otherwise, the process repeats.
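
Putting the two steps together, here is a minimal sketch of the full Policy Iteration loop; it reuses the policy_evaluation helper sketched above and terminates exactly when the improvement step leaves the policy unchanged.

```python
import numpy as np

def policy_iteration(P, R, gamma, tol=1e-6):
    """Sketch of Policy Iteration, reusing the policy_evaluation helper above."""
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)               # start from an arbitrary policy
    while True:
        V = policy_evaluation(P, R, gamma, policy, tol)  # Step 1: evaluate current policy
        # Step 2: greedy improvement with respect to V_pi
        Q = np.einsum("sat,sat->sa", P, R + gamma * V[None, None, :])
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):           # policy is stable: optimal
            return V, policy
        policy = new_policy
```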

3b. Key Characteristics

- Alternates a full policy evaluation with a greedy improvement step.
- Each improvement yields a policy at least as good as the previous one, so for finite MDPs the loop terminates after finitely many iterations.
- Like Value Iteration, it is model-based and requires the full transition and reward model.

4. Comparing Value Iteration and Policy Iteration

| Aspect | Value Iteration | Policy Iteration |
| --- | --- | --- |
| Approach | Iterates directly on the value function. | Alternates between policy evaluation and policy improvement. |
| Convergence | Each sweep is cheap, but many sweeps may be needed. | Typically converges in fewer iterations, each of which is more expensive. |
| Computation | No explicit policy evaluation step. | Requires computing $V_\pi(s)$ (iteratively or by solving a linear system) during evaluation. |

5. Limitations of Exact Solution Methods

While Value Iteration and Policy Iteration guarantee finding the optimal policy for finite MDPs, they face significant limitations in real-world scenarios:

5a. Dependence on a Dynamics Model

Both methods assume complete knowledge of the environment's dynamics: the transition probabilities $P(s' \mid s, a)$ and reward function $R(s, a, s')$ must be specified in advance. In many real-world problems, this model is unknown or too expensive to build.

5b. Scalability Challenges

Every update sweeps over the entire state (and action) space, so computation and memory grow rapidly with problem size. Tabular representations of $V(s)$ become impractical for large or continuous state spaces.

To overcome these challenges, reinforcement learning introduces approximation techniques (e.g., neural networks) to handle large-scale problems.

6. Moving Beyond Exact Methods: Q-Learning

Given these limitations, Q-Learning, a model-free, off-policy algorithm, offers a practical alternative by directly estimating the value of taking actions in states without requiring a full transition model. As environments grow in complexity, traditional Q-Learning is enhanced by incorporating deep learning, leading to Deep Q-Learning (DQN).
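
For contrast with the model-based backups above, here is a minimal sketch of the tabular Q-Learning update; Q is a |S| x |A| array, and the learning rate alpha is an illustrative choice. Note how the update uses only an observed transition (s, a, r, s'), never the transition model P.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-Learning update from an observed transition (s, a, r, s_next)."""
    # Bootstrapped target: immediate reward plus discounted best future value.
    target = r + gamma * np.max(Q[s_next])
    # Move Q[s, a] a small step (alpha) toward the target.
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```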

Why Deep Q-Learning?

A tabular Q-function stores one value for every state-action pair, which becomes infeasible when the state space is large or continuous. Deep Q-Learning replaces the table with a neural network that approximates $Q(s, a)$, allowing the agent to generalize across similar states and scale to high-dimensional inputs.

In the next sections, we will explore Q-Learning and Deep Q-Learning, examining their mechanisms, benefits, and applications in real-world scenarios.

7. Summary

Value Iteration and Policy Iteration provide a solid foundation for solving MDPs in small-scale environments. However, their reliance on complete environment knowledge and limited scalability highlight the need for alternative approaches. Understanding these methods is crucial for appreciating the evolution of reinforcement learning techniques toward more practical and scalable solutions, such as Q-Learning and Deep Q-Learning.
