1. Introduction
In reinforcement learning, the goal is to find an optimal policy that maximizes the expected sum of discounted rewards over time. For small, fully known environments, Exact Solution Methods can solve the Markov Decision Process (MDP) directly by leveraging dynamic programming principles.
Two fundamental methods—Value Iteration and Policy Iteration—help calculate the optimal policy by systematically analyzing the environment's states, actions, and rewards. These methods are foundational, forming the basis for understanding more advanced reinforcement learning techniques.
2. Value Iteration
Objective: Value Iteration focuses on determining the optimal value function $V^*(s)$, which represents the maximum expected reward obtainable from each state $s$. From $V^*(s)$, the optimal policy $\pi^*$ can be derived.
2a. How It Works
Value Iteration uses the Bellman Optimality Equation to iteratively update the value of each state until convergence:
$$
V_{k+1}(s) = \max_a \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V_k(s') \right]
$$
- $V_{k+1}(s)$: Updated value of state $s$ after $k+1$ iterations.
- $a$: Actions available in state $s$.
- $s'$: Next state.
- $P(s' \mid s, a)$: Probability of transitioning to $s'$ from $s$ by taking action $a$.
- $R(s, a, s')$: Reward for transitioning from $s$ to $s'$ with action $a$.
- $\gamma$: Discount factor that weighs future rewards.
When $V_k(s)$ converges (i.e., the largest change across all states falls below a small threshold), the optimal policy is derived:
$$
\pi^*(s) = \arg \max_a \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V(s') \right]
$$
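To make the update concrete, here is a minimal tabular sketch of this loop in NumPy. The array layout (`P[s, a, s']`, `R[s, a, s']`) and the `value_iteration` helper name are illustrative assumptions, not part of any particular library:

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-6):
    """Tabular Value Iteration (illustrative sketch).

    P[s, a, s'] is the transition probability, R[s, a, s'] the reward.
    Returns the (approximately) optimal values V and a greedy policy pi.
    """
    n_states = P.shape[0]
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = sum_{s'} P(s'|s,a) * (R(s,a,s') + gamma * V(s'))
        Q = np.einsum("san,san->sa", P, R + gamma * V[None, None, :])
        V_new = Q.max(axis=1)                    # Bellman optimality backup
        if np.max(np.abs(V_new - V)) < tol:      # values have stopped changing
            break
        V = V_new
    pi = Q.argmax(axis=1)                        # greedy policy w.r.t. converged values
    return V_new, pi
```

The tolerance check plays the role of the convergence test above, and the final `argmax` implements the policy-extraction equation.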
2b. Key Characteristics
- Directly updates the value function without explicitly evaluating a policy.
- No explicit policy is maintained during the iterations; the policy is extracted only once, after the value function has converged.
- Efficient for small MDPs.
3. Policy Iteration
Objective: Policy Iteration alternates between evaluating a policy and improving it to find the optimal policy $\pi^*$.
3a. How It Works
Policy Iteration has two main steps:
Step 1: Policy Evaluation
Compute the value function $V_\pi(s)$ for a given policy $\pi$ until convergence:
$$
V_\pi(s) = \sum_{s'} P(s' \mid s, \pi(s)) \left[ R(s, \pi(s), s') + \gamma V_\pi(s') \right]
$$
- $\pi(s)$: Action prescribed by policy $\pi$ in state $s$.
- $V_\pi(s)$: Value of state $s$ under policy $\pi$.
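As a concrete sketch, iterative policy evaluation for a deterministic policy could look like the snippet below; the tabular arrays `P[s, a, s']` and `R[s, a, s']` and the `policy_evaluation` helper name are assumptions for illustration. For small MDPs, the same values could also be obtained by solving the corresponding linear system directly.

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.95, tol=1e-6):
    """Iteratively evaluate a deterministic policy pi (one action index per state)."""
    n_states = P.shape[0]
    states = np.arange(n_states)
    V = np.zeros(n_states)
    while True:
        P_pi = P[states, pi]                     # P(s'|s, pi(s)), shape (S, S')
        R_pi = R[states, pi]                     # R(s, pi(s), s'), shape (S, S')
        V_new = np.sum(P_pi * (R_pi + gamma * V[None, :]), axis=1)
        if np.max(np.abs(V_new - V)) < tol:      # value estimates have stabilized
            return V_new
        V = V_new
```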
Step 2: Policy Improvement
Update the policy by selecting actions that maximize the expected return based on $V_\pi(s)$:
$$
\pi_{\text{new}}(s) = \arg \max_a \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V_\pi(s') \right]
$$
If the policy does not change during an improvement step ($\pi_{\text{new}} = \pi$), the algorithm terminates. Otherwise, the process repeats.
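Combining the two steps, a minimal policy iteration loop, reusing the hypothetical `policy_evaluation` helper sketched above, might look like this:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.95):
    """Alternate evaluation and greedy improvement until the policy stops changing."""
    n_states = P.shape[0]
    pi = np.zeros(n_states, dtype=int)           # start from an arbitrary policy
    while True:
        V = policy_evaluation(P, R, pi, gamma)   # Step 1: evaluate the current policy
        # Step 2: act greedily with respect to V_pi
        Q = np.einsum("san,san->sa", P, R + gamma * V[None, None, :])
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):           # no change -> policy is optimal
            return V, pi
        pi = pi_new
```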
3b. Key Characteristics
- Separates evaluation and improvement steps for better interpretability.
- Often converges in fewer iterations than Value Iteration, although each iteration is more expensive.
- Provides more insight into how the policy evolves.
4. Comparing Value Iteration and Policy Iteration
Aspect | Value Iteration | Policy Iteration |
---|---|---|
Approach | Iterates directly on the value function. | Alternates between policy evaluation and policy improvement. |
Convergence | Typically needs more iterations, but each iteration is cheap. | Usually converges in fewer, more expensive iterations. |
Computation | No explicit policy evaluation step. | Requires (approximately) solving for $V_\pi(s)$ during each evaluation step. |
5. Limitations of Exact Solution Methods
While Value Iteration and Policy Iteration guarantee finding the optimal policy for finite MDPs, they face significant limitations in real-world scenarios:
5a. Dependence on a Dynamics Model
- Both methods require the transition model $P(s' \mid s, a)$, which is often unknown in practical environments.
- Real-world problems often require model-free methods that rely on experiences rather than full knowledge of the environment.
5b. Scalability Challenges
- Exact methods loop over all states and actions, which is computationally infeasible for large state spaces.
- High-dimensional or continuous state-action spaces exacerbate this issue.
To overcome these challenges, reinforcement learning turns to function approximation (e.g., neural networks) to handle large-scale problems.
6. Moving Beyond Exact Methods: Q-Learning
Given these limitations, Q-Learning, a model-free, off-policy algorithm, offers a practical alternative: it directly estimates the value of taking each action in each state without requiring a transition model. As environments grow in complexity, tabular Q-Learning is extended with deep neural networks, leading to Deep Q-Learning with Deep Q-Networks (DQN).
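As a brief preview of the sections to come, the heart of tabular Q-Learning is a single update applied to each sampled transition $(s, a, r, s')$. The sketch below assumes `Q` is a NumPy array of shape (states, actions); the function name and the learning rate `alpha` are illustrative, not a fixed API:

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """One tabular Q-Learning update from a sampled transition (s, a, r, s')."""
    td_target = r + gamma * Q[s_next].max()      # bootstrap from the best next action
    Q[s, a] += alpha * (td_target - Q[s, a])     # move the estimate toward the target
    return Q
```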
Why Deep Q-Learning?
- Scales to high-dimensional problems, such as image-based tasks.
- Combines the flexibility of Q-Learning with the representational power of deep neural networks.
In the next sections, we will explore Q-Learning and Deep Q-Learning, examining their mechanisms, benefits, and applications in real-world scenarios.
7. Summary
Value Iteration and Policy Iteration provide a solid foundation for solving MDPs in small-scale environments. However, their reliance on complete environment knowledge and limited scalability highlight the need for alternative approaches. Understanding these methods is crucial for appreciating the evolution of reinforcement learning techniques toward more practical and scalable solutions, such as Q-Learning and Deep Q-Learning.