Value Function Estimation

Reinforcement Learning

Last updated: November 25, 2024

1. Introduction

In this lesson, we delve into the significance of value functions in reinforcement learning, particularly their role in evaluating and improving policies. Value functions estimate the expected cumulative reward, serving as a critical tool for stabilizing and enhancing learning in methods like Monte Carlo value function estimation and policy gradient algorithms.

We begin by exploring the mathematical foundation of value functions, followed by their practical importance in reinforcement learning. You'll learn how to estimate value functions using collected trajectories through Monte Carlo methods and how these estimates reduce variance in policy optimization. Finally, we introduce bootstrap value function estimation, which utilizes the Bellman equation for efficient, iterative updates.

By the end of this lesson, you’ll have a solid understanding of the role of value functions in reinforcement learning, the methods to estimate them, and their impact on policy optimization stability.

2. Value Functions

Value functions are fundamental concepts in reinforcement learning, representing the expected cumulative reward starting from a given state or state-action pair and following a specific policy π. These functions are crucial for evaluating and improving policies during the learning process.

2a. What Are Value Functions?

  1. State Value Function (Vπ(s)): The expected cumulative reward when starting in state s and following policy π:

    $$
     V^\pi(s) = \mathbb{E}_{\pi} \left[ \sum_{t=0}^\infty \gamma^t R(s_t, u_t) \mid s_0 = s \right] 
    $$
    • γ: Discount factor (0≤γ≤1), which determines the importance of future rewards.
    • R(st, ut): Reward received after taking action ut in state st.
  2. Action Value Function (Qπ(s,u)): The expected cumulative reward when starting in state s, taking action u, and then following policy π (a small numeric sketch of such a return follows after these definitions):

    $$
     Q^\pi(s, u) = \mathbb{E}_{\pi} \left[ \sum_{t=0}^\infty \gamma^t R(s_t, u_t) \mid s_0 = s, u_0 = u \right] 
    $$
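
For concreteness, here is a minimal Python sketch of the discounted return that both expectations above average over. The reward sequence and discount factor are made-up illustration values, not outputs of any particular environment.

```python
# Minimal sketch: the discounted sum of rewards inside V^pi(s) and Q^pi(s, u).
# The reward sequence and gamma are made-up illustration values.
rewards = [1.0, 0.0, 2.0, 1.0]   # R(s_t, u_t) along one sampled trajectory
gamma = 0.9                      # discount factor

discounted_return = sum(gamma**t * r for t, r in enumerate(rewards))
print(discounted_return)         # 1.0 + 0.9*0.0 + 0.81*2.0 + 0.729*1.0 = 3.349
```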

2b. Why Are Value Functions Important?

  1. Policy Evaluation: Value functions quantify how "good" a state or action is under a policy π, allowing us to evaluate the policy's performance.

  2. Policy Optimization: In policy gradient methods, value functions act as a baseline to reduce the variance of gradient estimates, enabling more stable learning.

  3. Decision Making: Value functions guide the agent's decision-making process by estimating the long-term benefits of states or actions.

2c. Relationship Between Vπ(s) and Qπ(s,u)

The state value function can be expressed in terms of the action value function as follows:

$$
 V^\pi(s) = \mathbb{E}_{u \sim \pi} \left[ Q^\pi(s, u) \right] 
$$

This shows that the value of a state is the expected value of the actions taken from that state, weighted by the policy's action probabilities.
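
As a quick check on this identity, the sketch below computes V from Q for a discrete action set; the action probabilities and Q-values are made-up numbers.

```python
# Minimal sketch of V^pi(s) = E_{u ~ pi}[Q^pi(s, u)] for a discrete action set.
# The action probabilities and Q-values are made-up illustration numbers.
pi_s = {"left": 0.2, "right": 0.8}          # pi(u | s)
q_s = {"left": 1.0, "right": 3.0}           # Q^pi(s, u)

v_s = sum(pi_s[u] * q_s[u] for u in pi_s)   # expectation over actions under pi
print(v_s)                                  # 0.2*1.0 + 0.8*3.0 = 2.6
```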

3. Monte Carlo Value Function Estimation

Monte Carlo methods estimate the value function Vπ of a policy from the empirical returns of sampled trajectories. The resulting estimate serves as a baseline in the policy gradient, reducing the variance of the gradient estimate without biasing it.

3a. Policy Gradient Estimate

The policy gradient is estimated from the collected trajectories as:
$$
 \frac{1}{m} \sum_{i=1}^m \sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta\left(u_t^{(i)} \mid s_t^{(i)}\right) \left( \sum_{k=t}^{H-1} R\left(s_k^{(i)}, u_k^{(i)}\right) - V^\pi\left(s_t^{(i)}\right) \right) 
$$
  • m: Number of trajectories collected during rollouts.
  • H: Horizon (number of steps in an episode).
  • R(st, ut): Reward received in state st after taking action ut.
  • Vπ(st): Estimated value function, serving as a baseline.

Purpose of the Baseline Vπ(st): reduces variance in the gradient estimate without altering its expectation.
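
The sketch below forms this baselined gradient estimate directly from the formula, using NumPy. The per-step score vectors ∇θ log πθ, rewards, and baseline values are random stand-ins for quantities that would normally come from real rollouts and a learned value function.

```python
import numpy as np

# Baselined policy-gradient estimate, following the formula above.
# grad_log_pi[i, t] stands in for grad_theta log pi_theta(u_t^(i) | s_t^(i)),
# rewards[i, t] for R(s_t^(i), u_t^(i)), and baseline[i, t] for V^pi(s_t^(i)).
m, H, d = 3, 5, 4                                    # trajectories, horizon, parameter dimension
rng = np.random.default_rng(0)
grad_log_pi = rng.normal(size=(m, H, d))
rewards = rng.normal(size=(m, H))
baseline = rng.normal(size=(m, H))

grad_estimate = np.zeros(d)
for i in range(m):
    for t in range(H):
        reward_to_go = rewards[i, t:].sum()          # sum_{k=t}^{H-1} R(s_k, u_k)
        advantage = reward_to_go - baseline[i, t]    # subtract the baseline V^pi(s_t)
        grad_estimate += grad_log_pi[i, t] * advantage
grad_estimate /= m                                   # average over the m trajectories
print(grad_estimate)
```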

3b. Estimating the Value Function

To estimate Vπ(s), we use a regression method based on collected trajectories.

  1. Initialization: Initialize the value function Vϕπ(s) as a neural network parameterized by ϕ.

  2. Trajectory Collection: Gather trajectories {τ1,…,τm} by rolling out the current policy πθ.

  3. Regression Against Empirical Returns:

    • For each state st(i), compute the Monte Carlo return Gt(i), which is the cumulative reward from time step t onward (a sketch of this computation follows the list):

      $$
       G_t^{(i)} = \sum_{k=t}^{H-1} R\left(s_k^{(i)}, u_k^{(i)}\right) 
      $$
    • Fit Vϕπ(s) by minimizing the squared error between the estimated value function and the Monte Carlo returns:
      $$
       \phi \leftarrow \arg \min_\phi \frac{1}{m} \sum_{i=1}^m \sum_{t=0}^{H-1} \left( V^\pi_\phi\left(s_t^{(i)}\right) - G_t^{(i)} \right)^2 
      $$
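
A minimal sketch of the return computation in the first step, assuming a single trajectory stored as a plain list of rewards (the numbers are made up):

```python
# Reward-to-go returns G_t = sum_{k=t}^{H-1} R(s_k, u_k) for one trajectory.
def monte_carlo_returns(rewards):
    """Compute G_t for every time step by summing backwards over the rewards."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]        # add R(s_t, u_t) to the suffix sum
        returns[t] = running
    return returns

print(monte_carlo_returns([1.0, 0.0, 2.0, 1.0]))   # [4.0, 3.0, 3.0, 1.0]
```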

3c. Algorithm Summary

  • Input: Initial policy πθ, value function Vϕπ, number of trajectories m, horizon H.
  • Steps:
    1. Initialize Vϕπ
    2. Collect m trajectories τ1,…,τm using πθ.
    3. Compute Monte Carlo returns Gt(i) for all time steps t and trajectories i.
    4. Update ϕ by minimizing the regression loss (a sketch of this step follows the list).
    5. Repeat for subsequent updates of πθ.
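
The sketch below implements the regression step of this loop (steps 1 and 4), using PyTorch as an assumed framework; the network size, learning rate, and the random tensors standing in for collected states and returns are illustrative choices, not prescribed by the lesson.

```python
import torch
import torch.nn as nn

# Fit V_phi(s) to the Monte Carlo returns G_t^(i) by minimizing the squared error.
state_dim = 4                                          # assumed state dimensionality
value_net = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)

# Stand-ins for the states s_t^(i) and returns G_t^(i) gathered from m rollouts.
states = torch.randn(256, state_dim)
returns = torch.randn(256)

for _ in range(100):                                   # a few gradient steps on the regression loss
    predictions = value_net(states).squeeze(-1)        # V_phi(s_t^(i))
    loss = ((predictions - returns) ** 2).mean()       # squared error against the returns
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In practice this fit is interleaved with policy updates: after each batch of rollouts, the returns are recomputed and ϕ is refit before the next update of πθ.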

4. Bootstrap Value Function Estimation

The Bootstrap Estimation method utilizes the Bellman Equation for policy evaluation to estimate the value function Vπ(s). This iterative approach, often referred to as "bootstrapping," combines sampled transitions and dynamic programming principles to approximate the value of states effectively.

4a. Key Concepts

  1. Bellman Equation: The value of a state s under a policy π is calculated as the expected immediate reward plus the discounted value of the next state:

    $$
     V^\pi(s) = \sum_u \pi(u \mid s) \sum_{s'} P(s' \mid s, u) \left[ R(s, u, s') + \gamma V^\pi(s') \right] 
    $$
    • π(u∣s): Probability of taking action u in state s.
    • P(s′∣s,u): Transition probability from state s to s′ given action u.
    • R(s,u,s′): Reward received after transitioning to state s′.
    • γ: Discount factor that weighs future rewards.
  2. Data Collection: Collect transition samples (s,u,s′,r), which include:

    • Current state s.
    • Action u taken.
    • Next state s′.
    • Reward r observed.
  3. Fitted Value Iteration: Update the parameters ϕ of the value function Vϕπ(s) iteratively. At each iteration i, the parameters are optimized by minimizing a loss of the following form (a sketch follows after this list):

    $$
     \phi_{i+1} \leftarrow \arg \min_\phi \sum_{(s, u, s', r)} \left( V^\pi_\phi(s) - \left[ r + \gamma V^\pi_{\phi_i}(s') \right] \right)^2 + \lambda \left\| \phi - \phi_i \right\|_2^2 
    $$
    • The first term minimizes the Bellman error, regressing Vϕπ(s) toward the bootstrapped target r + γVϕiπ(s′) computed from the sampled transitions.
    • The regularization term λ∥ϕ−ϕi∥₂² prevents drastic parameter updates.
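
The sketch below performs one fitted-value-iteration update of this form, again assuming PyTorch and random tensors as stand-ins for the collected transitions (s, u, s′, r); γ, λ, and the network architecture are illustrative choices.

```python
import copy
import torch
import torch.nn as nn

# One fitted-value-iteration update: regress V_phi(s) onto the bootstrapped
# target r + gamma * V_{phi_i}(s'), with a penalty on ||phi - phi_i||_2^2.
state_dim, gamma, lam = 4, 0.99, 1e-2                  # illustrative values
value_net = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, 1))
prev_net = copy.deepcopy(value_net)                    # frozen copy holding phi_i
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)

# Stand-ins for a batch of sampled transitions (s, u, s', r); the action u is
# not needed here because the target only uses the reward and the next state.
states = torch.randn(256, state_dim)
next_states = torch.randn(256, state_dim)
rewards = torch.randn(256)

for _ in range(50):
    with torch.no_grad():                              # bootstrapped target uses the old parameters phi_i
        targets = rewards + gamma * prev_net(next_states).squeeze(-1)
    bellman_error = ((value_net(states).squeeze(-1) - targets) ** 2).mean()
    # Regularizer lambda * ||phi - phi_i||_2^2 discourages drastic parameter updates.
    prox = sum(((p - q.detach()) ** 2).sum() for p, q in zip(value_net.parameters(), prev_net.parameters()))
    loss = bellman_error + lam * prox
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```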

4b. Purpose and Advantages

This approach bootstraps the value function by iteratively refining it based on sampled transitions. It effectively balances:

  • Dynamic Programming (by leveraging the Bellman equation).
  • Data-Driven Updates (using collected samples).

This makes the method suitable for large-scale problems, where exact solutions are computationally infeasible.

5. Summary

Value functions lie at the heart of reinforcement learning, enabling agents to evaluate and optimize policies by estimating the expected cumulative rewards.

  • Monte Carlo Estimation: This approach estimates value functions based on empirical returns from sampled trajectories, offering a variance-reduction mechanism for policy gradient methods and stabilizing the learning process.

  • Bootstrap Estimation: By leveraging the Bellman equation, bootstrap methods iteratively refine value function estimates using sampled transitions, combining dynamic programming with data-driven updates for scalability and efficiency.

Together, these methods highlight the importance of value functions in improving policy evaluation and optimization, setting the stage for advanced reinforcement learning techniques. The concepts covered here provide a robust foundation for tackling more sophisticated algorithms introduced in subsequent lessons.
