Value Functions: The Foundation of Policy Evaluation and Optimization

Deep Reinforcement Learning

Last updated: January 01, 2025

1. Introduction

Value functions lie at the core of reinforcement learning (RL). They quantify how good it is to be in a certain state (or to take a certain action in that state) under a given policy. By estimating expected returns, value functions help the agent evaluate how well a policy is performing, reduce the variance of policy gradient estimates, and identify promising states and actions to prioritize.

In this lesson, we cover the mathematical foundation of value functions and practical ways to estimate them, such as Monte Carlo and bootstrapping with the Bellman equation. By the end, you’ll see why value functions are so critical for stable and efficient learning.

2. Value Functions Overview

2a. Definition

  1. State Value Function $V^\pi(s)$:

    $$V^\pi(s) \;=\; \mathbb{E}_{\pi}\!\Bigl[\sum_{t=0}^{\infty} \gamma^t\,R(s_t,u_t)\,\bigm|\,s_0 = s\Bigr]$$
    • Reflects the expected discounted reward when starting in state $s$ and following policy $\pi$.
    • $\gamma \in [0,1]$ is the discount factor.
  2. Action Value Function $Q^\pi(s,u)$:

    $$Q^\pi(s,u) \;=\; \mathbb{E}_{\pi}\!\Bigl[\sum_{t=0}^{\infty} \gamma^t\,R(s_t,u_t)\,\bigm|\,s_0 = s,\, u_0 = u\Bigr].$$
    • Measures how good an action $u$ is from state $s$ before continuing with policy $\pi$.

These functions are the workhorses of many RL algorithms because they provide a quantitative way to compare states and actions.

2b. Why Value Functions Matter

  1. Policy Evaluation: You can track $V^\pi$ or $Q^\pi$ to see if a policy is improving or if it still needs adjustments.
  2. Variance Reduction: In policy gradient methods, having a good estimate of $V^\pi(s)$ (the baseline) can significantly reduce gradient variance, speeding up training.
  3. Decision Making: Value estimates help the agent identify promising actions or states to prioritize.

2c. Relationship between $V^\pi$ and $Q^\pi$

$$V^\pi(s)\;=\; \mathbb{E}_{u \sim \pi(\cdot \mid s)}\!\bigl[\, Q^\pi(s,u) \bigr].$$

That is, the value of state $s$ is the expectation of the action-value function over the actions the policy would take.
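For a discrete action space, this expectation is just a probability-weighted sum over actions. The NumPy sketch below illustrates it for a single state with made-up policy probabilities and Q-values (not taken from any real task):

```python
import numpy as np

# Hypothetical policy and action-value estimates for one state s
# with three discrete actions (illustrative numbers only).
pi_s = np.array([0.2, 0.5, 0.3])   # pi(u | s), sums to 1
q_s  = np.array([1.0, 2.0, 0.5])   # Q^pi(s, u) for each action u

# V^pi(s) = E_{u ~ pi(.|s)}[ Q^pi(s, u) ]
v_s = np.dot(pi_s, q_s)
print(v_s)  # 0.2*1.0 + 0.5*2.0 + 0.3*0.5 = 1.35
```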

3. Monte Carlo Value Estimation

In Monte Carlo (MC) methods, you estimate value functions from empirical returns computed over complete episodes.

3a. How It Works

  1. Rollouts: Collect trajectories (episodes) $\tau$ by following policy $\pi$ in the environment (e.g., Lunar Lander).
  2. Returns: For each state $s_t$, compute the cumulative discounted return from that point onward:
    $$R_t = \sum_{k=t}^{H-1} \gamma^{k-t}\,R(s_k,\,u_k),$$
    where $H$ is the episode length.
  3. Fit a Baseline $V_\phi^\pi(s)$ (e.g., a neural network) to minimize the difference between predicted and actual returns: $$\min_\phi \sum_{t}(V_\phi^\pi(s_t) - R_t)^2.$$
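A minimal NumPy sketch of step 2, using an illustrative reward sequence; the resulting $R_t$ values are exactly the targets that step 3 regresses $V_\phi^\pi$ onto:

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Compute R_t = sum_{k=t}^{H-1} gamma^(k-t) * r_k for every time step t."""
    returns = np.zeros(len(rewards))
    running = 0.0
    # Work backwards so each step reuses the recursion R_t = r_t + gamma * R_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Illustrative 3-step episode of rewards (placeholder numbers).
print(discounted_returns([1.0, 0.0, 2.0], gamma=0.5))  # [1.5, 1.0, 2.0]
```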

3b. Why Use Monte Carlo Estimates?

Monte Carlo targets are unbiased samples of $V^\pi(s)$: they come straight from observed rewards, with no model of the environment and no reliance on the current value estimate. The trade-offs are high variance and the need to wait until an episode ends before any of its states can be updated.

3c. Example: Lunar Lander Baseline
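A minimal end-to-end sketch, assuming the Gymnasium version of Lunar Lander (the environment id may be "LunarLander-v2" or "LunarLander-v3" depending on your gymnasium/Box2D install), a random policy standing in for $\pi$, and a small PyTorch MLP as the baseline $V_\phi^\pi$:

```python
import gymnasium as gym   # assumes gymnasium with the Box2D extras installed
import numpy as np
import torch
import torch.nn as nn

# Hypothetical setup: random actions stand in for the policy pi.
env = gym.make("LunarLander-v2")
gamma = 0.99

# Small MLP baseline V_phi(s); LunarLander observations are 8-dimensional.
value_net = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)

for _ in range(50):                                  # a few fitting iterations
    # 1. Rollout: collect one episode under the (random) policy.
    states, rewards = [], []
    obs, _ = env.reset()
    done = False
    while not done:
        action = env.action_space.sample()           # stand-in for u ~ pi(. | s)
        next_obs, reward, terminated, truncated, _ = env.step(action)
        states.append(obs)
        rewards.append(reward)
        obs = next_obs
        done = terminated or truncated

    # 2. Returns: R_t = r_t + gamma * R_{t+1}, computed backwards.
    returns, running = [], 0.0
    for rwd in reversed(rewards):
        running = rwd + gamma * running
        returns.append(running)
    returns.reverse()

    # 3. Fit the baseline by regressing V_phi(s_t) onto R_t.
    s = torch.tensor(np.array(states), dtype=torch.float32)
    targets = torch.tensor(returns, dtype=torch.float32)
    preds = value_net(s).squeeze(-1)
    loss = ((preds - targets) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In an actual policy gradient implementation, the random action would be replaced by a sample from the current policy, and the fitted $V_\phi^\pi(s_t)$ would be subtracted from $R_t$ as the variance-reducing baseline described in section 2b.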

4. Bootstrap Value Estimation (Bellman Equation)

Monte Carlo requires full trajectories to estimate returns. Bootstrap methods (like Temporal-Difference or Dynamic Programming) can update value estimates before the episode ends using the Bellman equation.

4a. Bellman Equation for Policy Evaluation

$$V^\pi(s)\;=\; \sum_{u} \pi(u \mid s)\,\sum_{s'} P(s' \mid s,u)\,\bigl[R(s,u,s') + \gamma\,V^\pi(s')\bigr].$$

In practice, we estimate this expectation from sampled transitions $(s,u,r,s')$, using the bootstrap target $$V_\phi^\pi(s) \;\approx\; r + \gamma\,V_\phi^\pi(s').$$
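For instance, with illustrative numbers $r = 1$, $\gamma = 0.99$, and a current estimate $V_\phi^\pi(s') = 5.0$, the bootstrap target for $V_\phi^\pi(s)$ is $1 + 0.99 \times 5.0 = 5.95$.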

4b. Fitted Value Iteration (Bootstrap Approach)

  1. Collect transitions $(s,u,r,s')$.
  2. Update the value network $V_\phi^\pi$ by minimizing: $\Bigl(r + \gamma V_\phi^\pi(s') - V_\phi^\pi(s)\Bigr)^2.$
  3. Iterate until stable.
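A minimal PyTorch sketch of this loop, assuming transitions are stored as (state, action, reward, next_state, done) tuples with 8-dimensional states as in the Lunar Lander example. Masking terminal states with (1 - done) and freezing the target with torch.no_grad() are common refinements not spelled out in the steps above:

```python
import torch
import torch.nn as nn

value_net = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)
gamma = 0.99

def fitted_value_update(transitions):
    """One bootstrap update over a batch of (s, u, r, s', done) transitions."""
    s    = torch.tensor([t[0] for t in transitions], dtype=torch.float32)
    r    = torch.tensor([t[2] for t in transitions], dtype=torch.float32)
    s2   = torch.tensor([t[3] for t in transitions], dtype=torch.float32)
    done = torch.tensor([t[4] for t in transitions], dtype=torch.float32)

    with torch.no_grad():                        # hold the bootstrap target fixed
        target = r + gamma * (1.0 - done) * value_net(s2).squeeze(-1)

    pred = value_net(s).squeeze(-1)
    loss = ((target - pred) ** 2).mean()         # (r + gamma V(s') - V(s))^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```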

4c. Why Bootstrap?

Bootstrapping lets you update $V_\phi^\pi$ from individual transitions as they arrive, without waiting for the episode to finish, and its targets typically have lower variance than Monte Carlo returns. The price is bias: each target depends on the current, possibly inaccurate, estimate $V_\phi^\pi(s')$.

5. Summary: Value Functions in a Nutshell

$V^\pi$ and $Q^\pi$ measure the expected discounted return under a policy and are the basis for policy evaluation, variance reduction via baselines, and decision making. Monte Carlo estimation fits the value function to complete-episode returns (unbiased but high variance), while bootstrapping fits it to Bellman targets $r + \gamma\,V_\phi^\pi(s')$ built from single transitions (lower variance but biased). Both estimators reappear throughout deep RL, most immediately as baselines in policy gradient methods.
