Value Functions: The Foundation of Policy Evaluation and Optimization
Last updated: January 01, 2025
1. Introduction
Value functions lie at the core of reinforcement learning (RL). They quantify how good it is to be in a certain state (or to take a certain action in that state) under a given policy. By estimating expected returns, value functions help:
- Evaluate how well the agent is doing under its current policy.
- Improve policy updates by reducing variance (serving as a “baseline” in policy gradients).
- Guide the agent’s decision-making (in combination with or instead of Q-functions).
In this lesson, we cover the mathematical foundation of value functions and practical ways to estimate them, such as Monte Carlo and bootstrapping with the Bellman equation. By the end, you’ll see why value functions are so critical for stable and efficient learning.
2. Value Functions Overview
2a. Definition
- State Value Function $V^\pi(s)$:
  $$V^\pi(s) \;=\; \mathbb{E}_{\pi}\!\Bigl[\sum_{t=0}^{\infty} \gamma^t\,R(s_t,u_t)\,\bigm|\,s_0 = s\Bigr]$$
  - Reflects the expected discounted return when starting in state $s$ and following policy $\pi$.
  - $\gamma \in [0,1]$ is the discount factor.
- Action Value Function $Q^\pi(s,u)$:
  $$Q^\pi(s,u) \;=\; \mathbb{E}_{\pi}\!\Bigl[\sum_{t=0}^{\infty} \gamma^t\,R(s_t,u_t)\,\bigm|\,s_0 = s,\, u_0 = u\Bigr].$$
  - Measures the expected discounted return when taking action $u$ in state $s$ and then following policy $\pi$.
These functions are the workhorses of many RL algorithms because they provide a quantitative way to compare states and actions.
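To make the discounted sum in these definitions concrete, here is a minimal sketch in plain Python with made-up reward values; the rewards and $\gamma$ are purely illustrative, not taken from any particular environment:

```python
# Minimal illustration of a discounted return (hypothetical rewards).
gamma = 0.99                      # discount factor, gamma in [0, 1]
rewards = [1.0, 0.0, -0.5, 2.0]   # r_0, r_1, r_2, r_3 from one rollout

# G = r_0 + gamma * r_1 + gamma^2 * r_2 + ...
discounted_return = sum(gamma**t * r for t, r in enumerate(rewards))
print(discounted_return)          # 1.0 + 0.0 - 0.49005 + 1.9406... ≈ 2.45
```

The value functions above are exactly the expectation of this quantity over trajectories generated by $\pi$ (and, for $Q^\pi$, with the first action fixed).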
2b. Why Value Functions Matter
- Policy Evaluation: You can track $V^\pi$ or $Q^\pi$ to see if a policy is improving or if it still needs adjustments.
- Variance Reduction: In policy gradient methods, having a good estimate of $V^\pi(s)$ (the baseline) can significantly reduce gradient variance, speeding up training.
- Decision Making: Value estimates help the agent identify promising actions or states to prioritize.
2c. Relationship between $V^\pi$ and $Q^\pi$
$$V^\pi(s)\;=\; \mathbb{E}_{u \sim \pi(\cdot \mid s)}\!\bigl[\, Q^\pi(s,u) \bigr].$$
That is, the value of state $s$ is the expectation of the action-value function over the actions the policy would take.
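As a quick sanity check of this identity, the sketch below (NumPy, with made-up numbers for a single state and three discrete actions) averages hypothetical $Q^\pi(s,\cdot)$ values under the policy's action probabilities:

```python
import numpy as np

# Hypothetical action probabilities pi(u | s) and action values Q^pi(s, u)
# for one state s with three discrete actions.
pi_s = np.array([0.2, 0.5, 0.3])      # must sum to 1
q_s = np.array([1.0, 3.0, -1.0])      # Q^pi(s, u) for each action u

# V^pi(s) = E_{u ~ pi(.|s)}[ Q^pi(s, u) ]
v_s = np.dot(pi_s, q_s)
print(v_s)                            # 0.2*1.0 + 0.5*3.0 + 0.3*(-1.0) = 1.4
```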
3. Monte Carlo Value Estimation
In Monte Carlo (MC) methods, you estimate value functions using empirical returns computed from complete (or sufficiently long) episodes.
3a. How It Works
- Rollouts: Collect trajectories (episodes) $\tau$ by following policy $\pi$ in the environment (e.g., Lunar Lander).
- Returns: For each state $s_t$, compute the cumulative discounted return from that point onward:
  $$R_t = \sum_{k=t}^{H-1} \gamma^{k-t}\,R(s_k,\,u_k),$$
  where $H$ is the episode length.
- Fit a Baseline: Train $V_\phi^\pi(s)$ (e.g., a neural network) to minimize the squared difference between predicted and actual returns:
  $$\min_\phi \sum_{t}\bigl(V_\phi^\pi(s_t) - R_t\bigr)^2.$$
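A minimal sketch of the return computation, which produces the regression targets $R_t$ for the baseline, assuming NumPy and hypothetical reward values; the baseline fit itself is then ordinary squared-error regression on these targets:

```python
import numpy as np

def reward_to_go(rewards, gamma=0.99):
    """Compute R_t = sum_{k>=t} gamma^(k-t) * r_k for every step of one episode."""
    returns = np.zeros_like(rewards, dtype=np.float64)
    running = 0.0
    # Iterate backwards so each return reuses the one after it: R_t = r_t + gamma * R_{t+1}
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Hypothetical rewards from one short episode.
rewards = np.array([0.0, 0.0, 1.0, -0.5])
targets = reward_to_go(rewards)       # regression targets R_t for V_phi
print(targets)                        # approx. [0.495, 0.500, 0.505, -0.5]
```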
3b. Why Use Monte Carlo Estimates?
- Direct Empirical Returns: No need for a model of dynamics; you rely on real outcomes.
- Variance Reduction in Policy Gradients: Using $R_t - V^\pi(s_t)$ as an advantage estimate helps the policy gradient focus on actions that outperform expectations.
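For example, once a baseline is available, the advantage estimates are just an element-wise difference; this sketch uses hypothetical arrays, and the normalization step is a common practical choice rather than part of the definition:

```python
import numpy as np

returns = np.array([2.1, 0.8, -0.3, 1.5])     # Monte Carlo returns R_t
values = np.array([1.9, 1.0, 0.2, 1.2])       # baseline predictions V_phi(s_t)

advantages = returns - values                  # A_t = R_t - V_phi(s_t)
# Optional: normalize advantages to zero mean / unit std for extra stability.
advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
```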
3c. Example: Lunar Lander Baseline
- Roll out a handful of Lunar Lander episodes (say 10–20) using your current policy.
- For each state visited, compute the return from that point until crash/landing.
- Train a simple MLP $V_\phi^\pi(s)$ to predict these returns. Over multiple iterations, your value function approximates how “good” it is to be in a particular velocity/position configuration.
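As a concrete sketch of that loop, the code below rolls out a random-action stand-in for "your current policy" in Gymnasium's Lunar Lander (requires the Box2D extras), computes reward-to-go targets, and fits a small PyTorch MLP by squared-error regression. The environment id, network sizes, epoch count, and the random policy are all placeholder choices:

```python
import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn

env = gym.make("LunarLander-v2")          # use "LunarLander-v3" on newer Gymnasium
gamma = 0.99

def rollout(env):
    """Collect one episode with a random policy (stand-in for your current policy pi)."""
    states, rewards = [], []
    obs, _ = env.reset()
    done = False
    while not done:
        action = env.action_space.sample()          # replace with a sample from pi(. | obs)
        next_obs, reward, terminated, truncated, _ = env.step(action)
        states.append(obs)
        rewards.append(reward)
        obs = next_obs
        done = terminated or truncated
    return np.array(states, dtype=np.float32), np.array(rewards, dtype=np.float32)

def reward_to_go(rewards, gamma):
    """Discounted return from each timestep to the end of the episode."""
    returns, running = np.zeros_like(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Gather returns from a handful of episodes (10 here, in the 10-20 range above).
all_states, all_returns = [], []
for _ in range(10):
    s, r = rollout(env)
    all_states.append(s)
    all_returns.append(reward_to_go(r, gamma))
states = torch.as_tensor(np.concatenate(all_states))
returns = torch.as_tensor(np.concatenate(all_returns)).unsqueeze(-1)

# Simple MLP baseline V_phi(s): 8-dim Lunar Lander observation -> scalar value.
value_net = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)

for _ in range(200):                                     # a few regression epochs
    loss = ((value_net(states) - returns) ** 2).mean()   # (V_phi(s_t) - R_t)^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In a full policy gradient loop, you would repeat this after every batch of rollouts so the baseline tracks the current policy.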
4. Bootstrap Value Estimation (Bellman Equation)
Monte Carlo requires full trajectories to estimate returns. Bootstrap methods (like Temporal-Difference or Dynamic Programming) can update value estimates before the episode ends using the Bellman equation.
4a. Bellman Equation for Policy Evaluation
$$V^\pi(s)\;=\; \sum_{u} \pi(u \mid s)\,\sum_{s'} P(s' \mid s,u)\,\bigl[R(s,u,s') + \gamma\,V^\pi(s')\bigr].$$
In practice, we estimate this expectation from sampled transitions $(s,u,r,s')$ using the bootstrapped target: $$V_\phi^\pi(s) \;\approx\; r + \gamma\,V_\phi^\pi(s').$$
4b. Fitted Value Iteration (Bootstrap Approach)
- Collect transitions $(s,u,r,s')$.
- Update the value network $V_\phi^\pi$ by minimizing: $\Bigl(r + \gamma V_\phi^\pi(s') - V_\phi^\pi(s)\Bigr)^2.$
- Iterate until stable.
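A minimal sketch of one such update in PyTorch, assuming batched transition tensors are already available. Detaching the bootstrapped target so the gradient only flows through $V_\phi^\pi(s)$, and masking the next-state value at terminal states, are common practical choices rather than something the equation above dictates:

```python
import torch
import torch.nn as nn

gamma = 0.99
value_net = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)

def td_update(states, rewards, next_states, dones):
    """One bootstrapped update on a batch of transitions (s, u, r, s')."""
    with torch.no_grad():
        # Bootstrapped target r + gamma * V_phi(s'); zero out V_phi(s') at terminal states.
        targets = rewards + gamma * value_net(next_states) * (1.0 - dones)
    td_error = targets - value_net(states)
    loss = (td_error ** 2).mean()      # (r + gamma V_phi(s') - V_phi(s))^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example call with hypothetical batch tensors (batch of 32, 8-dim states),
# where rewards and dones have shape (32, 1):
# loss = td_update(states, rewards, next_states, dones)
```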
4c. Why Bootstrap?
- Efficiency: You don’t need to wait until the end of the episode to update.
- Scalability: Effective for large problems if combined with function approximation (like neural nets).
5. Summary: Value Functions in a Nutshell
- They form the backbone of many RL strategies by providing a measure of future return.
- Monte Carlo offers a straightforward way to gather end-of-episode returns to train the value function.
- Bootstrap methods use the Bellman equation for iterative updates, reducing dependence on full episodes.
- Policy Gradient approaches heavily rely on good value estimates (baselines) for stable, efficient learning.