Understanding Baseline Subtraction in Policy Gradient Methods
Last updated: November 25, 2024
1. Introduction
In reinforcement learning, policy gradient methods enable agents to directly optimize their behavior by improving policies based on rewards. However, these methods often suffer from high variance in gradient estimates, which can lead to instability and slow learning. To address this, a baseline is introduced—a reference value subtracted from rewards to stabilize training without altering the gradient's expected value.
Two common baselines are:
- Trajectory-independent: The average reward across trajectories.
- State-dependent: The value function $V(s_t)$, representing the expected cumulative reward from state $s_t$.
2. Intuition Behind Policy Gradient and Baseline Subtraction
- Basic Policy Gradient:
  - Policy gradients aim to increase the probability of favorable trajectories by maximizing the expected reward. Each sampled trajectory contributes a gradient term of:
$$
\nabla_\theta \log P(\tau^{(i)}; \theta) \cdot R(\tau^{(i)})
$$
  - Actions leading to higher rewards increase in probability, while those with lower rewards decrease.
- Issue with Basic Rewards:
  - In environments where rewards are always positive (e.g., $R \in [0, 1]$), the algorithm increases probabilities for all actions, even mediocre ones.
  - This can lead to suboptimal trajectories dominating the learning process and slow convergence.
- Need for Subtlety:
  - Instead of focusing on absolute rewards, the algorithm should favor actions that perform better than average and penalize those that perform worse.
- Solution: Baseline Subtraction:
  - By subtracting a baseline (e.g., the average reward or the value function) from the total reward, we can focus on the relative advantage of actions.
  - This improves learning by reducing the variance of gradient estimates and ensuring efficient updates to the policy.
This allows the agent to prioritize improvements over average outcomes, reducing noise in updates and stabilizing learning.
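To make the contrast concrete, here is a minimal numerical sketch (the returns are hypothetical hard-coded values, not taken from any particular environment) showing how subtracting the average return turns all-positive trajectory weights into signed ones:

```python
import numpy as np

# Hypothetical total returns for five sampled trajectories; all positive (R in [0, 1]).
returns = np.array([0.9, 0.6, 0.5, 0.4, 0.2])

# Without a baseline, every weight is positive, so the probability of every
# sampled trajectory is pushed up, including the mediocre ones.
weights_no_baseline = returns

# With the average return as baseline, weights become signed: above-average
# trajectories are reinforced, below-average ones are discouraged.
baseline = returns.mean()
weights_with_baseline = returns - baseline

print("weights without baseline:", weights_no_baseline)
print("weights with baseline:   ", weights_with_baseline)
```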
3. Baseline Subtraction in Policy Gradient
3a. Gradient with Baseline
The standard policy gradient method computes the gradient as:
$$
\nabla_\theta U(\theta) = \mathbb{E}_{\tau \sim P_\theta} \left[ \nabla_\theta \log P(\tau; \theta) \cdot R(\tau) \right]
$$
To reduce the variance of this estimate, we introduce a baseline $b$, modifying the expression to:
$$
\nabla_\theta U(\theta) = \mathbb{E}_{\tau \sim P_\theta} \left[ \nabla_\theta \log P(\tau; \theta) \cdot \left( R(\tau) - b \right) \right]
$$
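As a minimal sketch of the corresponding Monte Carlo estimate, assume we have already sampled $m$ trajectories and computed, for each one, its total return $R(\tau^{(i)})$ and its score $\nabla_\theta \log P(\tau^{(i)}; \theta)$ (the NumPy arrays below are hypothetical placeholders for those quantities):

```python
import numpy as np

def pg_estimate(grad_log_probs, returns, baseline=0.0):
    """Monte Carlo policy-gradient estimate with an optional scalar baseline b.

    grad_log_probs: shape (m, d) -- score grad_theta log P(tau_i; theta) per trajectory.
    returns:        shape (m,)   -- total reward R(tau_i) per trajectory.
    baseline:       scalar b subtracted from every return.
    """
    weights = returns - baseline                      # R(tau_i) - b
    return (grad_log_probs * weights[:, None]).mean(axis=0)

# Hypothetical sampled data: m = 4 trajectories, theta has d = 3 components.
rng = np.random.default_rng(0)
grads = rng.normal(size=(4, 3))
rets = np.array([1.0, 0.3, 0.7, 0.2])

g_no_baseline = pg_estimate(grads, rets)                          # b = 0
g_with_baseline = pg_estimate(grads, rets, baseline=rets.mean())  # b = mean return
print(g_no_baseline, g_with_baseline)
```

Both estimates have the same expected value; the baseline only changes their variance.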
3b. Intuition Behind Baseline Subtraction
- The baseline $b$ does not depend on the actions taken, so subtracting it leaves the gradient's expected value unchanged.
- Probabilities increase for trajectories where $R(\tau) > b$ and decrease otherwise.
- This reduces variance and focuses updates on meaningful differences in rewards.
3c. Why Does This Work?
The introduction of a baseline is a simple yet powerful enhancement: it reduces variance while keeping the gradient estimate unbiased, making policy gradient methods more practical for real-world applications.
- Unbiased Nature: The expectation of $\nabla_\theta \log P(\tau; \theta) \cdot b$ is zero, because $b$ is a constant and the expected score $\mathbb{E}_{\tau \sim P_\theta}[\nabla_\theta \log P(\tau; \theta)]$ vanishes.
- Variance Reduction: Subtracting $b$ lowers gradient variance, leading to faster and more stable learning.
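To see why the baseline term vanishes in expectation, note that $b$ factors out of the expected score, which is always zero:
$$
\mathbb{E}_{\tau \sim P_\theta}\left[ \nabla_\theta \log P(\tau; \theta) \cdot b \right]
= b \sum_{\tau} P(\tau; \theta)\, \nabla_\theta \log P(\tau; \theta)
= b \sum_{\tau} \nabla_\theta P(\tau; \theta)
= b\, \nabla_\theta \sum_{\tau} P(\tau; \theta)
= b\, \nabla_\theta 1
= 0.
$$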
4. Temporal Decomposition of Rewards
The trajectory-level gradient can be broken into time-step contributions. Using the Markov property, the trajectory log-probability decomposes into policy terms and dynamics terms:
$$
\log P_\theta(\tau) = \sum_{t=0}^{H-1} \log \pi_\theta(u_t | s_t) + \sum_{t=0}^{H-1} \log P(s_{t+1} | s_t, u_t),
$$
and because the dynamics terms do not depend on $\theta$, only the policy terms contribute to $\nabla_\theta \log P_\theta(\tau)$.
Similarly, the total reward decomposes as:
$$
R(\tau) = \sum_{t=0}^{H-1} r(s_t, u_t)
$$
This allows the gradient to focus on individual time-step contributions, simplifying computations.
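As a small sketch of this decomposition in code (a toy tabular softmax policy with hypothetical numbers; only the policy's contribution to $\log P_\theta(\tau)$ is shown, since the dynamics terms do not depend on $\theta$):

```python
import numpy as np

# Toy tabular softmax policy: logits[s, u] parameterize pi_theta(u | s)
# for 3 states and 2 actions (hypothetical values).
logits = np.array([[ 0.2, -0.1],
                   [ 0.5,  0.0],
                   [-0.3,  0.4]])
pi = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# A hypothetical trajectory of (state, action, reward) triples with H = 3.
trajectory = [(0, 1, 0.0), (2, 0, 0.5), (1, 1, 1.0)]

# Policy part of log P_theta(tau): sum_t log pi_theta(u_t | s_t).
log_prob_policy = sum(np.log(pi[s, u]) for s, u, _ in trajectory)

# Total reward decomposes as the sum of per-step rewards: R(tau) = sum_t r(s_t, u_t).
total_reward = sum(r for _, _, r in trajectory)
print(log_prob_policy, total_reward)
```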
5. Eliminating Irrelevant Past Rewards
Actions taken at time $t$ cannot influence rewards received at earlier time steps, since those rewards are already determined. Their contribution to the gradient is therefore zero in expectation:
$$
\mathbb{E}\left[ \nabla_\theta \log \pi_\theta(u_t | s_t) \cdot \sum_{k=0}^{t-1} r(s_k, u_k) \right] = 0
$$
The gradient simplifies to:
$$
\hat{g} = \frac{1}{m} \sum_{i=1}^m \sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta(u_t^{(i)} | s_t^{(i)}) \cdot \left( \sum_{k=t}^{H-1} r(s_k^{(i)}, u_k^{(i)}) - b \right)
$$
This focuses the gradient on future rewards, reducing variance and improving efficiency.
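A minimal sketch of the reward-to-go term $\sum_{k=t}^{H-1} r(s_k, u_k)$ used in this estimator (pure NumPy; the reward sequence is a hypothetical placeholder):

```python
import numpy as np

def rewards_to_go(rewards):
    """R_t = sum of r_k for k >= t, at every time step t (undiscounted)."""
    rewards = np.asarray(rewards, dtype=float)
    # Reverse cumulative sum: flip, cumsum, flip back.
    return np.flip(np.cumsum(np.flip(rewards)))

# Hypothetical per-step rewards of one trajectory.
r = [0.0, 1.0, 0.0, 2.0]
rtg = rewards_to_go(r)
print(rtg)                  # [3. 3. 2. 2.]

# One simple choice of constant baseline b is the mean reward-to-go.
print(rtg - rtg.mean())     # signed weights fed into the estimator above
```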
5a. Practical Policy Gradient Equation
With irrelevant terms removed and a state-dependent baseline V(st), the policy gradient is:
$$
\hat{g} = \frac{1}{m} \sum_{i=1}^m \sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta(u_t^{(i)} | s_t^{(i)}) \cdot \left( R_t^{(i)} - V(s_t^{(i)}) \right),
$$
where:
- $R_t^{(i)} = \sum_{k=t}^{H-1} r(s_k^{(i)}, u_k^{(i)})$ is the future reward (reward-to-go) from time $t$,
- $V(s_t^{(i)})$ is the state-value function baseline.
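Putting the pieces together, here is a sketch of how this estimator might be assembled for a toy tabular softmax policy (the value baseline `V` is a hypothetical placeholder, e.g. a Monte Carlo estimate; this is not a full training loop):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def grad_log_softmax(logits_s, action):
    """Gradient of log pi_theta(action | s) w.r.t. the logits of state s."""
    g = -softmax(logits_s)
    g[action] += 1.0
    return g

def policy_gradient_with_baseline(trajectories, logits, V):
    """hat{g} = (1/m) sum_i sum_t grad log pi(u_t | s_t) * (R_t - V(s_t)).

    trajectories: list of trajectories, each a list of (s, u, r) tuples.
    logits:       array (num_states, num_actions), softmax policy parameters.
    V:            array (num_states,), baseline value estimates.
    """
    grad = np.zeros_like(logits)
    for traj in trajectories:
        rewards = np.array([r for _, _, r in traj], dtype=float)
        rtg = np.flip(np.cumsum(np.flip(rewards)))     # R_t, reward-to-go
        for t, (s, u, _) in enumerate(traj):
            advantage = rtg[t] - V[s]                  # R_t - V(s_t)
            grad[s] += grad_log_softmax(logits[s], u) * advantage
    return grad / len(trajectories)

# Hypothetical setup: 2 states, 2 actions, two sampled trajectories.
logits = np.zeros((2, 2))
V = np.array([0.5, 1.0])            # placeholder baseline values
trajs = [[(0, 1, 1.0), (1, 0, 2.0)],
         [(0, 0, 0.0), (1, 1, 1.0)]]
g_hat = policy_gradient_with_baseline(trajs, logits, V)
logits += 0.1 * g_hat               # one gradient-ascent step on the policy
print(g_hat)
```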
6. Summary
- Baseline Subtraction Reduces Variance: Subtracting a baseline, particularly a state-dependent baseline $V(s_t)$, reduces variance in the gradient estimate without biasing it.
- Temporal Dependency: Only future rewards contribute to the gradient at time $t$.
- Action Updates:
  - If $R_t > V(s_t)$, increase the probability of $u_t$ (the action performed well).
  - If $R_t < V(s_t)$, decrease the probability of $u_t$ (the action underperformed).
This approach ensures efficient learning by focusing updates on performance relative to expectations.