1. Introduction
Selecting an appropriate baseline is crucial for reducing variance in gradient estimates without introducing bias. Below are different choices for baselines and their characteristics:
2. Types of Baselines for Policy Gradient
2a. Constant Baseline
A constant baseline is the simplest choice, typically using the average total reward across all trajectories:
$$
b = \mathbb{E}[R(\tau)] \approx \frac{1}{m} \sum_{i=1}^m R(\tau^{(i)}),
$$
where:
- $m$ is the number of trajectories (rollouts),
- $R(\tau^{(i)})$ is the total reward of the $i$-th trajectory.
This baseline indicates whether a trajectory's return is above or below average, so the update makes above-average trajectories more likely and below-average trajectories less likely. While it reduces variance, it does not account for differences between individual states or actions.
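As a quick illustration, here is a minimal NumPy sketch of this baseline; the function name and array layout are illustrative, assuming each rollout is summarized by its total reward:

```python
import numpy as np

def constant_baseline_advantages(total_rewards):
    """total_rewards: shape (m,), one total reward R(tau^(i)) per rollout."""
    b = total_rewards.mean()       # b ≈ (1/m) * sum_i R(tau^(i))
    return total_rewards - b       # centered returns that scale the grad-log-prob terms

# Example: three rollouts with total rewards 4, 10, and 7.
print(constant_baseline_advantages(np.array([4.0, 10.0, 7.0])))  # [-3.  3.  0.]
```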
2b. Optimal Constant Baseline (Minimum Variance Baseline)
The optimal constant baseline minimizes the variance of the gradient estimate among all constant baselines. It is defined as:
$$
b = \frac{\sum_{i} \left(\nabla_\theta \log P_\theta(\tau^{(i)})\right)^2 R(\tau^{(i)})}{\sum_{i} \left(\nabla_\theta \log P_\theta(\tau^{(i)})\right)^2}.
$$
This baseline weights each trajectory's reward by the squared magnitude of its score function $\nabla_\theta \log P_\theta(\tau^{(i)})$. While it theoretically achieves the minimum variance among constant baselines, it is rarely used in practice due to its computational cost and the need to store per-trajectory gradients.
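The sketch below applies the formula per parameter component, which is one common reading of the squared gradient; the array shapes and the small epsilon added for numerical safety are assumptions:

```python
import numpy as np

def optimal_constant_baseline(grad_log_probs, total_rewards, eps=1e-8):
    """grad_log_probs: shape (m, d), row i holds grad_theta log P_theta(tau^(i)).
    total_rewards: shape (m,), total reward R(tau^(i)) per rollout."""
    g2 = grad_log_probs ** 2                          # elementwise squared score, (m, d)
    num = (g2 * total_rewards[:, None]).sum(axis=0)   # sum_i g_i^2 * R(tau^(i)), (d,)
    den = g2.sum(axis=0) + eps                        # sum_i g_i^2, (d,)
    return num / den                                  # one baseline value per parameter component
```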
2c. Time-Dependent Baseline
A time-dependent baseline accounts for the finite horizon of rollouts: as time progresses, fewer rewards remain to be collected, so the expected reward-to-go shrinks. The baseline at time $t$ is the average reward-to-go across rollouts:
$$
b_t = \frac{1}{m} \sum_{i=1}^m \sum_{k=t}^{H-1} R(s_k^{(i)}, u_k^{(i)}),
$$
where:
- $H$ is the horizon length,
- $R(s_k^{(i)}, u_k^{(i)})$ is the reward received at step $k$ of the $i$-th trajectory.
This approach is useful for finite-horizon problems because the expected return shrinks as fewer steps remain, and the baseline tracks that change.
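A minimal sketch of this computation, assuming per-step rewards are stored in an $m \times H$ array (names are illustrative):

```python
import numpy as np

def time_dependent_baseline(rewards):
    """rewards: shape (m, H), rewards[i, k] = R(s_k^(i), u_k^(i))."""
    # Reverse cumulative sum over time gives the reward-to-go sum_{k=t}^{H-1} R_k
    # for every rollout and every t; averaging over rollouts yields b_t.
    reward_to_go = np.flip(np.cumsum(np.flip(rewards, axis=1), axis=1), axis=1)  # (m, H)
    return reward_to_go.mean(axis=0)                                             # (H,)
```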
2d. State-Dependent Baseline (Value Function)
The most effective and widely used baseline is the state-dependent baseline, where $b(s_t)$ is the expected return from the current state $s_t$, i.e., the value function:
$$
b(s_t) = \mathbb{E} \left[ r_t + r_{t+1} + \dots + r_{H-1} \mid s_t \right] = V^\pi(s_t).
$$
Using $V^\pi(s_t)$ allows for precise comparisons between the actual return and the expected return under the current policy.
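One simple way to obtain such a baseline is to fit an approximate value function by regression onto observed returns. The sketch below uses a linear value function trained by gradient descent on a squared error; the parameterization, learning rate, and iteration count are illustrative assumptions, not a prescribed recipe. In practice the value function is usually a neural network trained alongside the policy, but the fit-by-regression idea is the same.

```python
import numpy as np

def fit_value_baseline(states, returns, lr=1e-2, iters=200):
    """Fit a linear value function V(s) = s @ w by regression onto returns.
    states: shape (N, d) state features; returns: shape (N,) reward-to-go targets."""
    N, d = states.shape
    w = np.zeros(d)
    for _ in range(iters):
        pred = states @ w
        grad = states.T @ (pred - returns) / N   # gradient of 0.5 * mean squared error
        w -= lr * grad
    return w

# Advantage estimate used in the policy gradient: A_t = R_t - V(s_t)
# advantages = returns - states @ w
```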
Advantages of State-Dependent Baselines
- Variance Reduction: By comparing the actual return $R_t$ to the expected return $V(s_t)$, updates focus on deviations from expected outcomes, reducing noise in the gradient estimate.
- Policy Evaluation: Leveraging $V(s_t)$ integrates policy evaluation into policy improvement, aligning updates with long-term goals.
3. Summary of Baseline Choices
| Baseline Type | Use Case |
|---|---|
| Constant Baseline | Simplest, but not state-sensitive. |
| Optimal Constant | Theoretical minimum variance. |
| Time-Dependent Baseline | For finite-horizon problems. |
| State-Dependent (Value) | Most effective for variance reduction. |
This structured approach to baseline selection ensures efficient and stable learning, making reinforcement learning algorithms more practical and robust.