1. Introduction
Selecting an appropriate baseline is crucial for reducing variance in gradient estimates without introducing bias. Below are different choices for baselines and their characteristics:
2. Types of Baselines for Policy Gradient
2a. Constant Baseline
A constant baseline is the simplest choice, typically using the average total reward across all trajectories:
$$
b = \mathbb{E}[R(\tau)] \approx \frac{1}{m} \sum_{i=1}^m R(\tau^{(i)}),
$$
where:
- $m$ is the number of trajectories (rollouts),
- $R(\tau^{(i)})$ is the total reward of the $i$-th trajectory.
This baseline indicates whether a trajectory's return is above or below average, so the update makes above-average trajectories more likely and below-average trajectories less likely. While it reduces variance, it does not account for differences between individual states or actions.
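As a quick illustration, here is a minimal NumPy sketch of this baseline; the function name and array layout are illustrative, assuming each rollout is summarized by its total reward:

```python
import numpy as np

def constant_baseline_advantages(total_rewards):
    """total_rewards: shape (m,), one total reward R(tau^(i)) per rollout."""
    b = total_rewards.mean()       # b ≈ (1/m) * sum_i R(tau^(i))
    return total_rewards - b       # centered returns that scale the grad-log-prob terms

# Example: three rollouts with total rewards 4, 10, and 7.
print(constant_baseline_advantages(np.array([4.0, 10.0, 7.0])))  # [-3.  3.  0.]
```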
2b. Optimal Constant Baseline (Minimum Variance Baseline)
The optimal constant baseline minimizes the variance of the gradient estimate among all constant baselines. It is defined as:
$$
b = \frac{\sum_{i} \left(\nabla_\theta \log P_\theta(\tau^{(i)})\right)^2 R(\tau^{(i)})}{\sum_{i} \left(\nabla_\theta \log P_\theta(\tau^{(i)})\right)^2}.
$$
This baseline weights each trajectory's reward by the squared magnitude of its score function $\nabla_\theta \log P_\theta(\tau^{(i)})$. While it theoretically achieves the minimum variance among constant baselines, it is rarely used in practice due to its computational cost and the need to store per-trajectory gradients.
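The sketch below applies the formula per parameter component, which is one common reading of the squared gradient; the array shapes and the small epsilon added for numerical safety are assumptions:

```python
import numpy as np

def optimal_constant_baseline(grad_log_probs, total_rewards, eps=1e-8):
    """grad_log_probs: shape (m, d), row i holds grad_theta log P_theta(tau^(i)).
    total_rewards: shape (m,), total reward R(tau^(i)) per rollout."""
    g2 = grad_log_probs ** 2                          # elementwise squared score, (m, d)
    num = (g2 * total_rewards[:, None]).sum(axis=0)   # sum_i g_i^2 * R(tau^(i)), (d,)
    den = g2.sum(axis=0) + eps                        # sum_i g_i^2, (d,)
    return num / den                                  # one baseline value per parameter component
```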
2c. Time-Dependent Baseline
A time-dependent baseline accounts for the finite horizon of rollouts: as time progresses, fewer rewards remain to be collected, so the expected reward-to-go shrinks. The baseline at time $t$ is the average reward-to-go across rollouts:
$$
b_t = \frac{1}{m} \sum_{i=1}^m \sum_{k=t}^{H-1} R(s_k^{(i)}, u_k^{(i)}),
$$
where:
- $H$ is the horizon length,
- $R(s_k^{(i)}, u_k^{(i)})$ is the reward received at step $k$ of the $i$-th trajectory.
This approach is useful for finite-horizon problems because the expected return shrinks as fewer steps remain, and the baseline tracks that change.
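A minimal sketch of this computation, assuming per-step rewards are stored in an $m \times H$ array (names are illustrative):

```python
import numpy as np

def time_dependent_baseline(rewards):
    """rewards: shape (m, H), rewards[i, k] = R(s_k^(i), u_k^(i))."""
    # Reverse cumulative sum over time gives the reward-to-go sum_{k=t}^{H-1} R_k
    # for every rollout and every t; averaging over rollouts yields b_t.
    reward_to_go = np.flip(np.cumsum(np.flip(rewards, axis=1), axis=1), axis=1)  # (m, H)
    return reward_to_go.mean(axis=0)                                             # (H,)
```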
2d. State-Dependent Baseline (Value Function)
The most effective and widely used baseline is the state-dependent baseline, where $b(s_t)$ is the expected return from the current state $s_t$, i.e., the value function:
$$
b(s_t) = \mathbb{E} \left[ r_t + r_{t+1} + \dots + r_{H-1} \mid s_t \right] = V^\pi(s_t).
$$
Using $V^\pi(s_t)$ allows for precise comparisons between the actual return and the expected return under the current policy.
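One simple way to obtain such a baseline is to fit an approximate value function by regression onto observed returns. The sketch below uses a linear value function trained by gradient descent on a squared error; the parameterization, learning rate, and iteration count are illustrative assumptions, not a prescribed recipe. In practice the value function is usually a neural network trained alongside the policy, but the fit-by-regression idea is the same.

```python
import numpy as np

def fit_value_baseline(states, returns, lr=1e-2, iters=200):
    """Fit a linear value function V(s) = s @ w by regression onto returns.
    states: shape (N, d) state features; returns: shape (N,) reward-to-go targets."""
    N, d = states.shape
    w = np.zeros(d)
    for _ in range(iters):
        pred = states @ w
        grad = states.T @ (pred - returns) / N   # gradient of 0.5 * mean squared error
        w -= lr * grad
    return w

# Advantage estimate used in the policy gradient: A_t = R_t - V(s_t)
# advantages = returns - states @ w
```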
Advantages of State-Dependent Baselines
- Variance Reduction: By comparing the actual return $R_t$ to the expected return $V(s_t)$, updates focus on deviations from expected outcomes, reducing noise in the gradient estimate.
- Policy Evaluation: Leveraging $V(s_t)$ integrates policy evaluation into policy improvement, aligning updates with long-term goals.
3. Summary of Baseline Choices
| Baseline Type | Use Case |
|---|---|
| Constant Baseline | Simplest, but not state-sensitive. |
| Optimal Constant | Theoretical minimum variance. |
| Time-Dependent Baseline | For finite-horizon problems. |
| State-Dependent (Value) | Most effective for variance reduction. |
This structured approach to baseline selection ensures efficient and stable learning, making reinforcement learning algorithms more practical and robust.