Baseline Choices in Policy Gradient Methods

Reinforcement Learning

Last updated: November 25, 2024

1. Introduction

Selecting an appropriate baseline is crucial for reducing variance in gradient estimates without introducing bias. The sections below describe several common baseline choices and their characteristics.

2. Types of Baselines for Policy Gradient

2a. Constant Baseline

A constant baseline is the simplest choice, typically using the average total reward across all trajectories:

$$
 b = \mathbb{E}[R(\tau)] \approx \frac{1}{m} \sum_{i=1}^m R(\tau^{(i)}), 
$$

where $m$ is the number of sampled trajectories and $R(\tau^{(i)})$ is the total reward of the $i$-th trajectory $\tau^{(i)}$.

This baseline identifies whether a trajectory's performance is above or below average. While it reduces variance, it does not account for differences in individual states or actions.
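
A minimal sketch of this in NumPy (assuming precomputed per-trajectory returns and score vectors $\nabla_\theta \log P_\theta(\tau)$; the array values below are placeholders):

```python
import numpy as np

# Placeholder inputs: returns[i] = R(tau^(i)), the total reward of trajectory i,
# and grad_log_probs[i] = grad_theta log P_theta(tau^(i)) for that trajectory.
returns = np.array([12.0, 7.5, 10.2, 4.1])   # shape (m,)
grad_log_probs = np.random.randn(4, 6)        # shape (m, n_params)

# Constant baseline: the empirical mean of the total rewards.
b = returns.mean()

# Baselined REINFORCE-style estimate: average of grad log P(tau) * (R(tau) - b).
policy_gradient = np.mean(grad_log_probs * (returns - b)[:, None], axis=0)
```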

2b. Optimal Constant Baseline (Minimum Variance Baseline)

The optimal constant baseline minimizes variance in the gradient estimate. It is defined as:

$$
 b = \frac{\sum_{i} \left(\nabla_\theta \log P_\theta(\tau^{(i)})\right)^2 R(\tau^{(i)})}{\sum_{i} \left(\nabla_\theta \log P_\theta(\tau^{(i)})\right)^2}. 
$$

This baseline weights each trajectory's reward by its squared score, with the square applied per parameter component, so each parameter dimension effectively gets its own baseline. While it theoretically achieves minimum variance, it is rarely used in practice due to its computational cost and its reliance on trajectory-level data.
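
Using the same placeholder inputs as in the previous sketch, the per-component optimal baseline could be computed as follows (again only a sketch, not a full training loop):

```python
import numpy as np

# Placeholder inputs, as in the previous sketch.
returns = np.array([12.0, 7.5, 10.2, 4.1])   # shape (m,)
grad_log_probs = np.random.randn(4, 6)        # shape (m, n_params)

# Optimal constant baseline, computed per parameter component:
# weight each trajectory's return by that component's squared score.
sq_scores = grad_log_probs ** 2                                              # shape (m, n_params)
b_opt = (sq_scores * returns[:, None]).sum(axis=0) / sq_scores.sum(axis=0)   # shape (n_params,)

# Each parameter component then uses its own baseline in the gradient estimate.
policy_gradient = np.mean(grad_log_probs * (returns[:, None] - b_opt), axis=0)
```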

2c. Time-Dependent Baseline

A time-dependent baseline adjusts for the finite horizon of rollouts, accounting for how the number of remaining rewards decreases over time. The baseline at time $t$ is defined as:

$$
 b_t = \frac{1}{m} \sum_{i=1}^m \sum_{k=t}^{H-1} R(s_k^{(i)}, u_k^{(i)}), 
$$

where $m$ is the number of rollouts, $H$ is the horizon, and $R(s_k^{(i)}, u_k^{(i)})$ is the reward received at time $k$ in the $i$-th rollout.

This approach is useful for problems with finite horizons, as it adjusts expectations dynamically based on time.
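
A sketch of this computation in NumPy, assuming a reward matrix indexed by rollout and time step (placeholder values below):

```python
import numpy as np

# Placeholder rewards: rewards[i, k] = R(s_k^(i), u_k^(i)) for rollout i at time k.
m, H = 8, 50
rewards = np.random.rand(m, H)

# Reward-to-go for every rollout and time step: sum_{k=t}^{H-1} rewards[i, k].
# A reversed cumulative sum along the time axis computes this in one pass.
rewards_to_go = np.flip(np.cumsum(np.flip(rewards, axis=1), axis=1), axis=1)  # shape (m, H)

# Time-dependent baseline: average reward-to-go over rollouts at each time step.
b_t = rewards_to_go.mean(axis=0)   # shape (H,)

# The baselined term for rollout i at time t is rewards_to_go[i, t] - b_t[t].
advantages = rewards_to_go - b_t
```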

2d. State-Dependent Baseline (Value Function)

The most effective and widely used baseline is the state-dependent baseline, where $b(s_t)$ is the expected return from the current state, i.e., the value function:

$$
 b(s_t) = \mathbb{E} \left[ r_t + r_{t+1} + \dots + r_{H-1} \right] = V^\pi(s_t). 
$$

Using $V^\pi(s_t)$ allows for precise comparisons between the actual return and the expected return under the current policy.
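
As an illustration, here is a minimal sketch that fits a linear value-function approximator to Monte Carlo returns by least squares and uses it as the baseline; in practice the baseline is usually a neural-network critic trained alongside the policy (all inputs below are placeholders):

```python
import numpy as np

# Placeholder batch: states[i, t] is a feature vector for s_t in rollout i,
# and rewards_to_go[i, t] is the Monte Carlo return from that state onward.
m, H, d = 8, 50, 4
states = np.random.randn(m, H, d)
rewards_to_go = np.random.rand(m, H)

# Fit a linear value function V(s) ~ w . s by least squares on the observed returns.
X = states.reshape(m * H, d)
y = rewards_to_go.reshape(m * H)
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# State-dependent baseline and the resulting advantage estimate A_t = R_t - V(s_t).
values = states @ w                 # shape (m, H)
advantages = rewards_to_go - values
```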

Advantages of State-Dependent Baselines

  1. Variance Reduction: By comparing the actual return $R_t$ to the expected return $V(s_t)$, updates focus on deviations from expected outcomes, reducing noise in the gradient estimate.
  2. Policy Evaluation: Leveraging $V(s_t)$ integrates policy evaluation into policy improvement, aligning updates with long-term goals.

3. Summary of Baseline Choices

| Baseline Type | Use Case |
| --- | --- |
| Constant baseline | Simplest choice; not state-sensitive. |
| Optimal constant baseline | Theoretical minimum variance; rarely used in practice. |
| Time-dependent baseline | Finite-horizon problems. |
| State-dependent baseline (value function) | Most effective for variance reduction. |

This structured approach to baseline selection ensures efficient and stable learning, making reinforcement learning algorithms more practical and robust.
