1. Introduction
Vanilla Policy Gradient (VPG) is a foundational algorithm that directly optimizes a policy $\pi_\theta(a \mid s)$ by performing gradient ascent on the expected return.
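In the standard reward-to-go formulation, the objective and its gradient (the policy gradient theorem) are:

$$
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\big],
\qquad
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right],
\qquad
G_t = \sum_{k=t}^{T} \gamma^{\,k-t} r_k.
$$

VPG estimates this gradient from sampled trajectories and takes a gradient-ascent step on $\theta$.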
1a. Why VPG?
- Direct approach: No need for a separate value-based step or complex argmax computations.
- Flexible: Works in continuous action spaces where Q-learning might struggle.
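To make the continuous-action point concrete, here is a minimal sketch of a Gaussian policy head, assuming PyTorch; the class name `GaussianPolicy`, the layer sizes, and the 8-dimensional observation / 2-dimensional action (matching the continuous Lunar Lander variant) are illustrative choices rather than anything prescribed by VPG itself:

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class GaussianPolicy(nn.Module):
    """Illustrative continuous-action policy: a state-conditioned diagonal Gaussian."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),                     # outputs the Gaussian mean
        )
        self.log_std = nn.Parameter(torch.zeros(act_dim))   # state-independent log std

    def forward(self, obs: torch.Tensor) -> Normal:
        return Normal(self.net(obs), self.log_std.exp())

# Sample an action and keep its log-probability, which is what the VPG loss needs.
policy = GaussianPolicy(obs_dim=8, act_dim=2)   # continuous Lunar Lander dimensions, for illustration
dist = policy(torch.randn(8))
action = dist.sample()
log_prob = dist.log_prob(action).sum()          # sum over action dimensions
```

For the default discrete-action Lunar Lander, a `torch.distributions.Categorical` head over the four thruster actions plays the same role.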
2. What VPG Offers
- Simple Conceptual Flow:
  - Collect trajectories, compute returns, update the policy in proportion to how good those returns are.
- Baseline for Variance Reduction:
  - Incorporating a value function baseline (as discussed above) can significantly reduce gradient variance; the estimator after this list makes this concrete.
- Policy-Centric:
  - The algorithm focuses purely on improving the policy $\pi_\theta$, which can be beneficial in complicated environments like Lunar Lander.
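The variance-reduction idea above can be made concrete: with a learned baseline $V_\phi$, the gradient is estimated over a batch of trajectories $\mathcal{D}$ as

$$
\nabla_\theta J(\theta) \approx \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \big(G_t - V_\phi(s_t)\big).
$$

Subtracting $V_\phi(s_t)$ leaves the gradient unbiased in expectation while typically reducing its variance.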
3. Algorithm Steps
Initialization:
- Policy parameters $\theta$ of the policy $\pi_\theta$.
- Baseline parameters $\phi$ of the value function $V_\phi$.

Loop (until convergence):
- Trajectory Collection: Roll out a batch of episodes $\mathcal{D} = \{\tau_i\}$ under the current policy $\pi_\theta$.
- Returns & Advantages: For each time step $t$, compute the reward-to-go $G_t = \sum_{k=t}^{T} \gamma^{\,k-t} r_k$ and the advantage estimate $A_t = G_t - V_\phi(s_t)$.
- Baseline Update: Fit $V_\phi$ to the returns, e.g., by minimizing $\sum_t \big(V_\phi(s_t) - G_t\big)^2$.
- Policy Update: Perform gradient ascent on $J(\theta)$, i.e., $\theta \leftarrow \theta + \alpha \, \hat{g}$ with $\hat{g} = \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, A_t$.
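A minimal sketch of one such iteration, assuming PyTorch and a policy module that returns a `torch.distributions` object when called on a batch of observations (as in the sketch above). The function names, the episode-dictionary format, and `gamma` are illustrative assumptions, not a reference implementation:

```python
import torch

def compute_returns(rewards, gamma=0.99):
    """Discounted reward-to-go G_t for one episode (list of floats -> 1-D tensor)."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return torch.tensor(list(reversed(returns)), dtype=torch.float32)

def vpg_update(policy, value_fn, policy_opt, value_opt, episodes, gamma=0.99):
    """One VPG iteration over a batch of on-policy episodes.

    Each episode is a dict with 'obs' (T, obs_dim) tensor, 'actions' tensor,
    and 'rewards' (list of T floats).
    """
    policy_losses, value_losses = [], []
    for ep in episodes:
        returns = compute_returns(ep["rewards"], gamma)

        # Baseline prediction; detached so the policy loss does not backprop into V_phi.
        values = value_fn(ep["obs"]).squeeze(-1)
        advantages = returns - values.detach()

        # REINFORCE-with-baseline term: -(log prob * advantage).
        log_probs = policy(ep["obs"]).log_prob(ep["actions"])
        if log_probs.dim() > 1:                     # continuous policies: sum over action dims
            log_probs = log_probs.sum(-1)
        policy_losses.append(-(log_probs * advantages).mean())

        # Fit the baseline to the empirical returns by least squares.
        value_losses.append(((values - returns) ** 2).mean())

    policy_opt.zero_grad()
    torch.stack(policy_losses).mean().backward()
    policy_opt.step()

    value_opt.zero_grad()
    torch.stack(value_losses).mean().backward()
    value_opt.step()
```

A common extra step is to normalize the advantages within each batch (subtract the mean, divide by the standard deviation), which often stabilizes training.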
4. Practical Considerations
- High Variance: VPG can be noisy; that’s why a good baseline or advanced advantage methods are key.
- Sample Efficiency: Must gather new trajectories from the current policy each iteration (on-policy).
- Lunar Lander Example:
  - You gather a few episodes using the current policy $\pi_\theta$ (see the rollout sketch after this list).
  - Many landings might fail at first, but you compute returns anyway.
  - Over time, you see improvements as the policy learns to control thrusters more precisely, using the advantage to adjust the policy parameters.
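To ground this example, here is a rollout-collection sketch against the Gymnasium API; the environment id and the `policy` interface are assumptions carried over from the earlier sketches (Lunar Lander also requires Gymnasium's Box2D extra):

```python
import gymnasium as gym
import torch

def collect_episode(env, policy):
    """Roll out one on-policy episode in the format expected by vpg_update above."""
    obs_list, act_list, rew_list = [], [], []
    obs, _ = env.reset()
    done = False
    while not done:
        obs_t = torch.as_tensor(obs, dtype=torch.float32)
        with torch.no_grad():                  # log-probs are recomputed later, inside the update
            action = policy(obs_t).sample()
        # Discrete spaces expect a plain int, continuous ones an array.
        step_action = action.item() if action.dim() == 0 else action.numpy()
        obs, reward, terminated, truncated, _ = env.step(step_action)
        done = terminated or truncated
        obs_list.append(obs_t)
        act_list.append(action)
        rew_list.append(float(reward))
    return {"obs": torch.stack(obs_list), "actions": torch.stack(act_list), "rewards": rew_list}

env = gym.make("LunarLander-v2")   # discrete thrusters; id may be "LunarLander-v3" on newer Gymnasium releases
```

Each iteration, you would collect a handful of episodes with a policy head that matches the action space (a `Categorical` head for the discrete thrusters, the Gaussian sketch above for the continuous variant), pass them to `vpg_update`, and then discard them and collect fresh ones under the updated policy, which is exactly the on-policy sample-efficiency cost noted above.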
5. Summary
- VPG is a direct, relatively simple approach to on-policy RL.
- It forms the basis for many advanced methods (e.g., PPO, A2C).
- Key to success: variance-reducing measures like a well-trained value function baseline.