1. Introduction
Vanilla Policy Gradient (VPG) is a foundational algorithm that directly optimizes a policy $\pi_\theta(a \mid s)$ by performing gradient ascent on the expected return.
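In the standard reward-to-go formulation, the objective and its gradient (the policy gradient theorem) are:

$$
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\big],
\qquad
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right],
\qquad
G_t = \sum_{k=t}^{T} \gamma^{\,k-t} r_k.
$$

VPG estimates this gradient from sampled trajectories and takes a gradient-ascent step on $\theta$.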
1a. Why VPG?
- Direct approach: No need for a separate value-based step or complex argmax computations.
- Flexible: Works in continuous action spaces where Q-learning might struggle.
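To make the continuous-action point concrete, here is a minimal sketch of a Gaussian policy head, assuming PyTorch; the class name `GaussianPolicy`, the layer sizes, and the 8-dimensional observation / 2-dimensional action (matching the continuous Lunar Lander variant) are illustrative choices rather than anything prescribed by VPG itself:

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class GaussianPolicy(nn.Module):
    """Illustrative continuous-action policy: a state-conditioned diagonal Gaussian."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),                     # outputs the Gaussian mean
        )
        self.log_std = nn.Parameter(torch.zeros(act_dim))   # state-independent log std

    def forward(self, obs: torch.Tensor) -> Normal:
        return Normal(self.net(obs), self.log_std.exp())

# Sample an action and keep its log-probability, which is what the VPG loss needs.
policy = GaussianPolicy(obs_dim=8, act_dim=2)   # continuous Lunar Lander dimensions, for illustration
dist = policy(torch.randn(8))
action = dist.sample()
log_prob = dist.log_prob(action).sum()          # sum over action dimensions
```

For the default discrete-action Lunar Lander, a `torch.distributions.Categorical` head over the four thruster actions plays the same role.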
2. What VPG Offers
- Simple Conceptual Flow:
  - Collect trajectories, compute returns, update the policy in proportion to how good those returns are.
- Baseline for Variance Reduction:
  - Incorporating a value function baseline (as discussed above) can significantly reduce gradient variance; the estimator after this list makes this concrete.
- Policy-Centric:
  - The algorithm focuses purely on improving the policy $\pi_\theta$, which can be beneficial in complicated environments like Lunar Lander.
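The variance-reduction idea above can be made concrete: with a learned baseline $V_\phi$, the gradient is estimated over a batch of trajectories $\mathcal{D}$ as

$$
\nabla_\theta J(\theta) \approx \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \big(G_t - V_\phi(s_t)\big).
$$

Subtracting $V_\phi(s_t)$ leaves the gradient unbiased in expectation while typically reducing its variance.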
3. Algorithm Steps
Initialization:
- Policy parameters $\theta$ of the policy $\pi_\theta$.
- Baseline parameters $\phi$ of the value function $V_\phi$.

Loop (until convergence):
- Trajectory Collection: Roll out a batch of episodes $\mathcal{D} = \{\tau_i\}$ under the current policy $\pi_\theta$.
- Returns & Advantages: For each time step $t$, compute the reward-to-go $G_t = \sum_{k=t}^{T} \gamma^{\,k-t} r_k$ and the advantage estimate $A_t = G_t - V_\phi(s_t)$.
- Baseline Update: Fit $V_\phi$ to the returns, e.g., by minimizing $\sum_t \big(V_\phi(s_t) - G_t\big)^2$.
- Policy Update: Perform gradient ascent on $J(\theta)$, i.e., $\theta \leftarrow \theta + \alpha \, \hat{g}$ with $\hat{g} = \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, A_t$.
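A minimal sketch of one such iteration, assuming PyTorch and a policy module that returns a `torch.distributions` object when called on a batch of observations (as in the sketch above). The function names, the episode-dictionary format, and `gamma` are illustrative assumptions, not a reference implementation:

```python
import torch

def compute_returns(rewards, gamma=0.99):
    """Discounted reward-to-go G_t for one episode (list of floats -> 1-D tensor)."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return torch.tensor(list(reversed(returns)), dtype=torch.float32)

def vpg_update(policy, value_fn, policy_opt, value_opt, episodes, gamma=0.99):
    """One VPG iteration over a batch of on-policy episodes.

    Each episode is a dict with 'obs' (T, obs_dim) tensor, 'actions' tensor,
    and 'rewards' (list of T floats).
    """
    policy_losses, value_losses = [], []
    for ep in episodes:
        returns = compute_returns(ep["rewards"], gamma)

        # Baseline prediction; detached so the policy loss does not backprop into V_phi.
        values = value_fn(ep["obs"]).squeeze(-1)
        advantages = returns - values.detach()

        # REINFORCE-with-baseline term: -(log prob * advantage).
        log_probs = policy(ep["obs"]).log_prob(ep["actions"])
        if log_probs.dim() > 1:                     # continuous policies: sum over action dims
            log_probs = log_probs.sum(-1)
        policy_losses.append(-(log_probs * advantages).mean())

        # Fit the baseline to the empirical returns by least squares.
        value_losses.append(((values - returns) ** 2).mean())

    policy_opt.zero_grad()
    torch.stack(policy_losses).mean().backward()
    policy_opt.step()

    value_opt.zero_grad()
    torch.stack(value_losses).mean().backward()
    value_opt.step()
```

A common extra step is to normalize the advantages within each batch (subtract the mean, divide by the standard deviation), which often stabilizes training.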
4. Practical Considerations
- High Variance: VPG can be noisy; that’s why a good baseline or advanced advantage methods are key.
- Sample Efficiency: Must gather new trajectories from the current policy each iteration (on-policy).
- Lunar Lander Example:
  - You gather a few episodes using the current policy $\pi_\theta$ (see the rollout sketch after this list).
  - Many landings might fail at first, but you compute returns anyway.
  - Over time, you see improvements as the policy learns to control thrusters more precisely, using the advantage to adjust the policy parameters.
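To ground this example, here is a rollout-collection sketch against the Gymnasium API; the environment id and the `policy` interface are assumptions carried over from the earlier sketches (Lunar Lander also requires Gymnasium's Box2D extra):

```python
import gymnasium as gym
import torch

def collect_episode(env, policy):
    """Roll out one on-policy episode in the format expected by vpg_update above."""
    obs_list, act_list, rew_list = [], [], []
    obs, _ = env.reset()
    done = False
    while not done:
        obs_t = torch.as_tensor(obs, dtype=torch.float32)
        with torch.no_grad():                  # log-probs are recomputed later, inside the update
            action = policy(obs_t).sample()
        # Discrete spaces expect a plain int, continuous ones an array.
        step_action = action.item() if action.dim() == 0 else action.numpy()
        obs, reward, terminated, truncated, _ = env.step(step_action)
        done = terminated or truncated
        obs_list.append(obs_t)
        act_list.append(action)
        rew_list.append(float(reward))
    return {"obs": torch.stack(obs_list), "actions": torch.stack(act_list), "rewards": rew_list}

env = gym.make("LunarLander-v2")   # discrete thrusters; id may be "LunarLander-v3" on newer Gymnasium releases
```

Each iteration, you would collect a handful of episodes with a policy head that matches the action space (a `Categorical` head for the discrete thrusters, the Gaussian sketch above for the continuous variant), pass them to `vpg_update`, and then discard them and collect fresh ones under the updated policy, which is exactly the on-policy sample-efficiency cost noted above.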
5. Summary
- VPG is a direct, relatively simple approach to on-policy RL.
- It forms the basis for many advanced methods (e.g., PPO, A2C).
- Key to success: variance-reducing measures like a well-trained value function baseline.