Vanilla Policy Gradient (VPG)

Deep Reinforcement Learning

Last updated: January 01, 2025

1. Introduction

Vanilla Policy Gradient (VPG) is a foundational algorithm that directly optimizes a policy πθ by maximizing the expected return via gradient ascent. Instead of learning a Q-function to derive a policy, VPG simply updates θ in the direction that increases the probability of rewarding trajectories.
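
In symbols, the objective and its gradient can be written with the standard score-function (REINFORCE) estimator; the notation matches the algorithm steps below:

```latex
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[ R(\tau) \big],
\qquad
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[ \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau) \Big]
```

In practice, the full-trajectory return R(τ) is replaced by a per-step advantage estimate, which is what the baseline in the algorithm below provides.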

1a. Why VPG?

2. What VPG Offers

  1. Simple Conceptual Flow:
    • Collect trajectories, compute returns, update the policy in proportion to how good those returns are.
  2. Baseline for Variance Reduction:
    • Subtracting a value-function baseline Vϕ(st) from the return (see the sketch after this list, and Step 3 of the algorithm below) can significantly reduce gradient variance.
  3. Policy-Centric:
    • The algorithm focuses purely on improving πθ, which can be beneficial in complicated environments like Lunar Lander.
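
To make points 1 and 2 concrete, here is a minimal NumPy sketch that computes reward-to-go returns and baseline-subtracted advantages for one episode. The helper name `returns_and_advantages` and the assumption that per-step rewards and value estimates are already available as arrays are illustrative, not a prescribed implementation:

```python
import numpy as np

def returns_and_advantages(rewards, values, gamma=0.99):
    """Reward-to-go returns R_t and advantages A_t = R_t - V(s_t) for one episode."""
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    returns = np.zeros_like(rewards)
    running = 0.0
    # Accumulate the discounted return backwards from the end of the episode.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    advantages = returns - values  # subtracting the baseline reduces variance
    return returns, advantages
```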

3. Algorithm Steps

Initialization: Initialize the policy parameters θ (for πθ) and the value-function parameters ϕ (for the baseline Vϕ).

Loop (until convergence):

  1. Trajectory Collection: Roll out episodes {τi} under the current policy πθ.
  2. Returns & Advantages: For each time step t, compute:
    R_t = \sum_{k=t}^{T-1} \gamma^{k-t} r_k, \qquad A_t = R_t - V_\phi(s_t).
  3. Baseline Update: Fit Vϕ to the returns: \min_\phi \sum_t \big(V_\phi(s_t) - R_t\big)^2.
  4. Policy Update: \nabla_\theta J(\theta) = \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A_t.
    Perform gradient ascent on θ with this estimate (a code sketch of Steps 2–4 follows this list).
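
Below is a minimal single-iteration sketch of Steps 2–4 in PyTorch. The `policy` network (assumed to return a `torch.distributions` object), the `value_fn` network, the two optimizers, and the pre-collected `batch` tensors are all assumptions made for illustration, not a prescribed implementation:

```python
import torch

def vpg_update(policy, value_fn, policy_opt, value_opt, batch):
    """One VPG iteration over a batch of collected transitions.

    Assumed batch contents (tensors): obs, actions, and reward-to-go returns.
    `policy(obs)` is assumed to return a torch.distributions.Distribution.
    """
    obs, actions, returns = batch["obs"], batch["actions"], batch["returns"]

    # Step 2: advantages A_t = R_t - V_phi(s_t), using the current baseline.
    with torch.no_grad():
        advantages = returns - value_fn(obs).squeeze(-1)

    # Step 3: baseline update -- regress V_phi(s_t) toward the returns R_t.
    value_loss = ((value_fn(obs).squeeze(-1) - returns) ** 2).mean()
    value_opt.zero_grad()
    value_loss.backward()
    value_opt.step()

    # Step 4: policy update -- ascend sum_t log pi_theta(a_t|s_t) * A_t
    # (minimizing the negated objective is equivalent).
    log_probs = policy(obs).log_prob(actions)
    policy_loss = -(log_probs * advantages).mean()
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()

    return policy_loss.item(), value_loss.item()
```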

4. Practical Considerations

  1. High Variance: VPG's gradient estimates are noisy; a good baseline or a more advanced advantage estimator is key to stable learning.
  2. Sample Efficiency: New trajectories must be gathered from the current policy at every iteration (VPG is on-policy).
  3. Lunar Lander Example:
    • You gather a few episodes using the current πθ (a rollout sketch follows this list).
    • Many landings might fail at first, but you compute returns anyway.
    • Over time, you see improvements as the policy learns to control thrusters more precisely, using the advantage to adjust the policy parameters.
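
A minimal on-policy rollout sketch for the Lunar Lander case is below, assuming Gymnasium with the Box2D extras installed; `select_action` is a hypothetical helper that samples an action from the current πθ:

```python
import gymnasium as gym

def collect_episode(env, select_action):
    """Roll out one full episode with the current policy and return its data."""
    obs, _ = env.reset()
    observations, actions, rewards = [], [], []
    done = False
    while not done:
        action = select_action(obs)  # sample a_t ~ pi_theta(. | s_t)
        next_obs, reward, terminated, truncated, _ = env.step(action)
        observations.append(obs)
        actions.append(action)
        rewards.append(reward)
        obs = next_obs
        done = terminated or truncated
    return observations, actions, rewards

# On-policy: gather fresh episodes with the current policy every iteration.
# env = gym.make("LunarLander-v2")  # exact environment id depends on your Gymnasium version
# episodes = [collect_episode(env, select_action) for _ in range(8)]
```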

5. Summary

VPG optimizes the policy directly: collect on-policy trajectories with πθ, compute reward-to-go returns, subtract a learned value baseline Vϕ to form advantages, and take a gradient-ascent step on θ weighted by those advantages. Its simplicity comes at the cost of high gradient variance and limited sample efficiency, which is why a good baseline matters.
