Why Aren’t VPG, A3C, and GAE Enough?

Deep Reinforcement Learning

Last updated: January 01, 2025

1. Introduction

On-policy methods such as VPG (Vanilla Policy Gradient), A3C (Asynchronous Advantage Actor-Critic), and GAE (Generalized Advantage Estimation) laid down essential foundations: they learn from trajectories generated by the current policy, maintaining consistency between exploration and training. However, these methods still suffer from high gradient variance, unstable updates, and poor sample efficiency, which motivates the improvements discussed below.
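To make the on-policy recipe concrete, here is a minimal sketch of a vanilla policy-gradient (REINFORCE-style) update in PyTorch. The small network, the batch tensors, and the reward-to-go returns are illustrative assumptions, not a complete training script.

```python
import torch
import torch.nn as nn

# Hypothetical setup: a small categorical policy for a discrete action space.
obs_dim, n_actions = 4, 2
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

def vpg_update(obs, actions, returns):
    """One vanilla policy-gradient step on a batch collected by the current policy.

    obs:     (N, obs_dim) float tensor of states
    actions: (N,) long tensor of actions taken
    returns: (N,) float tensor of reward-to-go returns for each state-action pair
    """
    dist = torch.distributions.Categorical(logits=policy(obs))
    log_probs = dist.log_prob(actions)
    # Maximize E[log pi(a|s) * R] by minimizing its negative.
    loss = -(log_probs * returns).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```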

2. Why Aren’t VPG, A3C, and GAE Enough?

2a. High Variance in Policy Gradients

The gradient estimate in VPG is built from Monte Carlo returns, which fluctuate widely from one trajectory to the next. Even with a baseline or GAE, noisy gradient estimates mean many samples are needed before the average update points in a reliable direction.

2b. Unstable Updates

Nothing in the plain policy-gradient objective limits how far a single update can move the policy. One overly large step can collapse performance, and because subsequent data is collected with the degraded policy, recovery can be slow.

2c. Poor Sample Efficiency

Because the gradient estimate is only valid for the policy that generated the data, each batch of trajectories is typically used for a single update and then discarded. Environment interaction, often the most expensive part of training, is consumed quickly.
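The sketch below illustrates this sample-efficiency point: a generic on-policy loop collects a fresh batch, applies one update, and throws the batch away. The collect_rollout and vpg_update functions are hypothetical placeholders, not a specific library API.

```python
# A generic on-policy training loop.
# collect_rollout() and vpg_update() are hypothetical placeholders.

def train_on_policy(num_iterations, collect_rollout, vpg_update):
    for it in range(num_iterations):
        # Expensive step: interact with the environment using the *current* policy.
        obs, actions, returns = collect_rollout()

        # Exactly one gradient update per batch...
        vpg_update(obs, actions, returns)

        # ...and the batch is then discarded: after the update the data is off-policy,
        # so the plain policy-gradient estimate is no longer valid for it.
```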

3. What’s Next and Why?

3a. Improvements in Variance Reduction

Better advantage estimation is the first lever. GAE already blends n-step returns through its λ parameter, trading a small amount of bias for a large reduction in variance, and normalizing advantages within a batch pushes in the same direction. A sketch of the standard GAE recursion follows below.
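This is a minimal sketch of the GAE recursion, assuming a single non-terminating rollout segment of length T with a bootstrap value for the final state; episode-termination handling is omitted for brevity.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one rollout segment.

    rewards: array of shape (T,)   -- r_0 .. r_{T-1}
    values:  array of shape (T+1,) -- V(s_0) .. V(s_T); the last entry is the bootstrap value
    """
    T = len(rewards)
    advantages = np.zeros(T)
    running = 0.0
    # delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    # A_t = delta_t + (gamma * lam) * A_{t+1}, accumulated backwards in time.
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```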

3b. Addressing Sample Efficiency

Although still fundamentally on-policy, more advanced methods such as TRPO and PPO use samples more efficiently because their stabilized updates extract more improvement from each batch of rollouts; PPO, in particular, reuses a batch for several epochs of minibatch updates. Since they require fewer rollouts to reach consistent performance gains, they mitigate some of the inefficiency of purely on-policy updates.
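The sketch below shows this reuse pattern as it commonly appears in PPO-style implementations: one batch of on-policy data drives several epochs of shuffled minibatch updates. The rollout arrays and the ppo_update function are assumptions for illustration only.

```python
import numpy as np

def reuse_rollout(rollout, ppo_update, epochs=4, minibatch_size=64, rng=None):
    """Run several epochs of minibatch updates on one batch of on-policy data.

    rollout:    dict of equally sized arrays (e.g. obs, actions, old_log_probs, advantages)
    ppo_update: hypothetical function applying one clipped-surrogate gradient step
    """
    rng = rng or np.random.default_rng(0)
    n = len(next(iter(rollout.values())))
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, minibatch_size):
            idx = order[start:start + minibatch_size]
            minibatch = {k: v[idx] for k, v in rollout.items()}
            ppo_update(minibatch)
```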

4. Why Is This the Logical Progression?

  1. From Variance Reduction to Stability:

    • GAE lowers variance but does not remove the instability caused by large policy updates. Surrogate objectives (as in PPO) incorporate constraints or clipping to keep each update within a safe range.
  2. From Simple Gradients to Optimized Objectives:

    • VPG and A3C follow the raw policy gradient directly. Methods like TRPO add a trust region to keep each update in a safe neighborhood of the current policy, preventing sudden performance regressions.
  3. From Computational Burden to Efficiency:

    • TRPO is theoretically sound but can be resource-intensive, since enforcing its KL-divergence trust region involves second-order computations. PPO refines that approach with a simpler, first-order clipping mechanism (sketched below), providing a good blend of theoretical grounding and practical speed.
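For concreteness, here is a minimal sketch of the PPO clipped surrogate loss in PyTorch. The tensor names and the clip range are assumptions for illustration; value-function and entropy terms are omitted.

```python
import torch

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate objective (negated, so it can be minimized).

    new_log_probs: log pi_theta(a|s) under the current policy
    old_log_probs: log pi_theta_old(a|s) recorded when the data was collected
    advantages:    advantage estimates (e.g. from GAE)
    """
    # Probability ratio r_t(theta) = pi_theta(a|s) / pi_theta_old(a|s).
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Taking the elementwise minimum bounds how much a single update can exploit the estimate.
    return -torch.min(unclipped, clipped).mean()
```

In a full agent, this loss would typically be combined with a value-function loss and an entropy bonus, then minimized on each minibatch during the reuse loop sketched in Section 3b.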

5. Summary

Classic on-policy methods—VPG, A3C, and GAE—established how to learn effectively from the current policy’s experience. However, they still contend with high variance, unstable updates, and suboptimal sample efficiency. The next step is to incorporate:

  • Surrogate objectives that replace the raw policy-gradient loss with a more stable optimization target.
  • Bounded policy changes, enforced through trust-region constraints (TRPO) or clipping (PPO).

By adopting surrogate losses and bounded policy changes, researchers and practitioners can bridge the gap between theoretical and practical on-policy RL, leading to more reliable, efficient training in both academic research and real-world applications.

Next: We will delve into how TRPO and PPO bring these principles to life—demonstrating how constraints, clipping, or penalties in the surrogate objective lead to robust policy optimization.
