1. Introduction
On-policy approaches such as VPG (Vanilla Policy Gradient) and A3C (Asynchronous Advantage Actor-Critic), together with GAE (Generalized Advantage Estimation) for computing advantages, laid down essential foundations: they learn from trajectories generated by the current policy, keeping exploration consistent with training. However, these methods still suffer from high variance, unstable updates, and poor sample efficiency, which highlights the need for further improvements.
2. Why Aren’t VPG, A3C, and GAE Enough?
2a. High Variance in Policy Gradients
- Problem: VPG methods can exhibit high variance in their gradient estimates, slowing or destabilizing learning.
- Cause: Monte Carlo estimates of returns (or advantages) are inherently noisy, especially for stochastic policies.
- Current Mitigation: GAE reduces variance by trading a small amount of bias for substantially lower-variance advantage estimates (a short sketch follows this list). Still, additional machinery is needed for robust optimization.
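As a concrete illustration, here is a minimal NumPy sketch of how GAE folds one-step TD errors into an advantage estimate. The reward and value arrays are made-up placeholders, and the lam parameter controls the bias-variance tradeoff: lam = 1 recovers noisy Monte Carlo advantages, lam = 0 gives the low-variance but more biased one-step estimate.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Compute GAE advantages for a single trajectory.

    rewards: shape (T,)     -- rewards r_t
    values:  shape (T + 1,) -- value estimates V(s_t), including a bootstrap V(s_T)
    For simplicity this assumes the trajectory does not terminate early (no done-mask).
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # One-step TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

# Example with made-up numbers
rewards = np.array([1.0, 0.0, 1.0])
values = np.array([0.5, 0.4, 0.6, 0.0])  # includes bootstrap value for the final state
print(gae_advantages(rewards, values))
```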
2b. Unstable Updates
- Problem: Large, unconstrained updates can ruin the agent’s policy, causing it to forget previously learned good behaviors.
- Cause: Without any mechanism to limit how much the policy changes per update, the optimization process can diverge.
2c. Poor Sample Efficiency
- Problem: On-policy algorithms use each batch of data only once; after the policy is updated, the old samples no longer reflect its behavior and are discarded.
- Why This Matters: In real-world settings (e.g., robotics, real-time simulations), gathering new data is costly. Inefficient use of samples leads to higher computational expense and longer training times.
3. What’s Next and Why?
3a. Stabilizing Updates and Reducing Variance
- Surrogate Loss Functions
  - Why? A more stable, indirect objective can better handle local policy changes.
  - What? In algorithms like PPO, a clipped surrogate objective limits large updates while still balancing exploration and exploitation (the standard form is written out after this list).
- Step Sizing and Trust Regions
  - Why? To constrain policy changes so that each update remains stable.
  - What? TRPO uses a KL-divergence constraint (a “trust region”) to ensure updates don’t deviate too drastically from the old policy (the constrained objective is also shown after this list).
- Proximal Policy Optimization (PPO)
  - Why? TRPO’s constrained optimization is effective but computationally heavy. PPO replaces it with a clipping mechanism for simpler, more practical updates.
  - What? PPO is now among the most widely used on-policy methods, balancing performance and ease of implementation; a minimal code sketch of its clipped loss appears after this list.
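To make these objectives concrete, the surrogate loss that both TRPO and PPO optimize weights the advantage estimate by the probability ratio between the new and old policies, and PPO clips that ratio:

$$
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}, \qquad
L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\Big]
$$

Here $\hat{A}_t$ is the advantage estimate (e.g., from GAE) and $\epsilon$ is the clipping range (0.2 is a common default). TRPO instead keeps the unclipped surrogate and constrains the average KL divergence between the old and new policies:

$$
\max_\theta \; \hat{\mathbb{E}}_t\big[r_t(\theta)\,\hat{A}_t\big] \quad \text{subject to} \quad \hat{\mathbb{E}}_t\big[D_{\mathrm{KL}}\big(\pi_{\theta_\text{old}}(\cdot \mid s_t)\,\|\,\pi_\theta(\cdot \mid s_t)\big)\big] \le \delta
$$

The clipped objective translates into only a few lines of code. Below is a minimal PyTorch sketch of the clipped surrogate loss; the function name and arguments are illustrative placeholders rather than any particular library’s API.

```python
import torch

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate loss (negated so an optimizer can minimize it).

    new_log_probs: log pi_theta(a_t | s_t) under the current policy (requires grad)
    old_log_probs: log pi_theta_old(a_t | s_t) recorded when the data was collected
    advantages:    advantage estimates (e.g., from GAE), treated as constants
    """
    ratio = torch.exp(new_log_probs - old_log_probs)  # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Pessimistic (element-wise minimum) objective, averaged over the batch.
    return -torch.min(unclipped, clipped).mean()
```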
3b. Addressing Sample Efficiency
Although still fundamentally on-policy, methods like TRPO and PPO make better use of each batch of experience by stabilizing learning: because they require fewer rollouts to achieve consistent performance gains, they mitigate some of the inefficiencies of purely on-policy updates. PPO in particular reuses each collected rollout for several epochs of minibatch updates before gathering new data, as sketched below.
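The sketch below illustrates that reuse pattern, building on the ppo_clipped_loss sketch above; policy, optimizer, and the rollout dictionary are hypothetical stand-ins, not a real library API.

```python
import torch

def ppo_update(policy, optimizer, rollout, num_epochs=10, minibatch_size=256):
    """Run several optimization epochs over a single on-policy rollout.

    `rollout` is assumed to be a dict of tensors with keys
    'obs', 'actions', 'old_log_probs', and 'advantages'; `policy` is assumed
    to expose log_prob(obs, actions). Both are illustrative placeholders.
    """
    n = rollout["obs"].shape[0]
    for _ in range(num_epochs):
        perm = torch.randperm(n)  # fresh shuffle of the same rollout each epoch
        for start in range(0, n, minibatch_size):
            idx = perm[start:start + minibatch_size]
            new_log_probs = policy.log_prob(rollout["obs"][idx], rollout["actions"][idx])
            loss = ppo_clipped_loss(new_log_probs,
                                    rollout["old_log_probs"][idx],
                                    rollout["advantages"][idx])
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```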
4. Why Is This the Logical Progression?
- From Variance Reduction to Stability:
  - GAE lowers variance but doesn’t solve the instability caused by large policy updates. Surrogate objectives (as in PPO) add constraints or clipping to keep each update stable.
- From Simple Gradients to Optimized Objectives:
  - VPG and A3C follow the raw policy gradient directly. Methods like TRPO add trust regions to keep updates within a safe neighborhood, preventing sudden regressions.
- From Computational Burden to Efficiency:
  - TRPO is theoretically sound but can be resource-intensive. PPO refines that approach with a simpler, first-order clipping mechanism, providing a good blend of theoretical grounding and practical speed.
5. Summary
Classic on-policy methods—VPG, A3C, and GAE—established how to learn effectively from the current policy’s experience. However, they still contend with high variance, unstable updates, and suboptimal sample efficiency. The next step is to incorporate:
- Surrogate losses (e.g., clipped objectives) for variance reduction and stability.
- Trust regions and step sizing to keep policy updates from spiraling out of control.
- Algorithms like TRPO and PPO that refine these ideas into stable, feasible solutions.
By adopting surrogate losses and bounded policy changes, researchers and practitioners can bridge the gap between theoretical and practical on-policy RL, leading to more reliable, efficient training in both academic research and real-world applications.
Next: We will delve into how TRPO and PPO bring these principles to life—demonstrating how constraints, clipping, or penalties in the surrogate objective lead to robust policy optimization.