1. Introduction
On-policy approaches such as VPG (Vanilla Policy Gradient) and A3C (Asynchronous Advantage Actor-Critic), together with GAE (Generalized Advantage Estimation) for computing advantages, laid down essential foundations: they learn from trajectories generated by the current policy, keeping exploration consistent with training. However, these methods still suffer from high variance, unstable updates, and poor sample efficiency, which highlights the need for further improvements.
2. Why Aren’t VPG, A3C, and GAE Enough?
2a. High Variance in Policy Gradients
- Problem: VPG methods can exhibit high variance in their gradient estimates, slowing or destabilizing learning.
- Cause: Monte Carlo estimates of returns (or advantages) are inherently noisy, especially for stochastic policies.
- Current Mitigation: GAE reduces variance by trading a small amount of bias for substantially lower-variance advantage estimates (a short sketch follows this list). Still, additional machinery is needed for robust optimization.
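As a concrete illustration, here is a minimal NumPy sketch of how GAE folds one-step TD errors into an advantage estimate. The reward and value arrays are made-up placeholders, and the lam parameter controls the bias-variance tradeoff: lam = 1 recovers noisy Monte Carlo advantages, lam = 0 gives the low-variance but more biased one-step estimate.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Compute GAE advantages for a single trajectory.

    rewards: shape (T,)     -- rewards r_t
    values:  shape (T + 1,) -- value estimates V(s_t), including a bootstrap V(s_T)
    For simplicity this assumes the trajectory does not terminate early (no done-mask).
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # One-step TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

# Example with made-up numbers
rewards = np.array([1.0, 0.0, 1.0])
values = np.array([0.5, 0.4, 0.6, 0.0])  # includes bootstrap value for the final state
print(gae_advantages(rewards, values))
```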
2b. Unstable Updates
- Problem: Large, unconstrained updates can ruin the agent’s policy, causing it to forget previously learned good behaviors.
- Cause: Without any mechanism to limit how much the policy changes per update, the optimization process can diverge.
2c. Poor Sample Efficiency
- Problem: On-policy algorithms use each batch of data only once; after the policy is updated, the old samples no longer reflect its behavior and are discarded.
- Why This Matters: In real-world settings (e.g., robotics, real-time simulations), gathering new data is costly. Inefficient use of samples leads to higher computational expense and longer training times.
3. What’s Next and Why?
3a. Stabilizing Updates and Reducing Variance
- Surrogate Loss Functions
  - Why? A more stable, indirect objective can better handle local policy changes.
  - What? In algorithms like PPO, a clipped surrogate objective limits large updates while still balancing exploration and exploitation (the standard form is written out after this list).
- Step Sizing and Trust Regions
  - Why? To constrain policy changes so that each update remains stable.
  - What? TRPO uses a KL-divergence constraint (a “trust region”) to ensure updates don’t deviate too drastically from the old policy (the constrained objective is also shown after this list).
- Proximal Policy Optimization (PPO)
  - Why? TRPO’s constrained optimization is effective but computationally heavy. PPO replaces it with a clipping mechanism for simpler, more practical updates.
  - What? PPO is now among the most widely used on-policy methods, balancing performance and ease of implementation; a minimal code sketch of its clipped loss appears after this list.
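To make these objectives concrete, the surrogate loss that both TRPO and PPO optimize weights the advantage estimate by the probability ratio between the new and old policies, and PPO clips that ratio:

$$
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}, \qquad
L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\Big]
$$

Here $\hat{A}_t$ is the advantage estimate (e.g., from GAE) and $\epsilon$ is the clipping range (0.2 is a common default). TRPO instead keeps the unclipped surrogate and constrains the average KL divergence between the old and new policies:

$$
\max_\theta \; \hat{\mathbb{E}}_t\big[r_t(\theta)\,\hat{A}_t\big] \quad \text{subject to} \quad \hat{\mathbb{E}}_t\big[D_{\mathrm{KL}}\big(\pi_{\theta_\text{old}}(\cdot \mid s_t)\,\|\,\pi_\theta(\cdot \mid s_t)\big)\big] \le \delta
$$

The clipped objective translates into only a few lines of code. Below is a minimal PyTorch sketch of the clipped surrogate loss; the function name and arguments are illustrative placeholders rather than any particular library’s API.

```python
import torch

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate loss (negated so an optimizer can minimize it).

    new_log_probs: log pi_theta(a_t | s_t) under the current policy (requires grad)
    old_log_probs: log pi_theta_old(a_t | s_t) recorded when the data was collected
    advantages:    advantage estimates (e.g., from GAE), treated as constants
    """
    ratio = torch.exp(new_log_probs - old_log_probs)  # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Pessimistic (element-wise minimum) objective, averaged over the batch.
    return -torch.min(unclipped, clipped).mean()
```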
3b. Addressing Sample Efficiency
Although still fundamentally on-policy, methods like TRPO and PPO make better use of each batch of experience by stabilizing learning: because they require fewer rollouts to achieve consistent performance gains, they mitigate some of the inefficiencies of purely on-policy updates. PPO in particular reuses each collected rollout for several epochs of minibatch updates before gathering new data, as sketched below.
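The sketch below illustrates that reuse pattern, building on the ppo_clipped_loss sketch above; policy, optimizer, and the rollout dictionary are hypothetical stand-ins, not a real library API.

```python
import torch

def ppo_update(policy, optimizer, rollout, num_epochs=10, minibatch_size=256):
    """Run several optimization epochs over a single on-policy rollout.

    `rollout` is assumed to be a dict of tensors with keys
    'obs', 'actions', 'old_log_probs', and 'advantages'; `policy` is assumed
    to expose log_prob(obs, actions). Both are illustrative placeholders.
    """
    n = rollout["obs"].shape[0]
    for _ in range(num_epochs):
        perm = torch.randperm(n)  # fresh shuffle of the same rollout each epoch
        for start in range(0, n, minibatch_size):
            idx = perm[start:start + minibatch_size]
            new_log_probs = policy.log_prob(rollout["obs"][idx], rollout["actions"][idx])
            loss = ppo_clipped_loss(new_log_probs,
                                    rollout["old_log_probs"][idx],
                                    rollout["advantages"][idx])
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```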
4. Why Is This the Logical Progression?
- From Variance Reduction to Stability:
  - GAE lowers variance but doesn’t solve the instability caused by large policy updates. Surrogate objectives (as in PPO) add constraints or clipping to keep each update stable.
- From Simple Gradients to Optimized Objectives:
  - VPG and A3C follow the raw policy gradient directly. Methods like TRPO add trust regions to keep updates within a safe neighborhood, preventing sudden regressions.
- From Computational Burden to Efficiency:
  - TRPO is theoretically sound but can be resource-intensive. PPO refines that approach with a simpler, first-order clipping mechanism, providing a good blend of theoretical grounding and practical speed.
5. Summary
Classic on-policy methods—VPG, A3C, and GAE—established how to learn effectively from the current policy’s experience. However, they still contend with high variance, unstable updates, and suboptimal sample efficiency. The next step is to incorporate:
- Surrogate losses (e.g., clipped objectives) for variance reduction and stability.
- Trust regions and step sizing to keep policy updates from spiraling out of control.
- Algorithms like TRPO and PPO that refine these ideas into stable, feasible solutions.
By adopting surrogate losses and bounded policy changes, researchers and practitioners can bridge the gap between theoretical and practical on-policy RL, leading to more reliable, efficient training in both academic research and real-world applications.
Next: We will delve into how TRPO and PPO bring these principles to life—demonstrating how constraints, clipping, or penalties in the surrogate objective lead to robust policy optimization.