Stabilizing and Optimizing On-Policy Learning
Last updated: November 25, 2024
1. Introduction
On-policy methods such as Vanilla Policy Gradient (VPG) and A3C, together with Generalized Advantage Estimation (GAE), establish the foundational concepts for learning from trajectories while staying true to the policy currently being improved. However, they still suffer from high variance, unstable updates, and poor sample efficiency, which motivates the need for further improvements.
2. Why aren’t VPG, A3C, and GAE enough?
2a. High Variance in Policy Gradients
- Issue: Vanilla Policy Gradient (VPG) methods can suffer from high variance in gradient estimates, making training unstable or slow.
- Why This Happens: The Monte Carlo estimation of the return (or advantage) is noisy, especially when sampling trajectories with stochastic policies.
- Current Mitigation: GAE reduces variance by accepting a small amount of bias, controlled by its λ parameter (a sketch of the computation follows below), but further refinements are needed for robust optimization.
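To ground this, here is a minimal NumPy sketch of how GAE computes advantages for a single trajectory from rewards and value estimates. The function name, the γ/λ defaults, and the toy numbers are illustrative assumptions, not values prescribed by this lesson.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one trajectory.

    rewards: shape (T,), per-step rewards.
    values:  shape (T + 1,), value estimates V(s_0..s_T); values[T] bootstraps
             the value after the last step (0 if the episode terminated).
    lam trades bias for variance: lam=0 gives the one-step TD advantage,
    lam=1 recovers the high-variance Monte Carlo estimate.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        gae = delta + gamma * lam * gae                         # discounted sum of residuals
        advantages[t] = gae
    return advantages

# Toy usage: a 4-step trajectory with made-up rewards and value estimates.
adv = gae_advantages(np.array([1.0, 0.0, 0.5, 1.0]),
                     np.array([0.8, 0.7, 0.6, 0.9, 0.0]))
```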
2b. Unstable Updates
- Issue: Large policy updates can destabilize training, causing the agent to forget previously learned good behaviors.
- Why This Happens: Without mechanisms to constrain how much the policy changes per update, the training process can diverge.
2c. Poor Sample Efficiency
- Issue: On-policy algorithms discard data after a single use, even if it's useful for learning. This leads to high computational costs and longer training times.
- Why This Matters: In real-world scenarios, data collection (e.g., robotic interaction or real-time system simulations) is expensive.
3. What’s Next and Why?
3a. Improvements in Variance Reduction
Building on GAE and baseline subtraction, the next steps aim to address variance further while introducing stability and efficiency.
- Surrogate Loss Functions
- Why?: To create a more stable objective for policy optimization.
- What?: Instead of directly maximizing the raw policy-gradient objective, surrogate losses such as the clipped surrogate objective (used in PPO) limit how much the policy can change in a single update, trading off improvement against stability (a code sketch follows this list).
- Step Sizing and Trust Regions
- Why?: To ensure updates do not cause large, destabilizing policy shifts.
- What?: Trust Region Policy Optimization (TRPO) introduces a constraint (based on KL-divergence) to guarantee that updates respect a "trust region" around the current policy.
- Proximal Policy Optimization (PPO)
- Why?: TRPO, while effective, is computationally expensive due to its constrained, second-order optimization step. PPO simplifies this with a clipping mechanism, making it more efficient and practical.
- What?: PPO optimizes a clipped surrogate objective that caps how far the new policy's action probabilities can move from the old policy's in each update (see the sketch after this list). It is now one of the most widely used on-policy methods due to its simplicity and effectiveness.
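As a concrete illustration of the items above, here is a minimal PyTorch sketch of the clipped surrogate loss, together with an approximate KL-divergence readout in the spirit of a trust-region check. The function name, the 0.2 clip threshold, and the toy inputs are assumptions for illustration; a full PPO loss would also include value-function and entropy terms.

```python
import torch

def ppo_surrogate_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate objective, negated so it can be minimized.

    new_log_probs: log pi_theta(a|s) under the policy being optimized.
    old_log_probs: log pi_theta_old(a|s) recorded when the data was collected.
    advantages:    advantage estimates, e.g. from GAE.
    """
    ratio = torch.exp(new_log_probs - old_log_probs)             # pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    loss = -torch.min(unclipped, clipped).mean()                 # pessimistic (lower) bound

    # Rough batch estimate of KL(pi_old || pi_theta): a cheap, trust-region-flavored
    # diagnostic for how far the update has drifted from the data-collecting policy.
    approx_kl = (old_log_probs - new_log_probs).mean()
    return loss, approx_kl

# Toy usage with made-up numbers:
loss, kl = ppo_surrogate_loss(torch.tensor([-1.1, -0.4]),
                              torch.tensor([-1.0, -0.5]),
                              torch.tensor([0.8, -0.3]))
```

TRPO enforces the KL term as a hard constraint via a more expensive constrained optimization; PPO's clipping (or, in one variant, a KL penalty) achieves a similar effect with plain first-order updates.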
3b. Addressing Sample Efficiency
Although still on-policy, the later methods (TRPO, PPO) make better use of sampled data. Their more stable objectives let each batch of rollouts drive larger, safer improvement steps, and PPO in particular reuses each rollout for several epochs of minibatch updates (sketched below), so fewer rollouts are needed for effective updates.
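To make the reuse concrete, below is an illustrative PyTorch training loop that runs several epochs of minibatch updates over a single rollout batch. All shapes, hyperparameters, and the random placeholder data are assumptions for the sketch; a real implementation would collect the batch from the environment and recompute advantages before each iteration.

```python
import torch

# Hypothetical rollout batch collected by the current policy (placeholder data).
obs = torch.randn(2048, 8)                       # observations
actions = torch.randint(0, 4, (2048,))           # discrete actions
old_log_probs = torch.randn(2048)                # log-probs stored at collection time
advantages = torch.randn(2048)                   # e.g. GAE estimates

policy = torch.nn.Sequential(torch.nn.Linear(8, 64), torch.nn.Tanh(), torch.nn.Linear(64, 4))
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

def clipped_loss(logits, acts, old_lp, adv, clip_eps=0.2):
    """Clipped surrogate loss for a minibatch of discrete actions."""
    dist = torch.distributions.Categorical(logits=logits)
    ratio = torch.exp(dist.log_prob(acts) - old_lp)
    return -torch.min(ratio * adv,
                      torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv).mean()

# A single VPG/A3C update would touch this batch once; PPO-style training
# makes several epochs of shuffled minibatch updates over the same rollout.
num_epochs, minibatch_size = 4, 256
for _ in range(num_epochs):
    perm = torch.randperm(obs.shape[0])
    for start in range(0, obs.shape[0], minibatch_size):
        idx = perm[start:start + minibatch_size]
        loss = clipped_loss(policy(obs[idx]), actions[idx],
                            old_log_probs[idx], advantages[idx])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```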
4. Why is this the logical progression?
- From Variance Reduction to Stability:
- GAE introduces variance reduction techniques, which are great but incomplete without mechanisms to stabilize training and ensure robust updates. This sets the stage for surrogate objectives like those in PPO.
- From Simple Gradients to Optimized Objectives:
- VPG and A3C simply follow the raw policy gradient, but these unconstrained updates lead to poor convergence. Trust regions address this issue.
- From Computational Burden to Efficiency:
- TRPO adds constraints but is resource-intensive, so PPO refines the process for practical deployment without sacrificing performance.
5. Summary
This lesson builds on foundational on-policy methods (VPG, A3C, GAE) and highlights their limitations, such as high variance, unstable updates, and poor sample efficiency. To address these, we explore how surrogate losses, step sizing, and trust regions enable stable and efficient policy updates. Advanced algorithms like TRPO and PPO naturally follow, offering solutions that improve variance reduction, stability, and computational feasibility.
Next, we’ll dive into how these approaches revolutionize policy optimization, bridging the gap between theoretical robustness and real-world practicality.