Stabilizing and Optimizing On-Policy Learning

Reinforcement Learning

Last updated: November 25, 2024

1. Introduction

On-policy methods like VPG, A3C, and GAE establish the foundational concepts for learning from trajectories while staying true to the policy currently being improved. However, these methods suffer from high gradient variance, unstable updates, and poor sample efficiency, which motivates the improvements covered in this lesson.

2. Why aren’t VPG, A3C, and GAE enough?

2a. High Variance in Policy Gradients

Policy-gradient estimates built from Monte Carlo returns are noisy, so VPG and A3C need many trajectories (or careful baselines) before an update reliably points in a good direction.

2b. Unstable Updates

A single large gradient step can change the policy drastically, and because the next batch of data is collected by that changed policy, one bad update can send training into collapse.

2c. Poor Sample Efficiency

On-policy data is only valid for the policy that collected it, so each trajectory is typically used for a single gradient update and then discarded.

3. What’s Next and Why?

3a. Improvements in Variance Reduction

Building on GAE and baseline subtraction, the next steps aim to address variance further while introducing stability and efficiency.
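
As a quick refresher on the estimator this section builds on, below is a minimal NumPy sketch of the GAE advantage computation. The function name is illustrative, and it assumes a single rollout with no episode termination inside it.

```python
import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one rollout of length T.

    rewards: shape (T,) per-step rewards.
    values:  shape (T + 1,) value estimates, including a bootstrap
             value for the state after the last step.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Recursion: A_t = delta_t + gamma * lambda * A_{t+1}
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```

The lambda parameter interpolates between the low-variance one-step TD estimate (lambda = 0) and the high-variance Monte Carlo return (lambda = 1).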

  1. Surrogate Loss Functions

    • Why?: To create a more stable objective for policy optimization.
    • What?: Instead of following the raw policy gradient, surrogate losses like the clipped surrogate objective (used in PPO) limit how much the policy can change per update, keeping optimization stable (a minimal sketch appears after this list).
  2. Step Sizing and Trust Regions

    • Why?: To ensure updates do not cause large, destabilizing policy shifts.
    • What?: Trust Region Policy Optimization (TRPO) introduces a constraint (based on KL-divergence) to guarantee that updates respect a "trust region" around the current policy.
  3. Proximal Policy Optimization (PPO)

    • Why?: TRPO, while effective, is computationally expensive due to its reliance on constrained optimization. PPO simplifies this with a clipping mechanism, making it more efficient and practical.
    • What?: PPO optimizes the clipped surrogate objective from item 1 with ordinary first-order methods, and it is now one of the most widely used on-policy methods due to its simplicity and effectiveness.
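
To make items 1-3 concrete, here is a minimal PyTorch sketch of PPO's clipped surrogate loss, plus a simple sample-based KL estimate that many implementations use to monitor how far an update drifts from the data-collecting policy. The function names are illustrative.

```python
import torch

def ppo_clipped_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective used by PPO.

    logp_new: log pi_theta(a_t | s_t) under the current policy.
    logp_old: log-probs recorded when the data was collected (no gradient).
    advantages: advantage estimates A_t (e.g., from GAE).
    """
    # Probability ratio r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t).
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # The elementwise minimum removes any incentive to push the ratio
    # outside [1 - eps, 1 + eps]; negate to turn maximization into a loss.
    return -torch.min(unclipped, clipped).mean()

def approx_kl(logp_new, logp_old):
    """Sample-based estimate of KL(pi_old || pi_new), the quantity TRPO
    constrains; PPO implementations often log it or use it to stop early."""
    return (logp_old - logp_new).mean()
```

The clipping threshold plays a role similar to TRPO's KL constraint: both keep the new policy close to the one that generated the data, but clipping needs only first-order optimization.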

3b. Addressing Sample Efficiency

Although still on-policy, the later methods (TRPO and especially PPO) make better use of sampled data: because the surrogate objective keeps the updated policy close to the policy that collected the trajectories, each rollout can be reused for several epochs of minibatch updates instead of a single gradient step, as sketched below.
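
The sketch below illustrates this data-reuse pattern. The `policy` object (with a `log_prob` method), the `batch` dictionary of rollout tensors, and the function name are all hypothetical stand-ins, not part of the original lesson.

```python
import torch

def ppo_update(policy, optimizer, batch, epochs=4, clip_eps=0.2):
    """Illustrative only: run several optimization epochs over one
    on-policy batch instead of discarding it after a single step.

    Assumes policy.log_prob(obs, actions) returns per-sample log
    probabilities and batch holds tensors collected by the old policy.
    """
    for _ in range(epochs):
        logp_new = policy.log_prob(batch["obs"], batch["actions"])
        # Clipped surrogate loss, as in the previous sketch.
        ratio = torch.exp(logp_new - batch["logp_old"])
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
        loss = -torch.min(ratio * batch["adv"], clipped * batch["adv"]).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Reusing each rollout for a handful of epochs, while the clipping keeps that reuse safe, is a key reason PPO needs fewer environment interactions than VPG for comparable progress.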

4. Why is this the logical progression?

  1. From Variance Reduction to Stability:
    • GAE introduces variance reduction techniques, but these are incomplete without mechanisms to stabilize training and ensure robust updates. This sets the stage for surrogate objectives like the clipped loss in PPO.
  2. From Simple Gradients to Optimized Objectives:
    • VPG and A3C follow the raw policy gradient, but unconstrained update sizes lead to poor convergence. Trust regions address this issue.
  3. From Computational Burden to Efficiency:
    • TRPO adds constraints but is resource-intensive, so PPO refines the process for practical deployment without sacrificing performance.

5. Summary

This lesson builds on foundational on-policy methods (VPG, A3C, GAE) and highlights their limitations, such as high variance, unstable updates, and poor sample efficiency. To address these, we explore how surrogate losses, step sizing, and trust regions enable stable and efficient policy updates. Advanced algorithms like TRPO and PPO naturally follow, offering solutions that improve variance reduction, stability, and computational feasibility.

Next, we’ll dive into how these approaches revolutionize policy optimization, bridging the gap between theoretical robustness and real-world practicality.
