1. Introduction
In previous lessons, we explored the limitations of foundational on-policy methods like Vanilla Policy Gradient (VPG), A3C, and Generalized Advantage Estimation (GAE). While these approaches introduced techniques to reduce variance and stabilize training, they struggle to ensure reliable policy updates.
To address this, we introduced the concept of a surrogate loss, which simplifies optimization by providing a more stable objective. However, key questions remain:
- How can we ensure updates move in the right direction without destabilizing the policy?
- How do we guarantee consistent progress without risking catastrophic degradation?
These challenges lead us to step sizing and trust regions. Properly controlling how far each policy update moves prevents harmful adjustments and maintains stability. This lesson explains why step sizing matters in reinforcement learning and why it is harder than in supervised learning, and then introduces the Trust Region Policy Optimization (TRPO) algorithm, a milestone in stabilizing on-policy methods.
2. Why Step Size Matters
2a. What Is Step Size?
Step size determines how far the policy is updated along the gradient direction. While the gradient indicates the best local direction for improvement, it doesn’t specify how far to move. Selecting the right step size is critical:
- Too Small: Slows progress, requiring many updates for noticeable improvement.
- Too Large: Risks overshooting the optimal policy, causing instability or failure.
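To make the role of the step size concrete, consider the standard gradient-ascent update on the policy parameters $\theta$, where $J(\theta)$ is the expected return and $\alpha$ is the step size:

$$
\theta_{k+1} = \theta_k + \alpha \, \nabla_\theta J(\theta_k)
$$

The gradient $\nabla_\theta J(\theta_k)$ fixes only the direction; how aggressive the update is lives entirely in $\alpha$.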
2b. Why Is Step Sizing Necessary?
Gradient-based updates are only reliable within a limited range. Moving too far invalidates the gradient’s approximation, leading to harmful updates instead of beneficial ones.
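This is because the gradient captures only the first-order (linear) behavior of the objective around the current parameters:

$$
J(\theta + \Delta\theta) \approx J(\theta) + \nabla_\theta J(\theta)^\top \Delta\theta
$$

The approximation is trustworthy only while $\Delta\theta$ is small; for larger steps, curvature and other higher-order effects dominate, and a direction that looked like an improvement can instead decrease performance.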
2c. Key Challenges of Step Sizing
- First-Order Approximation Limitations: Gradients are local approximations and don't account for non-linearities or constraints. Large steps assume the gradient direction remains accurate, which isn't true beyond a small neighborhood.
- Differences Between Supervised Learning and RL:
  - Supervised Learning: Bad step sizes are recoverable because the dataset is fixed, allowing the model to learn from consistent data in subsequent updates.
  - Reinforcement Learning: Poor updates lead to degraded policies, which collect low-quality data, creating a feedback loop that prevents recovery and stalls learning.
- Bad Step Sizes in RL:
  - Wasted resources on poor updates or interactions under suboptimal policies.
  - Catastrophic degradation requiring complete retraining or inefficient recovery methods like shrinking the step size.
2d. Simplest Approach: Line Search
A simple method for step sizing is line search, which tests various step sizes by evaluating their performance through rollouts. While straightforward, this approach has significant drawbacks:
- Computational Expense: Requires multiple evaluations along the line, increasing training time.
- Naivety: Ignores where the gradient approximation is reliable, often leading to suboptimal updates.
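As an illustration, here is a minimal sketch of this naive line search. The helper `estimate_return` is a hypothetical function that runs fresh rollouts to score a candidate policy; having to call it once per candidate step size is exactly what makes the approach expensive.

```python
import numpy as np

def naive_line_search(theta, grad, estimate_return,
                      step_sizes=(1e-3, 1e-2, 1e-1, 1.0)):
    """Try several step sizes along the gradient and keep the best candidate.

    theta           -- current policy parameters (numpy array)
    grad            -- policy gradient estimate at theta
    estimate_return -- hypothetical helper: runs rollouts with the given
                       parameters and returns an estimate of average return
    """
    best_theta, best_return = theta, estimate_return(theta)
    for alpha in step_sizes:
        candidate = theta + alpha * grad               # step along the gradient
        candidate_return = estimate_return(candidate)  # fresh rollouts -- expensive
        if candidate_return > best_return:
            best_theta, best_return = candidate, candidate_return
    return best_theta, best_return

# Illustrative usage (toy parameters):
# theta, grad = np.zeros(4), np.ones(4)
# new_theta, new_return = naive_line_search(theta, grad, estimate_return)
```

Note that the candidate step sizes are chosen blindly: nothing in the search asks whether the gradient is still a valid approximation at a given distance, which is the naivety noted above.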
2e. A More Sophisticated Approach
More advanced methods build on the surrogate loss, using additional information about the policy's structure to make better-informed decisions about step sizes. These approaches balance computational efficiency with effective policy improvement, avoiding the pitfalls of naive techniques.
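For context, one common form of such a surrogate is the importance-weighted objective sketched below. This is a minimal sketch assuming we already have per-sample log-probabilities of the chosen actions under the old (data-collecting) and new policies, plus advantage estimates; how far this objective can be trusted away from the old policy is exactly what TRPO's trust region will control.

```python
import numpy as np

def surrogate_objective(new_logp, old_logp, advantages):
    """Importance-weighted surrogate: mean of (pi_new / pi_old) * advantage.

    new_logp, old_logp -- log-probabilities of the sampled actions under the
                          new and old (data-collecting) policies
    advantages         -- advantage estimates for the same samples (e.g. GAE)
    """
    ratios = np.exp(new_logp - old_logp)  # pi_new(a|s) / pi_old(a|s)
    return np.mean(ratios * advantages)
```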
2f. Consequences of Poor Step Sizing
- Cascading Errors: A single bad update degrades the policy, leading to suboptimal data and amplified errors.
- Irrecoverable Policies: Unlike supervised learning, RL lacks a static dataset, so a degraded policy may stall learning entirely.
- Wasted Resources: Poor updates waste time and compute on flawed data collection and restarts.