Surrogate Loss, Step Sizing, and Trust Regions in On-Policy RL
Last updated: January 01, 2025
Introduction
Why Do On-Policy Methods Need These Concepts?
In on-policy policy gradient methods (like TRPO or PPO), the agent collects trajectories using its current policy, then updates that policy based on the newly collected data. However:
- Data Reuse is difficult: once the policy changes, old trajectories no longer come from the distribution the current policy would induce, so they cannot be used naively.
- Large Policy Updates can devastate performance, because future data collection relies on the quality of the updated policy.
We introduce two ideas to mitigate these challenges:
- Surrogate Loss (via importance sampling) to reuse old trajectories in a principled way, preventing waste and improving stability.
- Step Sizing/Trust Regions to limit how far the policy changes each update, avoiding catastrophic dips in performance that stall learning.
Part I: Surrogate Loss for On-Policy Methods
1. Motivation
- On-Policy Data: Typically, on-policy RL discards old data once the policy changes. This is sample-inefficient and can slow progress in environments where rollouts are expensive.
- Importance Sampling: A surrogate loss reweights old trajectories, letting them remain useful for the new policy without introducing bias. This fosters stability and efficiency.
2. The Surrogate Loss: Foundations
2a. The True Objective $U(\theta)$
Our ultimate goal is to maximize expected return: $$U(\theta)= \mathbb{E}_{\tau \sim \pi_\theta}[\,R(\tau)\,].$$
However, sampling directly from $\pi_\theta$ for every update is costly. Instead, we use importance sampling on old data from $\pi_{\theta_\text{old}}$:
$$U(\theta)= \mathbb{E}_{\tau \sim \pi_{\theta_\text{old}}}\Bigl[\frac{\pi_\theta(\tau)}{\pi_{\theta_\text{old}}(\tau)} \, R(\tau)\Bigr].$$
The ratio $\frac{\pi_\theta(\tau)}{\pi_{\theta_\text{old}}(\tau)}$ corrects for the mismatch between the policy that collected the data and the new policy being evaluated.
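For completeness, this reweighting is a one-line identity (writing the expectation as a sum over trajectories):
$$U(\theta)= \sum_\tau \pi_\theta(\tau)\,R(\tau) = \sum_\tau \pi_{\theta_\text{old}}(\tau)\,\frac{\pi_\theta(\tau)}{\pi_{\theta_\text{old}}(\tau)}\,R(\tau) = \mathbb{E}_{\tau \sim \pi_{\theta_\text{old}}}\Bigl[\frac{\pi_\theta(\tau)}{\pi_{\theta_\text{old}}(\tau)}\,R(\tau)\Bigr],$$
valid whenever $\pi_{\theta_\text{old}}(\tau) > 0$ for every trajectory with $\pi_\theta(\tau) > 0$. Because both policies act in the same environment, the transition probabilities cancel in the ratio, leaving $\frac{\pi_\theta(\tau)}{\pi_{\theta_\text{old}}(\tau)} = \prod_t \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}$, which is computable from the policies alone.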
2b. Gradient of $U(\theta)$
Taking the gradient of the importance-sampled objective (using $\nabla_\theta \pi_\theta(\tau) = \pi_\theta(\tau)\,\nabla_\theta \log \pi_\theta(\tau)$), $$\nabla_\theta U(\theta)= \mathbb{E}_{\tau \sim \pi_{\theta_\text{old}}}\Bigl[\frac{\pi_\theta(\tau)}{\pi_{\theta_\text{old}}(\tau)}\,\nabla_\theta \log \pi_\theta(\tau)\, R(\tau)\Bigr].$$
This shows that we can update $\pi_\theta$ using old trajectories, as long as we reweight by the ratio between $\pi_\theta$ and $\pi_{\theta_\text{old}}$.
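Note that when this gradient is evaluated at $\theta = \theta_\text{old}$, the ratio equals 1 and the expression reduces to the familiar likelihood-ratio (REINFORCE) gradient:
$$\nabla_\theta U(\theta)\Big|_{\theta=\theta_\text{old}} = \mathbb{E}_{\tau \sim \pi_{\theta_\text{old}}}\Bigl[\nabla_\theta \log \pi_\theta(\tau)\Big|_{\theta=\theta_\text{old}}\, R(\tau)\Bigr].$$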
2c. Practical Approximation: Surrogate Loss
In code and practice, we define a surrogate objective:
$$L_s(\theta)= \mathbb{E}_{\tau \sim \pi_{\theta_\text{old}}}\Bigl[\frac{\pi_\theta(\tau)}{\pi_{\theta_\text{old}}(\tau)} \, R(\tau)\Bigr].$$
This maintains an unbiased gradient (via importance sampling) and allows reuse of old samples, improving sample efficiency.
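As a concrete illustration, here is a minimal PyTorch-style sketch of $L_s(\theta)$ computed on a batch of stored trajectories. It assumes hypothetical conventions, not a specific library: `policy(obs)` returns a `torch.distributions` object over actions, and each trajectory dict stores its observations, actions, the log-probabilities recorded at collection time, and its total return.

```python
import torch

def surrogate_loss(policy, trajectories):
    """Trajectory-level surrogate L_s(theta): importance-weighted returns
    from data collected under pi_theta_old (a sketch, not a full implementation)."""
    losses = []
    for traj in trajectories:
        dist = policy(traj["obs"])                       # action distributions under pi_theta
        new_logp = dist.log_prob(traj["actions"]).sum()  # log pi_theta(tau); dynamics cancel in the ratio
        old_logp = traj["old_logp"].sum()                # stored at collection time, no gradient
        ratio = torch.exp(new_logp - old_logp)           # pi_theta(tau) / pi_theta_old(tau)
        losses.append(-ratio * traj["return"])           # negate because optimizers minimize
    return torch.stack(losses).mean()

# Typical usage (sketch): loss = surrogate_loss(policy, batch)
#                         loss.backward(); optimizer.step()
```

In practice, whole-trajectory ratios become numerically unstable over long horizons, so implementations such as PPO use per-timestep ratios combined with advantage estimates instead; the sketch mirrors the trajectory-level formula above for clarity.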
3. Why the Surrogate Loss Works
- Sample Efficiency: We avoid discarding old trajectories, which is especially critical in high-dimensional or costly environments.
- Stability: Constraints (like clipping in PPO or trust regions in TRPO) can keep $\pi_\theta$ close to $\pi_{\theta_\text{old}}$, preventing huge changes that destabilize the policy (a clipped-surrogate sketch follows this list).
- Flexibility: Importance sampling addresses the mismatch between old and new policies mathematically, making it applicable to many policy parameterizations.
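As a concrete instance of the clipping mentioned in the Stability point, here is a minimal per-timestep sketch of the PPO-style clipped surrogate. The names `logp_new`, `logp_old`, and `advantages` are illustrative assumptions: flat tensors over all collected timesteps.

```python
import torch

def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate (negated so an optimizer can minimize it).

    logp_new:   log pi_theta(a_t | s_t) under the current policy (requires grad)
    logp_old:   log pi_theta_old(a_t | s_t) stored at collection time (detached)
    advantages: advantage estimates, one per timestep
    """
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # The elementwise minimum removes any incentive to push the ratio
    # outside [1 - eps, 1 + eps] when doing so would increase the objective.
    return -torch.min(unclipped, clipped).mean()
```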
Part II: Step Sizing & Trust Regions in RL
Even with a surrogate loss that reuses old data, how far we move along the gradient each update is critical. In reinforcement learning:
- A single bad update can collapse the policy.
- We collect new data using the updated policy. If the updated policy is poor, subsequent rollouts become low-quality, creating a downward spiral.
1. Why Step Size Matters
- Local Approximation: Policy gradients are reliable only for small updates. Large steps violate the assumptions behind the gradient direction.
- Dynamic Data: Unlike supervised learning, RL’s dataset is constantly changing with the policy. A “bad step” leads to worse future data, not just a temporary drop in performance.
- Catastrophic Failure: Overshooting the optimum can yield a severely degraded policy that might never recover.
2. Challenges with Step Sizing
- Too Small: Convergence is painstakingly slow.
- Too Large: We risk extreme changes that degrade or “break” the policy.
- Naive Line Search: searching for a good step size by running fresh rollouts at each candidate step is expensive, and it ignores the fact that the gradient approximation is only valid locally.
3. Trust Regions
A trust region is a bounded neighborhood around the old policy within which the local approximation of the objective is trusted to be accurate. We limit the policy update to stay inside this region, preventing drastic changes in behavior.
- Prevents Overly Aggressive Steps: If the new policy is “too far,” it’s scaled back to ensure a safe update (see the sketch after this list).
- Maintains Data Quality: The new policy remains competent enough to collect decent data next iteration.
- Theoretical Guarantees: Methods like TRPO guarantee monotonic improvement under certain assumptions.
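As a rough sketch of the “scale back if too far” idea from the list above, here is a backtracking line search that shrinks a proposed parameter step until the mean KL divergence to the old policy falls under a threshold and the surrogate improves. The helpers `flat_params`, `set_flat_params`, `surrogate`, and `mean_kl` are hypothetical stand-ins, and this simplifies TRPO's actual constrained update (which also computes the step direction using the Fisher information matrix).

```python
def trust_region_step(policy, full_step, surrogate, mean_kl, max_kl=0.01,
                      backtrack_ratio=0.5, max_backtracks=10):
    """Backtracking line search: shrink a proposed parameter step until the
    new policy stays inside the KL trust region and improves the surrogate.

    Assumed helpers (hypothetical, for illustration only):
      flat_params(policy) / set_flat_params(policy, vec): read/write the
          policy's parameters as a single flat vector,
      surrogate(policy): surrogate objective on the stored batch (higher is better),
      mean_kl(policy): average KL between old and new action distributions.
    """
    old_params = flat_params(policy)
    old_value = surrogate(policy)
    for i in range(max_backtracks):
        step = (backtrack_ratio ** i) * full_step
        set_flat_params(policy, old_params + step)
        if mean_kl(policy) <= max_kl and surrogate(policy) > old_value:
            return True                      # accept the (possibly shrunk) step
    set_flat_params(policy, old_params)      # every candidate was too far or no better
    return False
```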
3a. Benefits of Trust Regions
- Consistency: Updates remain within a region where the surrogate objective is accurate.
- Efficiency: Minimizes retries or collapses, thus reducing wasted compute resources.
- Empirical Success: TRPO and PPO demonstrate how bounding policy updates yields stable and effective on-policy optimization.
Summary & Next Steps
- Surrogate Loss: Enables reusing old data in a principled way, forming the core of many modern on-policy methods (TRPO, PPO).
- Step Sizing & Trust Regions: Constrain how much the policy changes, preventing catastrophic degradation.
- Combined Effect: This synergy leads to stable and sample-efficient on-policy methods capable of tackling complex environments.
Up Next:
- Trust Region Policy Optimization (TRPO): Formalizes the notion of KL-based trust regions, with monotonic-improvement guarantees under its theoretical assumptions.
- Proximal Policy Optimization (PPO): A more practical, often easier-to-tune variant that uses a clipped objective rather than explicit KL constraints.
By balancing data reuse (surrogate loss) and bounded updates (trust regions), on-policy RL methods can steadily improve performance without risking catastrophic failures—core principles behind state-of-the-art algorithms in policy gradient reinforcement learning.