Surrogate Loss, Step Sizing, and Trust Regions in On-Policy RL

Deep Reinforcement Learning

Last updated: January 01, 2025

Introduction

Why Do On-Policy Methods Need These Concepts?

In on-policy policy gradient methods (like TRPO or PPO), the agent collects trajectories using its current policy, then updates that policy based on the newly collected data. However:

  1. Data Reuse is difficult: once the policy changes, previously collected trajectories no longer come from the current policy's distribution, making the old data “stale.”
  2. Large Policy Updates can devastate performance, because future data collection relies on the quality of the updated policy.

We introduce two ideas to mitigate these challenges: the surrogate loss (Part I), which lets us reuse old data safely, and step sizing with trust regions (Part II), which keeps each update bounded.

Part I: Surrogate Loss for On-Policy Methods

1. Motivation

We want to improve the policy using trajectories gathered by an earlier version of that policy, without biasing our estimate of how well the updated policy would perform. The surrogate loss achieves this by reweighting old samples via importance sampling.

2. The Surrogate Loss: Foundations

2a. The True Objective $U(\theta)$

Our ultimate goal is to maximize expected return: $$U(\theta)= \mathbb{E}_{\tau \sim \pi_\theta}[\,R(\tau)\,].$$

However, sampling directly from $\pi_\theta$ for every update is costly. Instead, we use importance sampling on old data from $\pi_{\theta_\text{old}}$:

$$U(\theta)= \mathbb{E}_{\tau \sim \pi_{\theta_\text{old}}}\Bigl[\frac{\pi_\theta(\tau)}{\pi_{\theta_\text{old}}(\tau)} \, R(\tau)\Bigr].$$

The ratio $\frac{\pi_\theta(\tau)}{\pi_{\theta_\text{old}}(\tau)}$ corrects for the mismatch between the policy that collected the data and the new policy being evaluated.
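To see why this reweighting is exact, expand the expectation over trajectories (written as a sum here; in continuous spaces the sum becomes an integral):

$$\mathbb{E}_{\tau \sim \pi_{\theta_\text{old}}}\Bigl[\frac{\pi_\theta(\tau)}{\pi_{\theta_\text{old}}(\tau)}\,R(\tau)\Bigr] = \sum_{\tau} \pi_{\theta_\text{old}}(\tau)\,\frac{\pi_\theta(\tau)}{\pi_{\theta_\text{old}}(\tau)}\,R(\tau) = \sum_{\tau} \pi_\theta(\tau)\,R(\tau) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)],$$

provided $\pi_{\theta_\text{old}}(\tau) > 0$ wherever $\pi_\theta(\tau)\,R(\tau) \neq 0$.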

2b. Gradient of $U(\theta)$

Taking the gradient of the importance-sampled form gives $$\nabla_\theta U(\theta)= \mathbb{E}_{\tau \sim \pi_{\theta_\text{old}}}\Bigl[\frac{\pi_\theta(\tau)}{\pi_{\theta_\text{old}}(\tau)}\,\nabla_\theta \log \pi_\theta(\tau)\, R(\tau)\Bigr],$$ and evaluating at $\theta = \theta_\text{old}$, where the ratio equals 1, recovers the familiar policy gradient $$\nabla_\theta U(\theta)\Big|_{\theta = \theta_\text{old}}= \mathbb{E}_{\tau \sim \pi_{\theta_\text{old}}}\Bigl[\nabla_\theta \log \pi_\theta(\tau)\, R(\tau)\Bigr].$$

This shows that we can update $\pi_\theta$ using old trajectories, as long as we properly account for the difference between $\pi_\theta$ and $\pi_{\theta_\text{old}}$.

2c. Practical Approximation: Surrogate Loss

In code and practice, we define a surrogate objective:
$$L_s(\theta)= \mathbb{E}_{\tau \sim \pi_{\theta_\text{old}}}\Bigl[\frac{\pi_\theta(\tau)}{\pi_{\theta_\text{old}}(\tau)} \, R(\tau)\Bigr].$$

This maintains an unbiased gradient (via importance sampling) and allows reuse of old samples, improving sample efficiency.
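As a concrete illustration, here is a minimal PyTorch-style sketch of this trajectory-level surrogate. The function and argument names are illustrative, not from a specific library, and practical implementations (such as PPO) usually work with per-timestep ratios and advantage estimates rather than whole-trajectory returns:

```python
import torch

def surrogate_loss(new_logp, old_logp, returns):
    """Trajectory-level surrogate L_s(theta) via importance sampling.

    new_logp : log pi_theta(tau) for each sampled trajectory (requires grad)
    old_logp : log pi_theta_old(tau), recorded at collection time (no grad)
    returns  : total return R(tau) for each trajectory
    """
    # Importance ratio pi_theta(tau) / pi_theta_old(tau), computed in log space.
    ratio = torch.exp(new_logp - old_logp.detach())
    # Negated because optimizers minimize; maximizing L_s == minimizing -L_s.
    return -(ratio * returns).mean()
```

Gradient ascent on $L_s$ then backpropagates only through new_logp, since old_logp and returns are fixed statistics of the collected batch.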

3. Why the Surrogate Loss Works

  1. Sample Efficiency: We avoid discarding old trajectories, which is especially critical in high-dimensional or costly environments.
  2. Stability: Constraints (like clipping in PPO or trust regions in TRPO) can keep $\pi_\theta$ close to $\pi_{\theta_\text{old}}$, preventing huge changes that destabilize the policy (see the sketch after this list).
  3. Flexibility: Importance sampling addresses the mismatch between old and new policies mathematically, making it applicable to many policy parameterizations.
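As a preview of the clipping idea mentioned above, here is a minimal PPO-style sketch. The names and the advantage input adv are illustrative; the clipped objective itself is covered in the next lesson:

```python
import torch

def clipped_surrogate(new_logp, old_logp, adv, clip_eps=0.2):
    """PPO-style clipped surrogate: discourages ratios far from 1."""
    ratio = torch.exp(new_logp - old_logp.detach())
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    # Taking the elementwise minimum removes any incentive to push the
    # ratio outside the [1 - clip_eps, 1 + clip_eps] range.
    return -torch.min(unclipped, clipped).mean()
```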

Part II: Step Sizing & Trust Regions in RL

Even with a surrogate loss that reuses old data, how far we move along the gradient at each update is critical. In reinforcement learning, the cost of a poorly sized step is unusually high, for several reasons.

1. Why Step Size Matters

  1. Local Approximation: Policy gradients are reliable only for small updates. Large steps violate the assumptions behind the gradient direction (see the toy example after this list).
  2. Dynamic Data: Unlike supervised learning, RL’s dataset is constantly changing with the policy. A “bad step” leads to worse future data, not just a temporary drop in performance.
  3. Catastrophic Failure: Overshooting the optimum can yield a severely degraded policy that might never recover.
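As a toy illustration (not part of the original derivation), consider a two-action softmax policy: the same gradient direction is harmless with a small step but nearly saturates the policy, and blows up the KL divergence, with a large one.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical 2-action softmax policy with logits theta.
theta = np.array([0.0, 0.0])      # pi = [0.5, 0.5]
grad = np.array([1.0, -1.0])      # some estimated gradient direction

for step in (0.1, 5.0):
    p_old = softmax(theta)
    p_new = softmax(theta + step * grad)
    kl = np.sum(p_old * np.log(p_old / p_new))
    print(f"step={step}: new policy={np.round(p_new, 5)}, KL(old || new)={kl:.3f}")
```

With step 0.1 the new policy is roughly [0.55, 0.45] and the KL is about 0.005; with step 5.0 it is roughly [0.99995, 0.00005] and the KL is about 4.3, even though the gradient direction is identical.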

2. Challenges with Step Sizing

A single fixed learning rate rarely works throughout training: a step that is safe early on may be far too large later, and a conservative step wastes samples. Worse, the same step in parameter space can produce very different changes in the policy's behavior depending on where the parameters currently are, which motivates measuring update size in terms of the policy's output distribution rather than raw parameter distance.

3. Trust Regions

A trust region is a bounded neighborhood around the old policy within which the local approximation behind the gradient step is trusted to be accurate. We constrain the policy update to stay inside this region, preventing drastic changes in behavior.
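Concretely, the TRPO-style formulation (previewed here and developed in the next lesson) maximizes the surrogate subject to a bound on the average KL divergence between old and new action distributions, where $\delta$ is the trust-region radius and $\pi_\theta(\cdot \mid s)$ denotes the per-state action distribution:

$$\max_{\theta}\; L_s(\theta) \quad \text{subject to} \quad \mathbb{E}_{s \sim \pi_{\theta_\text{old}}}\Bigl[ D_\text{KL}\bigl(\pi_{\theta_\text{old}}(\cdot \mid s)\,\big\|\,\pi_\theta(\cdot \mid s)\bigr) \Bigr] \le \delta.$$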

3a. Benefits of Trust Regions

  1. Consistency: Updates remain within a region where the surrogate objective is accurate.
  2. Efficiency: Fewer failed updates and policy collapses mean less wasted environment interaction and compute.
  3. Empirical Success: TRPO and PPO demonstrate how bounding policy updates yields stable and effective on-policy optimization.

Summary & Next Steps

Up Next:

  1. Trust Region Policy Optimization (TRPO): Formalizes KL-based trust regions, yielding stable updates with a theoretical monotonic-improvement guarantee under idealized assumptions.
  2. Proximal Policy Optimization (PPO): A more practical, often easier-to-tune variant that uses a clipped objective rather than explicit KL constraints.

By balancing data reuse (surrogate loss) and bounded updates (trust regions), on-policy RL methods can steadily improve performance without risking catastrophic failures—core principles behind state-of-the-art algorithms in policy gradient reinforcement learning.
