Surrogate Loss

Reinforcement Learning

Last updated: November 25, 2024

1. Introduction

The surrogate loss is a reformulation of the policy gradient objective that allows us to use previously collected trajectories while maintaining valid gradient estimates. It achieves this by leveraging importance sampling, which reweights trajectory contributions to account for differences between the current policy and the one under which the data was collected.

This technique ensures that policy updates remain both efficient and stable:

  1. Efficiency through Reuse: By incorporating data from older policies, the surrogate loss improves sample efficiency, making better use of previously collected trajectories.
  2. Stability through Constraints: The surrogate loss is often paired with mechanisms like clipping (in PPO) or trust regions (in TRPO) to limit updates, ensuring smooth and consistent learning progress.

Instead of directly optimizing the expected return of the new policy, the surrogate loss focuses on a proxy objective that aligns well with the true objective while being computationally and statistically more robust. This makes it a cornerstone for modern on-policy methods like TRPO and PPO.

2. Deriving the Surrogate Loss

2a. Utility Function $ U(\theta) $

The goal of reinforcement learning is to optimize a policy $ \pi_\theta $ to maximize the expected return:

$$
 U(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] 
$$

Here, $ \tau $ is a trajectory (sequence of states, actions, and rewards), and $ R(\tau) $ is the cumulative reward.
Since directly sampling from $ \pi_\theta $ for every update is inefficient, we use importance sampling with a prior policy $ \pi_{\theta_\text{old}} $:

$$
U(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta_\text{old}}} \left[ \frac{\pi_\theta(\tau)}{\pi_{\theta_\text{old}}(\tau)} R(\tau) \right] 
$$

The term $\frac{\pi_\theta(\tau)}{\pi_{\theta_\text{old}}(\tau)}$ adjusts for the fact that samples come from $ \pi_{\theta_\text{old}} $, not $ \pi_\theta $.
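As a concrete illustration, here is a minimal NumPy sketch of this importance-sampled estimate of $ U(\theta) $. It assumes we have stored, for each trajectory collected under $ \pi_{\theta_\text{old}} $, the summed log-probabilities of its actions under both policies (the environment dynamics terms cancel in the ratio, so only the policy terms remain) and its return; the array names and dummy values are illustrative, not part of the original text.

```python
import numpy as np

# Illustrative stored rollout data from pi_theta_old (names are assumptions):
# logp_new[i] = sum_t log pi_theta(a_t | s_t)      for trajectory i
# logp_old[i] = sum_t log pi_theta_old(a_t | s_t)  for trajectory i
# returns[i]  = R(tau_i), the cumulative reward of trajectory i
rng = np.random.default_rng(0)
logp_new = rng.normal(-10.0, 1.0, size=128)   # dummy placeholder values
logp_old = rng.normal(-10.0, 1.0, size=128)
returns = rng.normal(5.0, 2.0, size=128)

# Trajectory-level importance ratio pi_theta(tau) / pi_theta_old(tau).
# The transition probabilities p(s' | s, a) cancel, leaving only policy terms.
ratios = np.exp(logp_new - logp_old)

# Monte Carlo estimate of U(theta) using trajectories sampled from pi_theta_old.
U_hat = np.mean(ratios * returns)
print(f"importance-sampled estimate of U(theta): {U_hat:.3f}")
```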

2b. Gradient of $ U(\theta) $

To improve $ \pi_\theta $, we calculate the gradient:

$$
 \nabla_\theta U(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta_\text{old}}} \left[ \nabla_\theta \frac{\pi_\theta(\tau)}{\pi_{\theta_\text{old}}(\tau)} R(\tau) \right] 
$$

Expanding the gradient of the importance ratio with the likelihood-ratio (log-derivative) identity $ \nabla_\theta \pi_\theta(\tau) = \pi_\theta(\tau) \nabla_\theta \log \pi_\theta(\tau) $:

$$
 \nabla_\theta \frac{\pi_\theta(\tau)}{\pi_{\theta_\text{old}}(\tau)} = \frac{\nabla_\theta \pi_\theta(\tau)}{\pi_{\theta_\text{old}}(\tau)} = \frac{\pi_\theta(\tau)}{\pi_{\theta_\text{old}}(\tau)} \nabla_\theta \log \pi_\theta(\tau) 
$$

Substituting back and evaluating at $ \theta = \theta_\text{old} $, where the ratio equals 1, recovers the familiar policy gradient:

$$
 \nabla_\theta U(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta_\text{old}}} \left[ \nabla_\theta \log \pi_\theta(\tau) R(\tau) \right] 
$$
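This is exactly the quantity estimated in practice with automatic differentiation: average $ \nabla_\theta \log \pi_\theta(\tau) R(\tau) $ over sampled trajectories. Below is a minimal PyTorch sketch for a discrete-action policy; the network shape, batch data, and variable names are illustrative assumptions rather than part of the derivation above.

```python
import torch
import torch.nn as nn

# Tiny illustrative policy network for a discrete action space (assumed shapes).
policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))

# Dummy batch standing in for stored (state, action) pairs; returns[i] holds
# R(tau) of the trajectory that pair came from (all values are placeholders).
states = torch.randn(64, 4)
actions = torch.randint(0, 2, (64,))
returns = torch.randn(64)

# log pi_theta(a_t | s_t) for the actions that were actually taken.
log_probs = torch.log_softmax(policy(states), dim=-1)
logp = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)

# Summing grad log pi_theta(a_t | s_t) * R(tau) over timesteps reproduces the
# trajectory-level term grad log pi_theta(tau) * R(tau), so minimizing this
# loss performs gradient ascent on the policy-gradient estimate.
loss = -(logp * returns).mean()
loss.backward()
```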

2c. Surrogate Loss Function

The surrogate loss offers a practical way to optimize the policy while reusing data collected from a previous policy $\pi_{\theta_\text{old}}$. It reformulates the objective to:

$$
 L_s(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta_\text{old}}} \left[ \frac{\pi_\theta(\tau)}{\pi_{\theta_\text{old}}(\tau)} R(\tau) \right]  
$$

Here, the term $\frac{\pi_\theta(\tau)}{\pi_{\theta_\text{old}}(\tau)}$ (the importance sampling ratio) adjusts the contribution of each trajectory to reflect the difference between the current policy $ \pi_\theta $ and the policy $ \pi_{\theta_\text{old}} $ under which the trajectory was collected. This ensures the gradient estimates remain unbiased while enabling reuse of previously sampled data.

In practice, the surrogate loss:

  1. Improves Sample Efficiency: By leveraging data from earlier policies, it reduces the number of new samples required during training.
  2. Enables Stable Policy Updates: Modern implementations like PPO and TRPO introduce mechanisms (e.g., clipping or trust regions) to limit how far the policy $ \pi_\theta $ can deviate from $ \pi_{\theta_\text{old}} $ in a single update. This avoids overly aggressive updates that could destabilize learning.

Thus, $L_s(\theta)$ serves as an approximation to the true objective $U(\theta)$, balancing fidelity to the original goal with computational efficiency and stability. This reformulation is the key to the success of modern policy optimization algorithms.
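To make the stability mechanism concrete, here is a minimal PyTorch sketch of the clipped surrogate loss used by PPO, applied per timestep with an advantage estimate standing in for $ R(\tau) $ (a common practical choice); the tensor names and the clip range of 0.2 are illustrative assumptions.

```python
import torch

def clipped_surrogate_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate, negated so it can be minimized.

    logp_new:   log pi_theta(a_t | s_t) under the current policy (requires grad)
    logp_old:   log pi_theta_old(a_t | s_t) recorded at collection time
    advantages: advantage estimates standing in for R(tau)
    """
    # Per-timestep importance sampling ratio pi_theta / pi_theta_old.
    ratio = torch.exp(logp_new - logp_old.detach())

    # Compare the unclipped surrogate with one whose ratio is clipped to
    # [1 - eps, 1 + eps]; taking the elementwise minimum removes any incentive
    # to push the ratio far from 1 in a single update.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Dummy tensors illustrating usage (placeholders, not real rollout data).
logp_new = torch.randn(256, requires_grad=True)
logp_old = torch.randn(256)
advantages = torch.randn(256)
loss = clipped_surrogate_loss(logp_new, logp_old, advantages)
loss.backward()
```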

3. Why It Works

  1. Sample Efficiency: Reuses trajectories from $ \pi_{\theta_\text{old}} $, reducing the need for frequent rollouts.
  2. Stability: Constrains policy updates, ensuring smoother learning.
  3. Flexibility: The importance sampling ratio $\frac{\pi_\theta}{\pi_{\theta_\text{old}}}$ lets the objective be evaluated for any candidate policy $ \pi_\theta $ using data collected under $ \pi_{\theta_\text{old}} $.

This surrogate loss is the foundation for modern policy optimization techniques like TRPO and PPO.
