1. Introduction
The surrogate loss is a reformulation of the policy gradient objective that allows us to use previously collected trajectories while maintaining valid gradient estimates. It achieves this by leveraging importance sampling, which reweights trajectory contributions to account for differences between the current policy and the one under which the data was collected.
This reformulation helps keep policy updates both efficient and stable:
- Efficiency through Reuse: By incorporating data from older policies, the surrogate loss improves sample efficiency, making better use of previously collected trajectories.
- Stability through Constraints: The surrogate loss is often paired with mechanisms like clipping (in PPO) or trust regions (in TRPO) to limit updates, ensuring smooth and consistent learning progress.
Instead of directly optimizing the expected return of the new policy, the surrogate loss focuses on a proxy objective that aligns well with the true objective while being computationally and statistically more robust. This makes it a cornerstone for modern on-policy methods like TRPO and PPO.
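As a concrete illustration, here is a minimal PyTorch sketch of the clipped surrogate used in PPO. The per-step form with advantage weights follows common practice rather than anything derived below; the function name, the `clip_eps` default, and the assumption that old and new log-probabilities are already available are all illustrative.

```python
import torch

def ppo_clipped_surrogate(new_logp, old_logp, advantages, clip_eps=0.2):
    """Clipped surrogate objective in the style of PPO (illustrative sketch).

    new_logp:   log pi_theta(a_t | s_t) under the current policy (carries gradients)
    old_logp:   log pi_theta_old(a_t | s_t) recorded when the data was collected
    advantages: per-step weights (PPO commonly uses advantage estimates)
    """
    # Importance ratio pi_theta / pi_theta_old, computed in log space for stability.
    ratio = torch.exp(new_logp - old_logp.detach())
    # Unclipped and clipped terms; taking the elementwise minimum bounds how much
    # a single update can exploit large ratios.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Negate because optimizers minimize; we want to maximize the surrogate.
    return -torch.min(unclipped, clipped).mean()
```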
2. Deriving the Surrogate Loss
2a. Utility Function $ U(\theta) $
The goal of reinforcement learning is to optimize a policy $ \pi_\theta $ to maximize the expected return:
$$
U(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]
$$
Here, $ \tau $ is a trajectory (sequence of states, actions, and rewards), and $ R(\tau) $ is the cumulative reward.
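In practice, $ U(\theta) $ is estimated by averaging returns over sampled trajectories. Below is a minimal sketch, assuming each trajectory's rewards are stored as a 1-D PyTorch tensor; the optional discount factor is an extra not present in the formula above.

```python
import torch

def monte_carlo_return_estimate(reward_sequences, gamma=1.0):
    """Estimate U(theta) = E_{tau ~ pi_theta}[R(tau)] from sampled trajectories.

    reward_sequences: list of 1-D tensors, one per trajectory collected by
                      running the current policy pi_theta.
    gamma:            optional discount factor (1.0 reproduces the plain
                      cumulative reward used in the text).
    """
    returns = []
    for rewards in reward_sequences:
        # R(tau): (optionally discounted) cumulative reward of one trajectory.
        discounts = gamma ** torch.arange(len(rewards), dtype=rewards.dtype)
        returns.append(torch.sum(discounts * rewards))
    # Monte Carlo estimate: average return over the sampled trajectories.
    return torch.stack(returns).mean()
```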
Since collecting fresh trajectories from $ \pi_\theta $ for every update is expensive, we rewrite the objective with importance sampling, using an earlier behavior policy $ \pi_{\theta_\text{old}} $ (the policy that collected the data):
$$
U(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta_\text{old}}} \left[ \frac{\pi_\theta(\tau)}{\pi_{\theta_\text{old}}(\tau)} R(\tau) \right]
$$
The term $\frac{\pi_\theta(\tau)}{\pi_{\theta_\text{old}}(\tau)}$ adjusts for the fact that samples come from $ \pi_{\theta_\text{old}} $, not $ \pi_\theta $.
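Because the environment dynamics appear in both the numerator and denominator of $ \frac{\pi_\theta(\tau)}{\pi_{\theta_\text{old}}(\tau)} $, they cancel, leaving only the product of per-step action probabilities. The sketch below assumes those per-step log-probabilities were stored at collection time and computes the ratio in log space for numerical stability.

```python
import torch

def trajectory_importance_ratio(new_logps, old_logps):
    """Ratio pi_theta(tau) / pi_theta_old(tau) for a single trajectory.

    The transition dynamics cancel between numerator and denominator,
    so only the per-step action-probability ratios remain.

    new_logps: tensor of log pi_theta(a_t | s_t) for each step of tau
    old_logps: tensor of log pi_theta_old(a_t | s_t) recorded at collection time
    """
    # Sum of log-ratios == log of the product of per-step ratios.
    log_ratio = torch.sum(new_logps - old_logps)
    return torch.exp(log_ratio)
```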
2b. Gradient of $ U(\theta) $
To improve $ \pi_\theta $, we calculate the gradient:
$$
\nabla_\theta U(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta_\text{old}}} \left[ \nabla_\theta \frac{\pi_\theta(\tau)}{\pi_{\theta_\text{old}}(\tau)} R(\tau) \right]
$$
Expanding the gradient of the ratio with the log-derivative trick (the denominator does not depend on $ \theta $):
$$
\nabla_\theta \frac{\pi_\theta(\tau)}{\pi_{\theta_\text{old}}(\tau)} = \frac{\nabla_\theta \pi_\theta(\tau)}{\pi_{\theta_\text{old}}(\tau)} = \frac{\pi_\theta(\tau)}{\pi_{\theta_\text{old}}(\tau)} \nabla_\theta \log \pi_\theta(\tau)
$$
Substituting back:
$$
\nabla_\theta U(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta_\text{old}}} \left[ \frac{\pi_\theta(\tau)}{\pi_{\theta_\text{old}}(\tau)} \nabla_\theta \log \pi_\theta(\tau) R(\tau) \right]
$$
Evaluated at $ \theta = \theta_\text{old} $, the ratio equals 1 and this reduces to the familiar policy-gradient estimator $ \mathbb{E}_{\tau \sim \pi_{\theta_\text{old}}}[\nabla_\theta \log \pi_\theta(\tau) R(\tau)] $.
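To make the expansion concrete, here is a small numerical check using a hypothetical two-action softmax policy parameterized directly by its logits (purely illustrative, not part of the derivation above):

```python
import torch

# Check the expansion numerically: the gradient of the ratio
# pi_theta / pi_theta_old equals the ratio times grad log pi_theta.
theta = torch.tensor([0.3, -0.1], requires_grad=True)   # current policy logits
theta_old = torch.tensor([0.5, 0.0])                     # behavior policy logits
action = 1  # an action sampled while pi_theta_old collected the data

log_pi = torch.log_softmax(theta, dim=0)[action]          # log pi_theta(a)
log_pi_old = torch.log_softmax(theta_old, dim=0)[action]  # constant in theta

ratio = torch.exp(log_pi - log_pi_old)
grad_ratio = torch.autograd.grad(ratio, theta, retain_graph=True)[0]
grad_log_pi = torch.autograd.grad(log_pi, theta)[0]

# The two sides of the expansion agree.
print(torch.allclose(grad_ratio, ratio.detach() * grad_log_pi))  # True
```

The same identity is what lets automatic differentiation of the ratio-weighted objective reproduce the estimator above.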
2c. Surrogate Loss Function
Rather than differentiating by hand, we can treat the importance-sampled objective itself as the quantity to optimize. This is the surrogate loss:
$$
L(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta_\text{old}}} \left[ \frac{\pi_\theta(\tau)}{\pi_{\theta_\text{old}}(\tau)} R(\tau) \right]
$$
Its gradient at $ \theta = \theta_\text{old} $ coincides with the true policy gradient, so small steps that increase $ L(\theta) $ also tend to increase $ U(\theta) $. Because the importance ratio becomes unreliable once $ \pi_\theta $ drifts far from $ \pi_{\theta_\text{old}} $, TRPO bounds the update with a trust region and PPO clips the ratio.
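A minimal sketch of this unclipped surrogate as a differentiable loss, assuming per-step log-probabilities and returns are batched into tensors with the shapes noted in the docstring:

```python
import torch

def surrogate_loss(new_logps, old_logps, trajectory_returns):
    """Unclipped surrogate objective L(theta), written as a loss to minimize.

    new_logps:          tensor [num_traj, T] of log pi_theta(a_t | s_t),
                        computed under the current parameters (carries gradients)
    old_logps:          tensor [num_traj, T] recorded when the data was collected
    trajectory_returns: tensor [num_traj] of R(tau)
    """
    # Per-trajectory ratio pi_theta(tau) / pi_theta_old(tau): the dynamics
    # cancel, so only the per-step action log-probabilities remain.
    log_ratios = (new_logps - old_logps.detach()).sum(dim=1)
    ratios = torch.exp(log_ratios)
    # Importance-sampled estimate of U(theta); negated so that a standard
    # optimizer step (which minimizes) increases the surrogate.
    return -(ratios * trajectory_returns).mean()
```

Calling `.backward()` on this loss yields the importance-weighted gradient estimate from 2b; the clipped version shown in the introduction additionally bounds the ratio before averaging.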
3. Why It Works
- Sample Efficiency: Reuses trajectories from $ \pi_{\theta_\text{old}} $, reducing the need for frequent rollouts.
- Stability: When paired with clipping or trust-region constraints, it keeps the updated policy close to $ \pi_{\theta_\text{old}} $, the regime in which the surrogate remains a reliable proxy for the true objective.
- Flexibility: The importance ratio $\frac{\pi_\theta}{\pi_{\theta_\text{old}}}$ makes the objective differentiable in $ \theta $ on fixed data, so any differentiable policy parameterization can be optimized with standard gradient methods.
This surrogate loss is the foundation for modern policy optimization techniques like TRPO and PPO.