Trust Region Policy Optimization (TRPO)

Deep Reinforcement Learning

Last updated: January 01, 2025

1. Introduction

In the last few lessons, we examined how methods such as A3C, VPG, and GAE can struggle with stability and efficiency when updating a policy. We also introduced surrogate losses and trust regions as ways to guide policy optimization and ensure controlled changes.

Now, we explore Trust Region Policy Optimization (TRPO), a game-changing on-policy algorithm that combines:

  • A surrogate loss objective (for efficient updates), and
  • A KL divergence constraint (for stable, trust-region–like policy changes).

By maintaining a trust region, TRPO ensures that each update to the policy is significant yet safe, preventing performance from collapsing due to overly large steps.

2. TRPO Fundamentals

2a. Key Objective: Surrogate Loss

TRPO aims to maximize a surrogate objective: $$\max_{\pi} \; L(\pi)\;=\;\mathbb{E}_{\pi_{\text{old}}}\Bigl[\frac{\pi(a \mid s)}{\pi_{\text{old}}(a \mid s)}\;A^{\pi_{\text{old}}}(s,a)\Bigr]$$

  • $\pi(a \mid s)$ and $\pi_{\text{old}}(a \mid s)$: The new vs. old policy probabilities.
  • $A^{\pi_{\text{old}}}(s,a)$: The advantage function under the old policy, indicating how much better (or worse) an action is compared to the baseline.
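
The surrogate can be estimated directly from a batch of transitions collected under $\pi_{\text{old}}$. Below is a minimal PyTorch sketch; the tensor names and the assumption that per-action log-probabilities are available are illustrative, not part of TRPO itself.

```python
import torch

def surrogate_loss(new_log_probs, old_log_probs, advantages):
    """Sample-based estimate of L(pi) = E_old[(pi / pi_old) * A].

    new_log_probs: log pi(a|s) under the candidate policy, shape (N,)
    old_log_probs: log pi_old(a|s) recorded at rollout time, shape (N,)
    advantages:    advantage estimates A^{pi_old}(s, a), shape (N,)
    """
    # Importance ratio pi(a|s) / pi_old(a|s), computed in log space for stability.
    ratio = torch.exp(new_log_probs - old_log_probs.detach())
    # Monte Carlo estimate of the surrogate objective (to be maximized).
    return (ratio * advantages).mean()
```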

2b. Constraint: KL Divergence

To ensure the new policy stays close to the old one, TRPO enforces: $$\mathbb{E}_{\pi_{\text{old}}}\bigl[\mathrm{KL}(\pi \,\|\, \pi_{\text{old}})\bigr]\;\leq\;\epsilon,$$

where $\epsilon$ is a small threshold. This prevents large changes that risk destabilizing the policy.
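
For a discrete action space, this constraint can be checked by averaging a per-state KL over sampled states. The sketch below follows the ordering KL$(\pi \,\|\, \pi_{\text{old}})$ used above and assumes both policies expose action probabilities for the sampled states; the threshold value is only illustrative.

```python
import torch

def mean_kl(new_probs, old_probs, eps=1e-8):
    """Average KL(pi || pi_old) over sampled states (discrete actions).

    new_probs, old_probs: action probabilities, shape (N, num_actions).
    """
    kl = (new_probs * (torch.log(new_probs + eps) - torch.log(old_probs + eps))).sum(dim=-1)
    return kl.mean()

# A candidate update is acceptable only if the sampled KL stays inside the trust region,
# e.g. mean_kl(new_probs, old_probs) <= 0.01 for a typical epsilon.
```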

3. TRPO Algorithm Flow

  1. Collect Trajectories (T steps or N episodes)

    • Gather $\{(s_t, a_t, r_t)\}$ from the environment using the current policy $\pi_{\text{old}}$.
  2. Estimate Advantages

    • Compute $A^{\pi_{\text{old}}}(s,a)$ using a baseline (value function) or GAE.
  3. Compute Policy Gradient

    $$g\;=\;\nabla_\theta\,\mathbb{E}_{\pi_{\text{old}}}\Bigl[\frac{\pi_\theta(a \mid s)}{\pi_{\text{old}}(a \mid s)}\,A^{\pi_{\text{old}}}(s,a)\Bigr]\Big|_{\theta=\theta_{\text{old}}},$$ i.e., the gradient of the surrogate objective evaluated at the current parameters $\theta_{\text{old}}$.
  4. Approximate the Fisher Matrix (F) and Run Conjugate Gradient

    • TRPO uses a second-order approach: the Fisher matrix (the Hessian of the KL divergence at $\theta_{\text{old}}$) defines the trust region, and conjugate gradient solves $F\,\delta\theta = g$ using Fisher-vector products, so $F$ never has to be formed explicitly.
  5. Solve for $\delta\theta$ and Line Search

    • Update the policy parameters $\theta \leftarrow \theta + \alpha\,\delta\theta$, using a backtracking line search to keep the sampled KL below $\epsilon$ (see the sketch after this list).
  6. Repeat

    • Continue until convergence or max iterations.
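
To make steps 4 and 5 concrete, here is a minimal PyTorch-style sketch of the conjugate-gradient solve and the backtracking line search. The helpers `surrogate`, `mean_kl`, `fvp`, `get_flat_params`, and `set_flat_params` are assumed to be supplied by the training loop; the names and hyperparameter values are illustrative, not a definitive implementation.

```python
import torch

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g, where fvp(v) returns the Fisher-vector product F v."""
    x = torch.zeros_like(g)
    r = g.clone()              # residual g - F x (x starts at zero)
    p = g.clone()              # search direction
    rs_old = r.dot(r)
    for _ in range(iters):
        Fp = fvp(p)
        alpha = rs_old / (p.dot(Fp) + 1e-10)
        x = x + alpha * p
        r = r - alpha * Fp
        rs_new = r.dot(r)
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

def trpo_update(get_flat_params, set_flat_params, surrogate, mean_kl, g, fvp,
                max_kl=1e-2, backtrack_coeff=0.8, max_backtracks=10):
    """Steps 4-5: compute the step direction, scale it to the trust region, line-search."""
    step_dir = conjugate_gradient(fvp, g)                # x ~ F^{-1} g
    # Largest scale beta such that 0.5 * (beta*x)^T F (beta*x) <= max_kl.
    shs = 0.5 * step_dir.dot(fvp(step_dir))
    full_step = torch.sqrt(max_kl / (shs + 1e-10)) * step_dir

    old_params = get_flat_params()
    old_surrogate = surrogate()
    # Backtracking line search: shrink the step until the surrogate improves
    # and the sampled KL stays inside the trust region.
    for i in range(max_backtracks):
        set_flat_params(old_params + (backtrack_coeff ** i) * full_step)
        if surrogate() > old_surrogate and mean_kl() <= max_kl:
            return True                                  # update accepted
    set_flat_params(old_params)                          # revert if no step qualifies
    return False
```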

4. Evaluating KL Divergence in TRPO

  • Trajectory Probability: $$P(\tau;\theta) = P(s_0)\,\prod_{t=0}^{H-1}\,\pi_\theta(a_t|s_t)\,P(s_{t+1}\mid s_t,a_t)$$
  • Key Simplification: Environment dynamics cancel out of the trajectory-level KL($\pi\|\pi_{\text{old}}$), because the initial-state and transition probabilities are identical under the new and old policies (see the derivation below).
  • Sample-Based Approximation: In practice, TRPO estimates the expected KL over the states visited in the rollouts collected with $\pi_{\text{old}}$.
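
To see why the dynamics cancel, expand the log-ratio of trajectory probabilities using the factorization above:

$$\log\frac{P(\tau;\theta)}{P(\tau;\theta_{\text{old}})} \;=\; \log\frac{P(s_0)\,\prod_{t=0}^{H-1}\pi_\theta(a_t\mid s_t)\,P(s_{t+1}\mid s_t,a_t)}{P(s_0)\,\prod_{t=0}^{H-1}\pi_{\theta_{\text{old}}}(a_t\mid s_t)\,P(s_{t+1}\mid s_t,a_t)} \;=\; \sum_{t=0}^{H-1}\log\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\text{old}}}(a_t\mid s_t)}.$$

The initial-state distribution and the transition terms appear identically in the numerator and denominator, so the KL between trajectory distributions reduces to a sum of per-step policy log-ratios, which can be evaluated on the sampled $(s_t, a_t)$ pairs.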

5. Limitations of TRPO

  1. Second-Order Optimization: Requires computing or approximating the Fisher matrix—this can be computationally heavy.
  2. Complex Implementation: Conjugate gradient steps and line searches add overhead.
  3. Limited Flexibility: Handling more advanced network architectures or stochastic layers can be tricky within TRPO’s second-order framework.

6. Summary

TRPO revolutionized on-policy policy gradients by ensuring stable updates via a KL constraint. Its reliance on a second-order solver, however, makes it more complex. This led to the development of Proximal Policy Optimization (PPO), which approximates TRPO’s trust region constraints through simpler, first-order clipping, making it easier to implement and tune.
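
For contrast with TRPO's constrained update, here is a minimal sketch of the clipped surrogate that PPO maximizes instead; the function name and the 0.2 clip range are illustrative defaults, not prescribed by this lesson.

```python
import torch

def ppo_clipped_surrogate(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """PPO's first-order alternative: clip the importance ratio instead of constraining KL."""
    ratio = torch.exp(new_log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Take the pessimistic (minimum) objective so large ratio changes earn no extra credit.
    return torch.min(ratio * advantages, clipped * advantages).mean()
```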
