1. Introduction
In the last few lessons, we examined policy-gradient methods such as A3C and VPG, along with Generalized Advantage Estimation (GAE), and saw how policy updates can struggle with stability and efficiency. We also introduced surrogate losses and trust regions as ways to guide policy optimization and ensure controlled changes.
Now, we explore Trust Region Policy Optimization (TRPO)—a game-changing on-policy algorithm that combines:
- A surrogate loss objective (for efficient updates), and
- A KL divergence constraint (for stable, trust-region–like policy changes).
By maintaining a trust region, TRPO ensures that each update to the policy is significant yet safe, preventing performance from collapsing due to overly large steps.
2. TRPO Fundamentals
2a. Key Objective: Surrogate Loss
TRPO aims to maximize a surrogate objective:$$\max_{\pi} \; L(\pi)\;=\;\mathbb{E}_{\pi_{\text{old}}}\Bigl[\frac{\pi(a \mid s)}{\pi_{\text{old}}(a \mid s)}\;A^{\pi_{\text{old}}}(s,a)\Bigr]$$
- $\pi(a \mid s)$ and $\pi_{\text{old}}(a \mid s)$: The new vs. old policy probabilities.
- $A^{\pi_{\text{old}}}(s,a)$: The advantage function under the old policy, indicating how much better (or worse) an action is compared to the baseline.
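As a minimal sketch of this objective, assuming a PyTorch policy that exposes log-probabilities (the tensor names below are illustrative):

```python
import torch

def surrogate_loss(new_log_probs, old_log_probs, advantages):
    """Importance-weighted advantage, averaged over a batch collected under pi_old.

    new_log_probs: log pi(a|s) under the policy currently being optimized
    old_log_probs: log pi_old(a|s), recorded at collection time (treated as constant)
    advantages:    estimates of A^{pi_old}(s, a) for the same (s, a) pairs
    """
    # ratio = pi(a|s) / pi_old(a|s), computed in log space for numerical stability
    ratio = torch.exp(new_log_probs - old_log_probs.detach())
    # Negated because optimizers minimize; TRPO maximizes this expectation.
    return -(ratio * advantages).mean()
```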
2b. Constraint: KL Divergence
To ensure the new policy stays close to the old one, TRPO enforces: $$\mathbb{E}_{\pi_{\text{old}}}\bigl[\mathrm{KL}(\pi_{\text{old}} \,\|\, \pi)\bigr]\;\leq\;\epsilon,$$
where $\epsilon$ is a small threshold. This prevents large changes that risk destabilizing the policy.
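For a discrete action space, this constraint can be checked directly from the two policies' action distributions. A sketch, assuming categorical policies given as logits (names are illustrative):

```python
import torch
from torch.distributions import Categorical, kl_divergence

def mean_kl(old_logits, new_logits):
    """Average KL(pi_old || pi) over a batch of visited states."""
    old_dist = Categorical(logits=old_logits.detach())  # old policy is held fixed
    new_dist = Categorical(logits=new_logits)
    return kl_divergence(old_dist, new_dist).mean()

# Trust-region check: accept an update only if the mean KL stays below epsilon,
# e.g. mean_kl(old_logits, new_logits) <= 0.01
```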
3. TRPO Algorithm Flow
- Collect Trajectories (T steps or N episodes)
  - Gather $\{(s_t, a_t, r_t)\}$ from the environment using the current policy $\pi_{\text{old}}$.
- Estimate Advantages
  - Compute $A^{\pi_{\text{old}}}(s,a)$ using a baseline (value function) or GAE.
- Compute Policy Gradient
  $$g\;=\;\nabla_\theta\,\mathbb{E}_{\pi_{\text{old}}}\Bigl[\frac{\pi(a \mid s)}{\pi_{\text{old}}(a \mid s)}\,A^{\pi_{\text{old}}}(s,a)\Bigr].$$
- Approximate the Fisher Matrix ($F$) via Conjugate Gradient
  - TRPO takes a second-order approach: the Fisher information matrix (the Hessian of the KL divergence at $\theta_{\text{old}}$) defines the trust region, and conjugate gradient solves $F\,\delta\theta = g$ using only Fisher-vector products, never forming $F$ explicitly.
- Solve for $\delta\theta$ and Line Search
  - Update the policy parameters $\theta \leftarrow \theta + \alpha\,\delta\theta$, using a backtracking line search to keep the sampled KL below $\epsilon$ (see the sketch after this list).
- Repeat
  - Continue until convergence or a maximum number of iterations.
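Below is a minimal sketch of the Fisher-vector product, conjugate-gradient, and step-size pieces, assuming a PyTorch policy with parameter list `policy_params` and scalar tensors `surrogate` and `kl` computed from the current batch (these names, and the `damping` constant, are illustrative assumptions rather than part of the algorithm's math):

```python
import torch

def flat_grad(output, params, create_graph=False):
    """Gradient of a scalar w.r.t. a list of parameters, flattened into one vector."""
    grads = torch.autograd.grad(output, params,
                                create_graph=create_graph, retain_graph=True)
    return torch.cat([g.reshape(-1) for g in grads])

def fisher_vector_product(kl, params, v, damping=0.1):
    """Compute F v via double backprop; F is the Hessian of the mean KL at theta_old."""
    grad_kl = flat_grad(kl, params, create_graph=True)
    hvp = flat_grad((grad_kl * v).sum(), params)
    return hvp + damping * v  # small damping keeps the linear system well-conditioned

def conjugate_gradient(fvp, b, iters=10, tol=1e-10):
    """Approximately solve F x = b using only matrix-vector products with F."""
    x = torch.zeros_like(b)
    r, p = b.clone(), b.clone()
    rdotr = r.dot(r)
    for _ in range(iters):
        Ap = fvp(p)
        alpha = rdotr / p.dot(Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        new_rdotr = r.dot(r)
        if new_rdotr < tol:
            break
        p = r + (new_rdotr / rdotr) * p
        rdotr = new_rdotr
    return x

# Putting the pieces together (epsilon is the KL threshold from Section 2b):
#   g         = flat_grad(surrogate, policy_params)
#   x         = conjugate_gradient(lambda v: fisher_vector_product(kl, policy_params, v), g)
#   beta      = torch.sqrt(2 * epsilon / x.dot(fisher_vector_product(kl, policy_params, x)))
#   full_step = beta * x
# The backtracking line search then shrinks full_step until the sampled KL
# stays below epsilon and the surrogate objective actually improves.
```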
4. Evaluating KL Divergence in TRPO
- Trajectory Probability: $$P(\tau;\theta) = P(s_0)\,\prod_{t=0}^{H-1}\,\pi_\theta(a_t|s_t)\,P(s_{t+1}\mid s_t,a_t)$$
- Key Simplification: Environment dynamics cancel out in $\mathrm{KL}(\pi_{\text{old}}\,\|\,\pi)$ because the new and old policies share the same transition model (see the derivation below).
- Sample-Based Approximation: In practice, TRPO approximates $\mathrm{KL}$ using the same rollouts from $\pi_{\text{old}}$.
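To make the cancellation explicit: in the log-ratio of trajectory probabilities, $P(s_0)$ and every transition term $P(s_{t+1}\mid s_t,a_t)$ appear in both the numerator and the denominator, so only the policy terms survive:
$$\log\frac{P(\tau;\theta_{\text{old}})}{P(\tau;\theta)}\;=\;\sum_{t=0}^{H-1}\,\log\frac{\pi_{\text{old}}(a_t\mid s_t)}{\pi_\theta(a_t\mid s_t)}.$$
Averaging $\log\pi_{\text{old}}(a_t\mid s_t)-\log\pi_\theta(a_t\mid s_t)$ over the $(s_t,a_t)$ pairs collected under $\pi_{\text{old}}$ therefore gives the sample-based estimate of the mean KL used in the constraint.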
5. Limitations of TRPO
- Second-Order Optimization: Every update requires Fisher-vector products and a conjugate-gradient solve, which is computationally heavy compared to a plain gradient step.
- Complex Implementation: Conjugate gradient steps and line searches add overhead.
- Limited Flexibility: Architectures with stochastic layers (e.g., dropout) or parameter sharing between the policy and value networks are awkward to handle within TRPO's second-order framework.
6. Summary
TRPO revolutionized on-policy policy gradients by ensuring stable updates via a KL constraint. Its reliance on a second-order solver, however, makes it more complex. This led to the development of Proximal Policy Optimization (PPO), which approximates TRPO’s trust region constraints through simpler, first-order clipping, making it easier to implement and tune.
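For contrast, here is a sketch of the clipped surrogate PPO uses in place of TRPO's explicit constraint (same illustrative tensor names as above; the 0.2 clip range is a common default, not a requirement):

```python
import torch

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """First-order stand-in for the trust region: clip the probability ratio."""
    ratio = torch.exp(new_log_probs - old_log_probs.detach())
    clipped_ratio = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Pessimistic (element-wise minimum) objective, negated for minimization.
    return -torch.min(ratio * advantages, clipped_ratio * advantages).mean()
```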