1. Introduction
Proximal Policy Optimization (PPO) addresses TRPO’s practical difficulties. By incorporating a KL penalty or clipped objective, PPO eliminates the need for second-order optimization while keeping the essential idea of bounded policy updates.
This yields an algorithm that’s:
- Easier to implement.
- Computationally efficient.
- Surprisingly effective across a wide range of tasks.
2. PPO v1: KL Penalty Formulation
2a. Embedding KL in the Objective
Instead of a constrained optimization, PPO v1 adds a penalty term:
$$\max_{\pi}\;\mathbb{E}_{\pi_{\text{old}}}\Bigl[\frac{\pi(a\mid s)}{\pi_{\text{old}}(a\mid s)}\,A^{\pi_{\text{old}}}(s,a)\;-\;\beta \,\mathrm{KL}\bigl(\pi_{\text{old}} \,\|\, \pi\bigr)\Bigr],$$
where $\beta$ is a penalty coefficient that controls how heavily the KL divergence is penalized.
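As a concrete reference, here is a minimal PyTorch sketch of this penalized objective, written as a loss to minimize; the function name, its arguments, and the simple sample-based KL estimate are illustrative assumptions rather than anything specified above:

```python
import torch

def ppo_kl_penalty_loss(new_log_probs, old_log_probs, advantages, beta):
    """PPO v1 surrogate with a KL penalty, negated so it can be minimized.

    new_log_probs / old_log_probs: log pi(a|s) for the sampled actions under
    the current policy and the data-collecting policy pi_old, respectively.
    advantages: estimates of A^{pi_old}(s, a) for the same samples.
    """
    ratio = torch.exp(new_log_probs - old_log_probs)   # pi(a|s) / pi_old(a|s)
    surrogate = (ratio * advantages).mean()
    # Monte Carlo estimate of KL(pi_old || pi) from actions sampled by pi_old.
    approx_kl = (old_log_probs - new_log_probs).mean()
    return -(surrogate - beta * approx_kl)
```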
2b. Practical Flow
- Roll out trajectories with $\pi_{\text{old}}$.
- Compute advantage estimates $A^{\pi_{\text{old}}}$.
- Optimize the penalized surrogate with a first-order method (e.g., Adam).
- Adjust $\beta$ dynamically if the KL gets too large or too small (a sketch of this heuristic follows below).
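The last step is usually handled with the adaptive-KL heuristic from the PPO paper; the target value and the factors of 1.5 and 2 below are the conventional defaults, not values given in this text:

```python
def adapt_beta(beta, measured_kl, kl_target=0.01):
    """Grow or shrink the penalty coefficient so the measured KL divergence
    stays near kl_target (PPO v1's adaptive-KL heuristic)."""
    if measured_kl > 1.5 * kl_target:
        beta *= 2.0        # policy moved too far: penalize KL more heavily
    elif measured_kl < kl_target / 1.5:
        beta /= 2.0        # policy barely moved: relax the penalty
    return beta
```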
2c. Advantages & Drawbacks
- Advantage: simpler than TRPO’s second-order update and line search.
- Drawback: tuning $\beta$ dynamically can be tricky.
- Drawback: the KL divergence still has to be tracked and controlled at every iteration.
3. PPO v2: Clipped Surrogate Loss
PPO v2 refines the idea further by clipping the policy ratio:
$$r_\theta(a\mid s)= \frac{\pi_\theta(a\mid s)}{\pi_{\text{old}}(a\mid s)}.$$
Then the clipped objective is:
$$L^\text{CLIP}(\theta)= \mathbb{E}\Bigl[\min\Bigl(r_\theta(a\mid s)\,A(s,a),\;\mathrm{clip}(r_\theta(a\mid s),\,1-\epsilon,\, 1+\epsilon)\,\times\,A(s,a)\Bigr)\Bigr].$$
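In code the clipped surrogate is only a few lines. The PyTorch sketch below assumes the log-probabilities and advantage estimates have already been computed; the function and argument names are illustrative:

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    """PPO v2 clipped surrogate, returned as a loss to minimize."""
    ratio = torch.exp(new_log_probs - old_log_probs)            # r_theta(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Take the more conservative (smaller) of the two terms, then negate.
    return -torch.min(unclipped, clipped).mean()
```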
3a. Intuition
- Clipping ensures $r_\theta$ stays close to 1 (i.e., new policy near the old one).
- $\epsilon$ is a hyperparameter (e.g., 0.2) controlling how far the ratio can deviate.
- The final objective takes the more conservative (smaller) of the unclipped and clipped terms.
3b. Why Clipping Helps
- No separate penalty or line search needed—just a straightforward loss function.
- Stable updates: if the policy tries to move too far, the clip “caps” the incentive, preventing destructive policy shifts (see the numeric check below).
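A quick numeric check makes the capping effect concrete; the values below ($\epsilon = 0.2$, a positive advantage, a ratio of 1.5) are chosen purely for illustration:

```python
import torch

eps = 0.2
advantage = 1.0
ratio = torch.tensor(1.5, requires_grad=True)   # already outside [1 - eps, 1 + eps]

objective = torch.min(ratio * advantage,
                      torch.clamp(ratio, 1 - eps, 1 + eps) * advantage)
objective.backward()

print(objective.item())   # capped at (1 + eps) * advantage = 1.2
print(ratio.grad.item())  # 0.0 -> no incentive to push the ratio any further
```

With a negative advantage the same logic applies at the $1-\epsilon$ boundary.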
4. PPO Algorithm Flow
- Collect Trajectories under $\pi_{\text{old}}$.
- Compute rewards-to-go, advantages, etc.
- Optimize clipped objective in minibatches for a few epochs:
$$ L^\text{CLIP}(\theta)=\;\mathbb{E}\Bigl[\min\bigl(r_\theta\,A,\,\mathrm{clip}(r_\theta,\,1-\epsilon,1+\epsilon)\times A\bigr)\Bigr].$$
- Update $\theta$ (one full update iteration is sketched below).
- Repeat until convergence.
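Putting these steps together, a single PPO update might look like the sketch below. The `policy` interface, the contents of `batch`, and the advantage normalization are assumptions made for illustration, and `ppo_clip_loss` is the helper sketched in Section 3:

```python
import torch

def ppo_update(policy, optimizer, batch, num_epochs=10, minibatch_size=64, eps=0.2):
    """One PPO iteration: several epochs of minibatch SGD on the clipped loss.

    batch is assumed to hold tensors collected under pi_old:
    "obs", "actions", "old_log_probs", and "advantages".
    """
    obs, actions = batch["obs"], batch["actions"]
    old_log_probs, advantages = batch["old_log_probs"], batch["advantages"]

    # Normalizing advantages is a common (optional) stabilizer.
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

    n = obs.shape[0]
    for _ in range(num_epochs):
        perm = torch.randperm(n)
        for start in range(0, n, minibatch_size):
            idx = perm[start:start + minibatch_size]
            # policy(obs) is assumed to return a torch.distributions object.
            dist = policy(obs[idx])
            new_log_probs = dist.log_prob(actions[idx])

            loss = ppo_clip_loss(new_log_probs, old_log_probs[idx],
                                 advantages[idx], eps=eps)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```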
5. Advantages of PPO
- Easier Implementation: Avoids second-order methods or dynamic KL penalties.
- Robust: Empirically effective across diverse tasks (Lunar Lander, Atari, MuJoCo, robotics).
- Scalable: Integrates well with modern deep learning frameworks.
6. Summary
PPO retains TRPO’s core principle of bounded policy updates but simplifies it via a clipped surrogate. This approach has made PPO one of the most popular and successful on-policy RL algorithms:
- TRPO → Strict trust region via second-order method & line search.
- PPO v1 → First-order method with a KL penalty.
- PPO v2 → Clipped surrogate objective for implicit trust regions.
Because of its balance of simplicity and performance, PPO has become a go-to method for many practical RL challenges.
Final Notes
- TRPO introduced a trust-region approach that guarantees stable policy improvements but requires second-order optimization.
- PPO simplified the approach by incorporating the KL constraint into the objective itself—either as a penalty (PPO v1) or by clipping (PPO v2).
- Both algorithms build upon the concept of a surrogate loss and bounded updates to ensure stable on-policy learning.
In practice, PPO often wins out for its ease of implementation, while TRPO provides a more theoretically grounded guarantee of monotonic improvement—though at higher computational cost.