Proximal Policy Optimization (PPO)

Deep Reinforcement Learning

Last updated: January 01, 2025

1. Introduction

Proximal Policy Optimization (PPO) addresses the practical difficulties of TRPO, chiefly its reliance on second-order optimization and a line search. By replacing the hard trust-region constraint with either a KL penalty or a clipped objective, PPO keeps the essential idea of bounded policy updates while requiring only first-order methods.

This yields an algorithm that’s:

  • Easier to implement.
  • Computationally efficient.
  • Surprisingly effective across a wide range of tasks.

2. PPO v1: KL Penalty Formulation

2a. Embedding KL in the Objective

Instead of solving a constrained optimization problem, PPO v1 moves the trust-region idea into the objective as a penalty term:

$$\max_{\pi}\;\mathbb{E}_{\pi_{\text{old}}}\Bigl[\frac{\pi(a\mid s)}{\pi_{\text{old}}(a\mid s)}\,A^{\pi_{\text{old}}}(s,a)\;-\;\beta \,\mathrm{KL}(\pi \,\|\, \pi_{\text{old}})\Bigr],$$

where $\beta$ is a penalty coefficient controlling how heavily KL divergence is punished.
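For concreteness, here is a minimal sketch of this penalized surrogate as a loss in PyTorch, assuming precomputed log-probabilities under the new and old policies plus advantage estimates for a batch sampled from $\pi_{\text{old}}$; the function and argument names are illustrative, and the KL term is a per-sample Monte Carlo estimate rather than an exact expectation.

```python
import torch

def kl_penalized_loss(new_log_probs: torch.Tensor,
                      old_log_probs: torch.Tensor,
                      advantages: torch.Tensor,
                      beta: float) -> torch.Tensor:
    """Negative of the PPO v1 penalized surrogate (illustrative sketch)."""
    ratio = torch.exp(new_log_probs - old_log_probs)   # pi(a|s) / pi_old(a|s)
    surrogate = ratio * advantages                      # importance-weighted advantage
    # Under samples from pi_old, ratio * log(ratio) is an unbiased per-sample
    # estimate of KL(pi || pi_old).
    kl_estimate = ratio * (new_log_probs - old_log_probs)
    # Maximizing the penalized surrogate == minimizing its negation.
    return -(surrogate - beta * kl_estimate).mean()
```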

2b. Practical Flow

  1. Roll Out $\pi_{\text{old}}$.
  2. Compute advantages $A^{\pi_{\text{old}}}$.
  3. Optimize the penalized surrogate with a first-order method (e.g., Adam).
  4. Adjust $\beta$ dynamically if the observed KL divergence gets too large or too small (a sketch of one common schedule follows this list).
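
Step 4 is often implemented with a simple multiplicative schedule. The sketch below follows the heuristic from the original PPO paper (halve or double $\beta$ when the measured KL falls well below or above a target); `observed_kl` and `kl_target` are assumed to be computed and chosen by the caller.

```python
def adapt_beta(beta: float, observed_kl: float, kl_target: float) -> float:
    """Adjust the KL penalty coefficient after a policy update (heuristic sketch)."""
    if observed_kl < kl_target / 1.5:
        beta /= 2.0   # KL too small: the penalty is over-constraining the update
    elif observed_kl > kl_target * 1.5:
        beta *= 2.0   # KL too large: penalize divergence more heavily
    return beta
```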

2c. Advantages & Drawbacks

  • Advantage: only a first-order optimizer is needed, which is far simpler than TRPO's second-order updates.
  • Drawback: tuning $\beta$ dynamically can be tricky.
  • Drawback: the KL divergence must still be measured and controlled at every iteration.

3. PPO v2: Clipped Surrogate Loss

PPO v2 refines the idea further by clipping the policy ratio:

$$r_\theta(a\mid s)= \frac{\pi_\theta(a\mid s)}{\pi_{\text{old}}(a\mid s)}.$$

Then the clipped objective is:

$$L^\text{CLIP}(\theta)= \mathbb{E}\Bigl[\min\Bigl(r_\theta(a\mid s)\,A(s,a),\;\mathrm{clip}(r_\theta(a\mid s),\,1-\epsilon,\, 1+\epsilon)\,\times\,A(s,a)\Bigr)\Bigr].$$
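
A minimal sketch of this objective as a PyTorch loss, under the same assumptions as the PPO v1 example above (precomputed log-probabilities and advantages for a batch from $\pi_{\text{old}}$; names are illustrative):

```python
import torch

def clipped_surrogate_loss(new_log_probs: torch.Tensor,
                           old_log_probs: torch.Tensor,
                           advantages: torch.Tensor,
                           epsilon: float = 0.2) -> torch.Tensor:
    """Negative of the PPO v2 clipped surrogate (illustrative sketch)."""
    ratio = torch.exp(new_log_probs - old_log_probs)                     # r_theta(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # Elementwise min picks the more conservative objective; negate to minimize.
    return -torch.min(unclipped, clipped).mean()
```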

3a. Intuition

  • Clipping ensures $r_\theta$ stays close to 1 (i.e., new policy near the old one).
  • $\epsilon$ is a hyperparameter (e.g., 0.2) controlling how far the ratio can deviate.
  • The final objective picks the more conservative value between the unclipped and clipped advantage.

3b. Why Clipping Helps

  • No separate penalty or line search needed—just a straightforward loss function.
  • Stable updates: If the policy tries to move too far, the clip “caps” the incentive, preventing destructive policy shifts.

4. PPO Algorithm Flow

  1. Collect Trajectories under $\pi_{\text{old}}$.
  2. Compute rewards-to-go, advantages, etc.
  3. Optimize clipped objective in minibatches for a few epochs:
    $$ L^\text{CLIP}(\theta)=\;\mathbb{E}\Bigl[\min\bigl(r_\theta\,A,\,\mathrm{clip}(r_\theta,\,1-\epsilon,1+\epsilon)\times A\bigr)\Bigr].$$
  4. Update $\theta$.
  5. Repeat until convergence (a condensed code sketch of this loop follows).
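
Putting the pieces together, here is a condensed sketch of this loop, reusing `clipped_surrogate_loss` from above. It assumes a policy object exposing a `log_prob(states, actions)` method and hypothetical helpers `collect_trajectories`, `compute_advantages`, and `iterate_minibatches`; none of these names come from a specific library.

```python
def train_ppo(policy, optimizer, env,
              iterations: int = 1000, epochs: int = 10,
              minibatch_size: int = 64, epsilon: float = 0.2):
    for _ in range(iterations):
        # 1. Collect trajectories under the current (old) policy.
        batch = collect_trajectories(policy, env)       # hypothetical helper
        # 2. Compute rewards-to-go and advantages (e.g., with GAE).
        advantages = compute_advantages(batch)           # hypothetical helper
        # 3. Optimize the clipped objective in minibatches for a few epochs.
        for _ in range(epochs):
            for mb in iterate_minibatches(batch, advantages, minibatch_size):
                new_log_probs = policy.log_prob(mb["states"], mb["actions"])
                loss = clipped_surrogate_loss(new_log_probs,
                                              mb["old_log_probs"],
                                              mb["advantages"],
                                              epsilon)
                # 4. Update theta with a first-order optimizer (e.g., Adam).
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        # 5. Repeat until convergence (or until the iteration budget runs out).
```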

5. Advantages of PPO

  • Easier Implementation: Avoids second-order methods or dynamic KL penalties.
  • Robust: Empirically effective across diverse tasks (Lunar Lander, Atari, MuJoCo, robotics).
  • Scalable: Integrates well with modern deep learning frameworks.

6. Summary

PPO retains TRPO’s core principle of bounded policy updates but simplifies it via a clipped surrogate. This approach has made PPO one of the most popular and successful on-policy RL algorithms:

  • TRPO → Strict trust region via second-order method & line search.
  • PPO v1 → First-order method with a KL penalty.
  • PPO v2 → Clipped surrogate objective for implicit trust regions.

Because of its balance of simplicity and performance, PPO has become a go-to method for many practical RL challenges.

Final Notes

  1. TRPO introduced a trust-region approach that guarantees stable policy improvements but requires second-order optimization.
  2. PPO simplified the approach by incorporating the KL constraint into the objective itself—either as a penalty (PPO v1) or by clipping (PPO v2).
  3. Both algorithms build upon the concept of a surrogate loss and bounded updates to ensure stable on-policy learning.

In practice, PPO often wins out for its ease of implementation, while TRPO provides a more theoretically grounded guarantee of monotonic improvement—though at higher computational cost.
