1. Introduction
Proximal Policy Optimization (PPO) was introduced to address the complexity of TRPO by simplifying the optimization process. PPO retains the key ideas of TRPO—like stable policy updates and trust regions—but eliminates the need for second-order optimization. Instead, PPO incorporates the KL divergence directly into the objective function as a penalty, transforming the constrained optimization problem into an unconstrained one. This simplification allows PPO to leverage standard first-order optimization methods like Adam, making it easier to implement and computationally efficient.
PPO has quickly become one of the most popular algorithms in reinforcement learning due to its balance of simplicity, scalability, and performance.
2. PPO v1: Simplifying TRPO
The first version of PPO, PPO v1, modifies TRPO by embedding the KL-divergence constraint directly into the objective function as a penalty term:
$$
\max_\pi \mathbb{E}_{\pi_{\text{old}}}\left[ \frac{\pi(a|s)}{\pi_{\text{old}}(a|s)} A^{\pi_{\text{old}}}(s, a) - \beta \cdot \text{KL}(\pi || \pi_{\text{old}}) \right]
$$
Key Terms:
- $\beta$: A tunable penalty parameter controlling the importance of the KL term.
- $\text{KL}(\pi || \pi_{\text{old}})$: Measures how much the new policy deviates from the old policy.
- The rest of the terms ($A^{\pi_{\text{old}}}$, importance ratio) remain the same as in TRPO.
Instead of solving a constrained optimization problem with second-order methods, PPO v1 turns the KL constraint into a penalty term and uses first-order optimizers like SGD, Adam, or RMSProp. This makes PPO v1 far more accessible and efficient, especially for large-scale reinforcement learning tasks.
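To make this concrete, here is a minimal sketch of how the penalized objective can be turned into a loss for a first-order optimizer. It assumes a discrete action space, PyTorch, and illustrative tensor names (`logits_new`, `logits_old`, `actions`, `advantages`); it is an illustration of the idea, not a reference implementation.

```python
import torch
from torch.distributions import Categorical, kl_divergence

def ppo_v1_loss(logits_new, logits_old, actions, advantages, beta):
    """KL-penalized surrogate objective, negated so it can be minimized with Adam."""
    dist_new = Categorical(logits=logits_new)            # current policy pi
    dist_old = Categorical(logits=logits_old.detach())   # frozen old policy pi_old

    # Importance ratio pi(a|s) / pi_old(a|s), computed in log space for stability.
    ratio = torch.exp(dist_new.log_prob(actions) - dist_old.log_prob(actions))

    surrogate = ratio * advantages            # surrogate objective term
    kl = kl_divergence(dist_new, dist_old)    # KL(pi || pi_old) per state

    # Maximize E[surrogate - beta * KL]  <=>  minimize its negative.
    return -(surrogate - beta * kl).mean()
```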
2a. Pseudocode for PPO v1
For each iteration:
- Run the current policy for $T$ timesteps or $N$ trajectories.
- Estimate the *advantage function* $A(s, a)$ at each timestep.
- Optimize the PPO v1 objective using first-order methods (e.g., Adam) for a fixed number of epochs.
- Monitor the KL-divergence between the new and old policies:
- If $\text{KL}$ is too large (exceeds threshold $\epsilon$), increase $\beta$.
- If $\text{KL}$ is too small, decrease $\beta$.
This adjustment of $\beta$ dynamically controls the strength of the KL penalty, ensuring stable updates while maximizing the surrogate objective.
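A minimal sketch of this adaptive penalty is shown below. The measured KL value `kl`, the threshold `kl_target` (the $\epsilon$ above), and the scaling factors are illustrative; the 1.5 / 2.0 constants follow a common heuristic rather than a required setting.

```python
def adapt_beta(beta, kl, kl_target, factor=2.0):
    """Heuristic beta update applied once per policy iteration (illustrative values)."""
    if kl > 1.5 * kl_target:      # policy moved too far: strengthen the penalty
        beta *= factor
    elif kl < kl_target / 1.5:    # policy barely moved: relax the penalty
        beta /= factor
    return beta
```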
2b. Advantages of PPO v1
- Simplified Optimization: Eliminates the need for second-order methods, enabling the use of standard first-order optimizers.
- Efficient and Scalable: Works seamlessly with modern deep learning frameworks and large neural networks.
- Dynamic KL Control: The penalty parameter $\beta$ adjusts automatically to balance stability and performance.
PPO v1 makes policy optimization simpler, more efficient, and easier to integrate into modern RL workflows.
In the next section, we will explore PPO v2, a refinement of PPO v1 that introduces a clever clipping mechanism to further simplify the algorithm while enhancing its performance and robustness.
3. PPO v2: Clipped Surrogate Loss
Proximal Policy Optimization (PPO) v2 builds upon PPO v1 by further simplifying the optimization process. The key innovation is the introduction of a clipped surrogate loss, which ensures stable policy updates without requiring explicit KL divergence penalties or dynamic parameter adjustments. Instead, PPO v2 enforces trust regions through a simple clipping mechanism, making the algorithm easier to implement and computationally efficient.
3a. Mathematical Formulation
Ratio of Policy Probabilities
The ratio between the new policy and the old policy is defined as:
$$
r_\theta(a|s) = \frac{\pi_\theta(a|s)}{\pi_{\text{old}}(a|s)}
$$
- If $\pi_\theta(a|s) = \pi_{\text{old}}(a|s)$, then $r_\theta(a|s) = 1$.
- As the policy updates, $r_\theta(a|s)$ deviates from 1, reflecting how much the new policy differs from the old one.
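In practice the ratio is usually computed from stored log-probabilities rather than raw probabilities. A tiny sketch with illustrative values:

```python
import torch

# Log-probabilities recorded for a batch of (s, a) pairs (illustrative numbers).
log_probs_old = torch.tensor([-1.20, -0.70, -2.30])  # log pi_old(a|s) at collection time
log_probs_new = torch.tensor([-1.10, -0.90, -2.30])  # log pi_theta(a|s) under current policy

# r_theta(a|s) = pi_theta(a|s) / pi_old(a|s), computed in log space for numerical stability.
ratio = torch.exp(log_probs_new - log_probs_old)
print(ratio)  # values near 1 mean the new policy is close to the old one
```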
Clipped Objective
PPO v2 optimizes the following clipped surrogate loss:
$$
L^{\text{CLIP}}(\theta) = \mathbb{E}\left[\min\left(r_\theta(a|s) \cdot A(s, a), \, \text{clip}\left(r_\theta(a|s), 1 - \epsilon, 1 + \epsilon\right) \cdot A(s, a)\right)\right]
$$
- $r_\theta(a|s)$: The probability ratio.
- $A(s, a)$: The advantage function, indicating how favorable an action is relative to a baseline.
- $\epsilon$: A small hyperparameter that defines the trust region (e.g., $\epsilon = 0.2$).
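Putting these pieces together, a minimal PyTorch sketch of the clipped surrogate loss (negated so it can be minimized; tensor names are illustrative):

```python
import torch

def clipped_surrogate_loss(log_probs_new, log_probs_old, advantages, eps=0.2):
    """PPO v2 clipped surrogate loss, negated so it can be minimized with Adam."""
    ratio = torch.exp(log_probs_new - log_probs_old.detach())      # r_theta(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Pessimistic bound: take the smaller of the two terms for every sample.
    return -torch.min(unclipped, clipped).mean()
```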
3b. Key Intuition
- Clipping Prevents Large Updates: The clipping mechanism limits $r_\theta(a|s)$ to the range $[1 - \epsilon, 1 + \epsilon]$. This prevents excessively large updates to the policy, ensuring stability during training.
- Pessimistic Policy Improvement: The $\min$ always selects the smaller (more conservative) of the unclipped and clipped terms, so the objective acts as a lower bound on the surrogate and reduces the risk of over-optimistic updates.
- Balanced Updates: Positive advantages $A(s, a) > 0$ encourage increasing the probability of good actions, while negative advantages $A(s, a) < 0$ discourage poor actions—but only within the defined bounds.
3c. Example of Clipping
- If $r_\theta(a|s)$ lies within $[1 - \epsilon, 1 + \epsilon]$, the loss remains unmodified, allowing normal updates.
- If $r_\theta(a|s)$ moves outside these bounds in the direction favored by the advantage, the clipped term is selected by the $\min$; its gradient is zero, so the update cannot push the policy any further from the old policy on that sample.
This mechanism eliminates the need for dynamic KL penalty adjustments used in PPO v1, offering a simpler and more robust approach.
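As a concrete instance, take $\epsilon = 0.2$ and $A(s, a) = 1$ (illustrative numbers). For a ratio of $1.5$:
$$
\min\left(1.5 \cdot 1, \ \text{clip}(1.5,\, 0.8,\, 1.2) \cdot 1\right) = \min(1.5,\ 1.2) = 1.2,
$$
so the sample contributes only the clipped value $1.2$, which no longer depends on $\theta$, and its gradient is zero. For $r_\theta(a|s) = 0.5$ with the same advantage, the unclipped term $0.5$ is the smaller one, so the gradient is preserved and the update can move the policy back toward the old one.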
3d. Pseudocode for PPO v2
- Initialize policy parameters $\theta$, value function parameters $\phi$, and hyperparameters ($\epsilon, \text{epochs}, \text{batch size}$).
- For each iteration:
- Collect trajectories by running the current policy $\pi_\theta$.
- Compute rewards-to-go and advantage estimates $\hat{A}_t$.
- For a specified number of epochs:
- Sample minibatches of trajectories.
- Compute the clipped surrogate loss:
$$
L(\theta) = \mathbb{E}\left[\min\left(r_\theta(a|s)\,\hat{A}_t, \ \text{clip}\left(r_\theta(a|s), 1-\epsilon, 1+\epsilon\right)\hat{A}_t\right)\right]
$$
- Update $\theta$ using a first-order optimizer (e.g., Adam) to maximize the objective.
- Optionally, update the value function $V_\phi$ by minimizing a regression loss.
The overall structure of the pseudocode is the same for PPO v1 and v2, but the optimization step differs. PPO v2 simplifies the process by removing the need for a dynamically adjusted KL-divergence penalty and instead uses the clipping mechanism, which is easier to implement and computationally efficient.
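For completeness, here is a minimal sketch of the inner optimization step described above. It assumes the collected batch has already been flattened into tensors (`obs`, `actions`, `log_probs_old`, `returns`, `advantages`), that `policy` and `value_fn` are ordinary PyTorch modules with `policy` returning action logits, and that `optimizer` covers the parameters of both; all names and constants are illustrative.

```python
import torch
from torch.distributions import Categorical

def ppo_v2_update(policy, value_fn, optimizer, obs, actions, log_probs_old,
                  returns, advantages, eps=0.2, epochs=10, batch_size=64, vf_coef=0.5):
    """One PPO v2 optimization phase over a collected batch of experience."""
    n = obs.shape[0]
    for _ in range(epochs):
        perm = torch.randperm(n)                     # shuffle the batch each epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]

            # Re-evaluate the current policy on the minibatch.
            dist = Categorical(logits=policy(obs[idx]))
            log_probs_new = dist.log_prob(actions[idx])

            # Clipped surrogate loss (negated for minimization).
            ratio = torch.exp(log_probs_new - log_probs_old[idx])
            unclipped = ratio * advantages[idx]
            clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages[idx]
            policy_loss = -torch.min(unclipped, clipped).mean()

            # Value function regression toward the rewards-to-go.
            value_loss = (value_fn(obs[idx]).squeeze(-1) - returns[idx]).pow(2).mean()

            loss = policy_loss + vf_coef * value_loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```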
4. Advantages of PPO
- Simpler Implementation: Clipping avoids complex calculations like KL-divergence penalties or second-order optimization.
- Stable Updates: The clipping mechanism enforces trust regions naturally, ensuring gradual policy updates.
- Efficient and Scalable: PPO v2 is easy to integrate with modern deep learning frameworks and scales well to large neural networks.
PPO v2 has become one of the most widely used reinforcement learning algorithms, offering an excellent balance of simplicity, stability, and performance.
5. Summary
Proximal Policy Optimization (PPO) v2 refines policy optimization by introducing a clipped surrogate loss, a simpler yet effective way to constrain policy updates. Instead of relying on explicit KL divergence penalties like PPO v1, PPO v2 directly limits the deviation of the new policy from the old policy through clipping. This approach ensures stability, simplifies implementation, and maintains computational efficiency. By enforcing trust regions implicitly, PPO v2 offers a robust and scalable solution for modern reinforcement learning tasks, making it one of the most popular algorithms today.
In the following lessons, we’ll dive into the implementation of PPO, demonstrating how the clipped surrogate loss operates in practice and why it is so effective in reinforcement learning.