1. Introduction
In the last few lessons, we explored the limitations of methods like A3C, VPG, and GAE, particularly in achieving stable and efficient policy updates. We then introduced surrogate loss as a tool to guide optimization and discussed step sizing and trust regions to ensure updates remain stable and controlled.
Now, we take these ideas further with Trust Region Policy Optimization (TRPO). The main idea behind TRPO is to optimize a policy by maximizing a surrogate loss while ensuring the new policy stays close to the old one, measured using a KL divergence constraint. This prevents large, destabilizing updates and keeps performance consistent.
TRPO is a game-changer in reinforcement learning, combining stability with efficient learning, and it directly addresses many of the challenges we’ve seen so far. Let’s dive into how it works and why it’s such a powerful algorithm.
2. Trust Region Policy Optimization (TRPO)
Trust Region Policy Optimization (TRPO) is a powerful algorithm for reinforcement learning that addresses the challenges of unstable and inefficient policy updates. It optimizes policies by maximizing a surrogate loss while ensuring the updated policy ($\pi$) does not deviate significantly from the old policy ($\pi_{\text{old}}$) using a KL divergence constraint. This mechanism keeps policy updates stable, preventing large changes that could lead to performance drops.
2a. Objective: Maximizing the Surrogate Loss
The goal of TRPO is to optimize the following objective function:
$$
\max_\pi L(\pi) = \mathbb{E}_{\pi_{\text{old}}} \left[ \frac{\pi(a|s)}{\pi_{\text{old}}(a|s)} A^{\pi_{\text{old}}}(s, a) \right]
$$
Explanation:
- $\pi(a|s)$: The probability of taking action $a$ under the new policy $\pi$.
- $\pi_{\text{old}}(a|s)$: The probability of taking action $a$ under the old policy $\pi_{\text{old}}$.
- $A^{\pi_{\text{old}}}(s, a)$: The advantage function, which estimates the benefit of taking action $a$ in state $s$ compared to the average action.
This surrogate objective focuses on improving actions that were advantageous under the old policy, weighted by the ratio of the new and old policies.
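To make the objective concrete, here is a minimal PyTorch sketch of the surrogate loss; the tensor names and the dummy batch of four transitions are illustrative assumptions, not values from the lesson:

```python
# A minimal sketch of the TRPO surrogate loss, assuming we already have the log-probabilities
# of the sampled actions under the old and new policies plus advantage estimates.
import torch

def surrogate_loss(new_log_probs, old_log_probs, advantages):
    # Importance ratio pi(a|s) / pi_old(a|s), computed in log space for numerical stability.
    ratio = torch.exp(new_log_probs - old_log_probs)
    # Empirical expectation over samples drawn from pi_old; TRPO maximizes this quantity.
    return (ratio * advantages).mean()

# Dummy batch of four transitions (hypothetical numbers).
new_lp = torch.tensor([-1.2, -0.8, -1.5, -0.9], requires_grad=True)
old_lp = torch.tensor([-1.1, -0.9, -1.4, -1.0])
adv = torch.tensor([0.5, -0.2, 1.3, 0.1])
print(surrogate_loss(new_lp, old_lp, adv))
```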
2b. Constraint: KL Divergence
$$
\mathbb{E}_{\pi_{\text{old}}} \left[ KL(\pi_{\text{old}} \| \pi) \right] \leq \epsilon
$$
Explanation:
- The KL divergence constraint ensures that the new policy $\pi$ remains close to the old policy $\pi_{\text{old}}$.
- $\epsilon$ is a small threshold controlling how much the policy can change in one update, maintaining stability and preventing large, destabilizing jumps.
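As a rough illustration, the sketch below checks this constraint for a discrete-action policy, assuming the full action probabilities of both policies are available at each sampled state; the probabilities and the threshold value are made up for the example:

```python
# A sketch of checking the KL trust-region constraint for discrete action distributions.
import torch

def mean_kl(old_probs, new_probs, eps=1e-8):
    # KL(pi_old || pi) per sampled state, then averaged over the batch.
    kl_per_state = (old_probs * (torch.log(old_probs + eps) - torch.log(new_probs + eps))).sum(dim=-1)
    return kl_per_state.mean()

old_probs = torch.tensor([[0.70, 0.30], [0.40, 0.60]])   # pi_old(a|s) for two sampled states
new_probs = torch.tensor([[0.65, 0.35], [0.45, 0.55]])   # pi(a|s) after a candidate update
epsilon = 0.01                                           # trust-region threshold (a hyperparameter)
print(mean_kl(old_probs, new_probs).item(), mean_kl(old_probs, new_probs) <= epsilon)
```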
3. Algorithm Flow
3a. Run Policy for T Timesteps or N Trajectories
Collect trajectories (sequences of states, actions, and rewards) by interacting with the environment using the current policy.
Mathematical Output: $$\{(s_t, a_t, r_t, s_{t+1})\}_{t=1}^{T}$$
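A minimal collection loop might look like the sketch below, which uses a Gymnasium CartPole environment and a random action as a stand-in for sampling from the current policy $\pi_{\text{old}}$:

```python
# A sketch of collecting one trajectory; the environment and the random action choice
# are placeholders for the task and policy used in practice.
import gymnasium as gym

env = gym.make("CartPole-v1")
state, _ = env.reset(seed=0)
trajectory = []  # list of (s_t, a_t, r_t, s_{t+1}) tuples
done = False
while not done:
    action = env.action_space.sample()            # replace with a ~ pi_old(.|s) in practice
    next_state, reward, terminated, truncated, _ = env.step(action)
    trajectory.append((state, action, reward, next_state))
    state = next_state
    done = terminated or truncated
print(f"collected {len(trajectory)} transitions")
```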
3b. Estimate Advantage Function
Compute $A^{\pi_{\text{old}}}(s, a)$ using a baseline $V(s)$ (e.g., the value function): $$ A^{\pi_{\text{old}}}(s, a) = Q(s, a) - V(s) $$
Where $Q(s, a)$ is the expected discounted sum of future rewards after taking action $a$ in state $s$: $$ Q(s, a) = \mathbb{E}\left[ \sum_{t=0}^\infty \gamma^t r_t \,\middle|\, s_0 = s,\, a_0 = a \right]. $$
Explanation:
The advantage function measures how much better taking action $a$ is compared to the baseline expectation for that state.
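One simple way to implement this estimate is to accumulate discounted returns backwards through a trajectory and subtract the baseline values; the NumPy sketch below assumes the rewards and the value predictions are already available as arrays:

```python
# A sketch of the simple advantage estimate A(s_t, a_t) = G_t - V(s_t), where G_t is the
# discounted return from time t and V comes from a learned baseline.
import numpy as np

def advantages(rewards, values, gamma=0.99):
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running    # G_t = r_t + gamma * G_{t+1}
        returns[t] = running
    return returns - values                       # A_t = G_t - V(s_t)

rewards = np.array([1.0, 1.0, 1.0, 0.0])
values = np.array([2.5, 1.8, 1.0, 0.2])           # hypothetical baseline predictions
print(advantages(rewards, values))
```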
3c. Compute Policy Gradient ($ g $)
The policy gradient is computed using the surrogate loss:
$$
g = \nabla_\theta \mathbb{E}_{\pi_{\text{old}}} \left[ \frac{\pi(a|s)}{\pi_{\text{old}}(a|s)} A^{\pi_{\text{old}}}(s, a) \right]
$$
Explanation:
This gradient provides the direction in parameter space $\theta$ to improve the policy.
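In practice this gradient is obtained with automatic differentiation. The sketch below uses a tiny linear policy (purely illustrative) to show the pattern: detach the old log-probabilities, build the surrogate from the ratio, and differentiate with respect to $\theta$:

```python
# A sketch of computing g, the gradient of the surrogate loss with respect to the policy
# parameters; the linear policy and random data are stand-ins for a real network and rollout.
import torch

policy = torch.nn.Linear(4, 2)                    # maps 4 state features to 2 action logits
states = torch.randn(8, 4)                        # batch of sampled states
actions = torch.randint(0, 2, (8,))               # sampled actions
advantages = torch.randn(8)                       # advantage estimates

log_probs = torch.log_softmax(policy(states), dim=-1).gather(1, actions.unsqueeze(1)).squeeze(1)
old_log_probs = log_probs.detach()                # freeze pi_old(a|s) at the current parameters

surrogate = (torch.exp(log_probs - old_log_probs) * advantages).mean()
g = torch.autograd.grad(surrogate, tuple(policy.parameters()))  # the policy gradient g
print([grad.shape for grad in g])
```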
3d. Use Conjugate Gradient (CG) to Compute $F^{-1}g$
Compute the step direction $\delta \theta$ using the Fisher information matrix $F$:
$$
F = \mathbb{E}_{\pi_{\text{old}}} \left[ \nabla_\theta \log \pi_\theta(a|s) \, \nabla_\theta \log \pi_\theta(a|s)^\top \right]
$$
The CG method approximates $F^{-1}g$ efficiently using only Fisher-vector products (equivalently, Hessian-vector products of the mean KL divergence), without ever forming or inverting $F$; see the sketch below.
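A generic conjugate gradient routine only needs a function that returns the product $Fv$ for a given vector $v$. The NumPy sketch below shows the routine itself, checked against a small explicit matrix standing in for $F$:

```python
# A sketch of conjugate gradient for approximating F^{-1} g. Only a matrix-vector product
# fvp(v) = F @ v is needed, so F is never formed or inverted explicitly.
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    x = np.zeros_like(g)            # current estimate of F^{-1} g
    r = g.copy()                    # residual g - F x (x starts at zero)
    p = g.copy()                    # search direction
    rs_old = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = rs_old / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Toy check: a small symmetric positive-definite matrix standing in for the Fisher matrix.
F = np.array([[2.0, 0.3], [0.3, 1.0]])
g = np.array([1.0, -0.5])
print(conjugate_gradient(lambda v: F @ v, g), np.linalg.solve(F, g))
```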
3e. Line Search on Surrogate Loss and KL Constraint
Perform a line search to find the step size $\alpha$ that satisfies:
$$
L(\pi_{\text{new}}) > L(\pi_{\text{old}})
$$
while ensuring: $$ \mathbb{E}_{\pi_{\text{old}}} \left[ KL(\pi_{\text{old}} \| \pi_{\text{new}}) \right] \leq \epsilon $$
Explanation:
- The line search adjusts $\alpha$ to balance maximizing the surrogate loss and staying within the KL constraint.
- This ensures stable and reliable policy updates.
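A common implementation is a backtracking line search: start from the full step, shrink it until both conditions hold, and fall back to the old parameters if no step qualifies. The sketch below assumes the surrogate, the KL estimate, and the step direction are supplied by the caller; the quadratic toy functions at the end only illustrate the call pattern:

```python
# A sketch of the backtracking line search over the step size alpha.
import numpy as np

def line_search(theta_old, full_step, surrogate, kl, epsilon=0.01, backtrack=0.5, max_tries=10):
    L_old = surrogate(theta_old)
    alpha = 1.0
    for _ in range(max_tries):
        theta_new = theta_old + alpha * full_step
        # Accept the first step that improves the surrogate and respects the KL constraint.
        if surrogate(theta_new) > L_old and kl(theta_old, theta_new) <= epsilon:
            return theta_new
        alpha *= backtrack            # otherwise shrink the step and try again
    return theta_old                  # no acceptable step found: keep the old parameters

# Toy example with made-up surrogate and KL functions.
surrogate = lambda th: -np.sum((th - 1.0) ** 2)
kl = lambda a, b: 0.5 * np.sum((a - b) ** 2)
print(line_search(np.zeros(2), np.array([0.3, 0.3]), surrogate, kl))
```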
3f. End of Iteration
Repeat the steps for each iteration until convergence or a predefined number of updates is reached.
By combining the surrogate loss, KL constraint, and efficient optimization, TRPO achieves stable and effective policy improvement.
4. Evaluating KL Divergence in TRPO
Now, let us dive deeper into how TRPO evaluates this constraint mathematically and why it is practical for reinforcement learning.
4a. Key Steps in Evaluating the KL Divergence
- Trajectory Probability: The probability of a trajectory $\tau$ under the policy parameters $\theta$ is defined as:
$$
P(\tau; \theta) = P(s_0) \prod_{t=0}^{H-1} \pi_\theta(u_t | s_t) P(s_{t+1} | s_t, u_t)
$$
Here, $P(s_0)$ is the probability of the initial state, $\pi_\theta(u_t | s_t)$ is the policy distribution, and $P(s_{t+1} | s_t, u_t)$ models the dynamics of the environment.
- KL Divergence Between Policies: The KL divergence between the current policy $P(\tau; \theta)$ and a slightly updated policy $P(\tau; \theta + \delta\theta)$ is given by:
$$
KL(P(\tau; \theta) \| P(\tau; \theta + \delta\theta)) = \sum_{\tau} P(\tau; \theta) \log \frac{P(\tau; \theta)}{P(\tau; \theta + \delta\theta)}
$$
- Simplification Using Dynamics: Substituting the trajectory probability, the dynamics terms $P(s_{t+1} | s_t, u_t)$ cancel out, leaving:
$$
\sum_\tau P(\tau; \theta) \log \frac{\prod_{t=0}^{H-1} \pi_\theta(u_t | s_t)}{\prod_{t=0}^{H-1} \pi_{\theta + \delta\theta}(u_t | s_t)}
$$
- Expectation Approximation: The KL divergence can then be approximated using samples from the current policy:
$$
\frac{1}{M} \sum_{(s, u)} \log \frac{\pi_\theta(u | s)}{\pi_{\theta + \delta\theta}(u | s)}
$$
Here, $M$ is the number of samples from the rollout data.
This formulation allows TRPO to efficiently compute the KL divergence using the data it collects, ensuring stability in policy optimization.
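As a small illustration, this estimate is just the average log-ratio of the old and new action probabilities over the rollout samples; the numbers in the sketch below are made up:

```python
# A sketch of the sample-based KL estimate: (1/M) * sum over samples of
# log( pi_theta(u|s) / pi_{theta+dtheta}(u|s) ).
import numpy as np

def kl_estimate(old_log_probs, new_log_probs):
    return np.mean(old_log_probs - new_log_probs)

old_lp = np.log(np.array([0.60, 0.30, 0.80, 0.50]))   # pi_theta(u|s) for each sampled (s, u)
new_lp = np.log(np.array([0.55, 0.35, 0.75, 0.50]))   # pi_{theta+dtheta}(u|s) for the same samples
print(kl_estimate(old_lp, new_lp))
```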
5. Limitations of TRPO
While TRPO is a significant advancement in reinforcement learning, it comes with its own set of challenges:
- Second-Order Optimization: TRPO relies on second-order optimization to enforce the KL divergence constraint. This requires conjugate gradient iterations built on Hessian-vector products, which can be computationally expensive and challenging to implement.
- Complexity of the Conjugate Gradient Method: Using conjugate gradient to solve the constrained update increases the algorithm's complexity, and it cannot take advantage of the first-order optimizers built into modern frameworks like TensorFlow or PyTorch.
- Limited Flexibility: TRPO struggles with stochastic networks (e.g., networks using dropout) and with parameter sharing between the policy and value functions, both of which are common in modern reinforcement learning architectures.
- Implementation Overhead: Because of its reliance on second-order methods and conjugate gradient, TRPO can be harder to implement and tune, especially for large-scale neural networks.
These limitations paved the way for a more practical approach to policy optimization.
6. Summary
Trust Region Policy Optimization (TRPO) is a reinforcement learning algorithm designed to ensure stable policy updates by constraining the KL divergence between the new policy and the old policy. It uses a surrogate loss function to optimize the policy while maintaining a trust region that prevents overly large updates. However, TRPO requires second-order optimization methods, like conjugate gradient, which can be complex and computationally demanding.
In the next lesson, we’ll explore Proximal Policy Optimization (PPO). PPO simplifies TRPO by using a clipped surrogate objective, eliminating the need for second-order optimization. This innovation makes PPO more efficient, easier to implement, and one of the most widely adopted algorithms in reinforcement learning today.