Fundamentals of Policy Gradient Methods

Deep Reinforcement Learning

Last updated: December 31, 2024

1. Introduction

Policy gradient methods focus on adjusting the parameters $\theta$ of a policy $\pi_\theta$ so that high-reward actions become more likely. A cornerstone technique is the Likelihood Ratio Policy Gradient (LRPG), which uses the log-likelihood trick to compute gradients of the expected return.

2. Objective

We define the expected return under policy $\pi_\theta$:

$$U(\theta) \;=\; \sum_{\tau} P_\theta(\tau)\; R(\tau),$$

where:

  • $\tau$ is a trajectory (a sequence of states and actions over time),
  • $P_\theta(\tau)$ is the probability of that trajectory under $\pi_\theta$,
  • $R(\tau)$ is the total reward accumulated along $\tau$.

Our goal is:

$$\theta^* = \arg\max_\theta\,U(\theta).$$
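For a finite-horizon trajectory $\tau = (s_0, a_0, s_1, a_1, \ldots, s_H)$, the trajectory probability factors into dynamics and policy terms,

$$P_\theta(\tau) \;=\; P(s_0)\,\prod_{t=0}^{H-1} P(s_{t+1} \mid s_t, a_t)\;\pi_\theta(a_t \mid s_t),$$

and the objective can equivalently be written as an expectation, $U(\theta) = \mathbb{E}_{\tau \sim P_\theta}[R(\tau)]$. Note that only the $\pi_\theta$ factors depend on $\theta$; this observation does the heavy lifting in the next section.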

3. Likelihood Ratio Policy Gradient (LRPG)

Using the identity $\nabla_\theta P_\theta(\tau) = P_\theta(\tau)\,\nabla_\theta \log P_\theta(\tau)$ (the log-likelihood trick):

$$\nabla_\theta U(\theta) \;=\; \sum_{\tau} \nabla_\theta P_\theta(\tau)\; R(\tau)\;=\; \sum_{\tau} P_\theta(\tau)\; \nabla_\theta \log P_\theta(\tau)\; R(\tau).$$

Because only the policy factors in $P_\theta(\tau)$ depend on $\theta$ (the dynamics terms vanish when we differentiate the log), the score of a trajectory reduces to

$$\nabla_\theta \log P_\theta(\tau)\;=\;\sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t).$$

Hence,

$$\nabla_\theta U(\theta) \;=\; \mathbb{E}_{\tau \sim P_\theta}\Bigl[R(\tau)\;\sum_{t=0}^{H-1}\nabla_\theta \log \pi_\theta(a_t \mid s_t)\Bigr].$$

Key insight: We don’t need to differentiate environment dynamics, only the policy.
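To make the score term concrete, here is a minimal NumPy sketch of $\nabla_\theta \log \pi_\theta(a \mid s)$ for a simple linear-softmax policy. The parameterization (logits $= s^\top \theta$), the array shapes, and the function names are illustrative assumptions, not part of the lesson's setup.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def score_function(theta, s, a):
    """Gradient of log pi_theta(a | s) for a (hypothetical) linear-softmax policy.

    theta: (state_dim, n_actions) weight matrix, so logits = s @ theta.
    For a softmax policy, d/d(logits) log pi(a | s) = onehot(a) - pi(. | s),
    and the chain rule turns this into an outer product with the state features.
    """
    probs = softmax(s @ theta)                 # pi_theta(. | s)
    one_hot = np.zeros_like(probs)
    one_hot[a] = 1.0
    return np.outer(s, one_hot - probs)        # same shape as theta
```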

4. Monte Carlo Estimation

In practice, summing over all trajectories is intractable. Instead, we sample $m$ trajectories $\tau^{(1)}, \ldots, \tau^{(m)}$ by rolling out $\pi_\theta$ and form an unbiased estimate of the gradient:

$$\nabla_\theta U(\theta) \;\approx\; \frac{1}{m}\,\sum_{i=1}^m\, R\bigl(\tau^{(i)}\bigr)\,\sum_{t=0}^{H-1}\,\nabla_\theta \log\pi_\theta\bigl(a_t^{(i)} \mid s_t^{(i)}\bigr).$$
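Building on the hypothetical `score_function` sketch above, a minimal REINFORCE-style estimator of this quantity might look as follows; the trajectory format (dicts with "states", "actions", and "rewards") is an assumption made for illustration.

```python
import numpy as np  # score_function is reused from the previous sketch

def reinforce_gradient(theta, trajectories):
    """Monte Carlo estimate of grad U(theta) from m sampled trajectories.

    Each trajectory is assumed to be a dict with equal-length lists under
    "states", "actions", and "rewards", collected by rolling out pi_theta.
    """
    grad = np.zeros_like(theta)
    for traj in trajectories:
        total_reward = sum(traj["rewards"])                    # R(tau^(i))
        for s, a in zip(traj["states"], traj["actions"]):
            grad += total_reward * score_function(theta, s, a)  # R * grad log pi
    return grad / len(trajectories)
```

A plain gradient-ascent step would then be `theta += learning_rate * reinforce_gradient(theta, trajectories)`.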

5. Key Takeaways

  • Model-Free: We don’t need the environment’s transition function.
  • Policy-Centric: All effort goes into improving $\pi_\theta$, which can be a neural network.
  • High Variance: Pure policy gradient estimates can be noisy. Subtracting a baseline (such as a value function) reduces variance without biasing the gradient, leading to Actor-Critic methods; a brief preview follows this list.
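As a preview of the baseline idea: subtracting any action-independent baseline $b$ leaves the gradient unbiased, because $\mathbb{E}_{a_t \sim \pi_\theta}\bigl[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\bigr] = 0$, so we may equally well estimate

$$\nabla_\theta U(\theta) \;=\; \mathbb{E}_{\tau \sim P_\theta}\Bigl[\bigl(R(\tau) - b\bigr)\sum_{t=0}^{H-1}\nabla_\theta \log \pi_\theta(a_t \mid s_t)\Bigr],$$

where a well-chosen $b$ (e.g., the average return) can substantially reduce variance.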

6. Summary

Policy gradients are the foundation for many on-policy algorithms, from REINFORCE to A2C/PPO. The approach’s flexibility allows it to handle complex reward structures and large action spaces, a crucial advantage for tasks like Lunar Lander. In future lessons, we’ll address practical improvements (e.g., baseline subtraction, advantage functions) and advanced algorithms (e.g., PPO).
