1. Introduction
Policy gradient methods adjust the parameters $\theta$ of a policy $\pi_\theta$ so that high-reward actions become more likely. A cornerstone technique is the Likelihood Ratio Policy Gradient (LRPG), which uses the log-likelihood (likelihood ratio) trick to compute the gradient of the expected return with respect to $\theta$.
2. Objective
We define the expected return under policy $\pi_\theta$:
$$U(\theta) \;=\; \sum_{\tau} P_\theta(\tau)\; R(\tau),$$
where:
- $\tau$ is a trajectory (states and actions over time),
- $P_\theta(\tau)$ is its probability under $\pi_\theta$,
- $R(\tau)$ is the total reward.
Our goal is:
$$\theta^* = \arg\max_\theta\,U(\theta).$$
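To make the objective concrete, here is a minimal sketch that estimates $U(\theta)$ by averaging sampled returns. The 2-state, 2-action MDP and the tabular softmax policy below are hypothetical choices made purely for illustration; nothing in the definition of $U(\theta)$ depends on them.

```python
# A minimal sketch of U(theta) for a toy 2-state, 2-action MDP (hypothetical,
# for illustration only) with a tabular softmax policy.
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS, HORIZON = 2, 2, 10

# Hypothetical dynamics: P[s, a] is a distribution over next states.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.8, 0.2], [0.1, 0.9]]])
# Hypothetical rewards: being in state 1 pays +1.
reward = np.array([0.0, 1.0])

def policy_probs(theta, s):
    """Softmax policy pi_theta(. | s) for a tabular parameter theta[s, a]."""
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def sample_trajectory(theta):
    """Roll out one trajectory tau and return its total reward R(tau)."""
    s, total = 0, 0.0
    for _ in range(HORIZON):
        a = rng.choice(N_ACTIONS, p=policy_probs(theta, s))
        s = rng.choice(N_STATES, p=P[s, a])
        total += reward[s]
    return total

# U(theta) is an expectation over trajectories; estimate it by averaging returns.
theta = np.zeros((N_STATES, N_ACTIONS))
returns = [sample_trajectory(theta) for _ in range(1000)]
print("estimated U(theta):", np.mean(returns))
```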
3. Likelihood Ratio Policy Gradient (LRPG)
Using the likelihood ratio trick, $\nabla_\theta P_\theta(\tau) = P_\theta(\tau)\,\nabla_\theta \log P_\theta(\tau)$:
$$\nabla_\theta U(\theta) \;=\; \sum_{\tau} \nabla_\theta P_\theta(\tau)\; R(\tau)\;=\; \sum_{\tau} P_\theta(\tau)\; \nabla_\theta \log P_\theta(\tau)\; R(\tau).$$
Because the trajectory probability factors as
$$P_\theta(\tau) \;=\; P(s_0)\,\prod_{t=0}^{H-1} \pi_\theta(a_t \mid s_t)\; P(s_{t+1} \mid s_t, a_t),$$
and the dynamics terms do not depend on $\theta$, they drop out of the gradient:
$$\nabla_\theta \log P_\theta(\tau)\;=\;\sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t).$$
Hence,
$$\nabla_\theta U(\theta) \;=\; \mathbb{E}_{\tau \sim P_\theta}\Bigl[R(\tau)\;\sum_{t=0}^{H-1}\nabla_\theta \log \pi_\theta(a_t \mid s_t)\Bigr].$$
Key insight: We don’t need to differentiate environment dynamics, only the policy.
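As a concrete instance, the sketch below computes $\nabla_\theta \log \pi_\theta(a \mid s)$ for an assumed tabular softmax parameterization (one of many possible policy classes) and checks it against a finite-difference approximation. Note that the environment dynamics never enter the computation.

```python
# A minimal sketch of grad_theta log pi_theta(a | s) for a tabular softmax
# policy (an assumed parameterization, not prescribed by these notes),
# checked against a finite-difference approximation.
import numpy as np

def policy_probs(theta, s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def grad_log_pi(theta, s, a):
    """Closed form for softmax: one_hot(a) - pi_theta(. | s), placed in row s."""
    g = np.zeros_like(theta)
    g[s] = -policy_probs(theta, s)
    g[s, a] += 1.0
    return g

theta = np.array([[0.5, -0.2], [0.1, 0.3]])
s, a = 0, 1

# Finite-difference check of grad log pi (environment dynamics never appear).
eps, numeric = 1e-5, np.zeros_like(theta)
for i in range(theta.shape[0]):
    for j in range(theta.shape[1]):
        tp, tm = theta.copy(), theta.copy()
        tp[i, j] += eps
        tm[i, j] -= eps
        numeric[i, j] = (np.log(policy_probs(tp, s)[a]) -
                         np.log(policy_probs(tm, s)[a])) / (2 * eps)

print("analytic:\n", grad_log_pi(theta, s, a))
print("numeric:\n", numeric)
```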
4. Monte Carlo Estimation
In practice, we can’t sum over all trajectories. We sample $m$ trajectories from $\pi_\theta$ and estimate the gradient:
$$\nabla_\theta U(\theta) \;\approx\; \frac{1}{m}\,\sum_{i=1}^m\, R(\tau^{(i)})\,\sum_{t=0}^{H-1}\,\nabla_\theta \log\pi_\theta\bigl(a_t^{(i)} \mid s_t^{(i)}\bigr),$$
where $(s_t^{(i)}, a_t^{(i)})$ are the states and actions of the $i$-th sampled trajectory.
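Below is a minimal sketch of this estimator, again assuming the toy 2-state, 2-action MDP and tabular softmax policy from the earlier sketches (both hypothetical, chosen only so the example runs end to end).

```python
# A minimal sketch of the Monte Carlo gradient estimate above, using a toy
# MDP and a tabular softmax policy (both hypothetical, for illustration only).
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS, HORIZON, M = 2, 2, 10, 500

P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.8, 0.2], [0.1, 0.9]]])   # hypothetical dynamics
reward = np.array([0.0, 1.0])              # hypothetical rewards

def policy_probs(theta, s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def grad_log_pi(theta, s, a):
    g = np.zeros_like(theta)
    g[s] = -policy_probs(theta, s)
    g[s, a] += 1.0
    return g

def rollout(theta):
    """Sample one trajectory; return R(tau) and sum_t grad log pi(a_t | s_t)."""
    s, R, score = 0, 0.0, np.zeros_like(theta)
    for _ in range(HORIZON):
        a = rng.choice(N_ACTIONS, p=policy_probs(theta, s))
        score += grad_log_pi(theta, s, a)
        s = rng.choice(N_STATES, p=P[s, a])
        R += reward[s]
    return R, score

def policy_gradient_estimate(theta, m=M):
    """(1/m) sum_i R(tau_i) * sum_t grad log pi(a_t^i | s_t^i)."""
    grad = np.zeros_like(theta)
    for _ in range(m):
        R, score = rollout(theta)
        grad += R * score
    return grad / m

theta = np.zeros((N_STATES, N_ACTIONS))
print(policy_gradient_estimate(theta))
```

Plugging this estimate into gradient ascent, $\theta \leftarrow \theta + \alpha\,\nabla_\theta U(\theta)$, gives the basic REINFORCE update.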
5. Key Takeaways
- Model-Free: We don’t need the environment’s transition function.
- Policy-Centric: All effort goes into improving $\pi_\theta$, which can be a neural network.
- High Variance: Pure policy gradient estimates can be noisy. Subtracting a baseline (such as a value function) reduces variance without biasing the estimate, leading to Actor-Critic methods; see the sketch after this list.
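Here is a minimal sketch of baseline subtraction, using the batch-mean return as an assumed baseline (one common choice among many). It consumes the same $(R, \text{score})$ pairs that a rollout routine like the one sketched in Section 4 would produce.

```python
# A minimal sketch of baseline subtraction (an assumed batch-mean baseline),
# operating on (R, score) pairs from a rollout routine like the Section 4 sketch.
import numpy as np

def pg_with_baseline(samples):
    """samples: list of (R_i, score_i) pairs, where score_i = sum_t grad log pi.

    Subtracting a constant baseline b leaves the estimate unbiased, since
    E[grad log pi] = 0, but it can substantially reduce variance.
    """
    returns = np.array([R for R, _ in samples])
    b = returns.mean()                      # batch-mean baseline
    grad = sum((R - b) * score for R, score in samples)
    return grad / len(samples)

# Tiny synthetic batch just to show the call shape (not real rollouts).
fake = [(1.0, np.array([[0.1, -0.1], [0.0, 0.0]])),
        (3.0, np.array([[-0.2, 0.2], [0.0, 0.0]]))]
print(pg_with_baseline(fake))
```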
6. Summary
Policy gradients are the foundation for many on-policy algorithms, from REINFORCE to A2C/PPO. The approach's flexibility allows it to handle complex reward structures and large action spaces, a crucial advantage for tasks like Lunar Lander. In future lessons, we'll address practical improvements (e.g., baseline subtraction, advantage functions) and advanced algorithms (e.g., PPO).