1. Introduction
Policy gradient methods aim to optimize the parameters θ of a policy πθ (often represented by a neural network) to maximize the expected cumulative reward. Unlike value-based methods, which rely on estimating value functions, policy gradient methods directly adjust the policy parameters to favor rewarding trajectories.
The Likelihood Ratio Policy Gradient (LRPG) is a key foundation of policy gradient algorithms. By leveraging the log-likelihood trick, LRPG enables efficient computation of gradients, making it practical for optimizing policies in complex environments.
2. Objective
The primary goal of policy gradient methods is to find the optimal parameters θ∗ that maximize the expected reward U(θ):
$$\theta^* = \arg\max_\theta U(\theta), $$
where the expected reward U(θ) is defined as:
$$U(\theta) = \sum_{\tau} P_\theta(\tau) R(\tau), $$
with:
- $\tau = (s_0, u_0, s_1, u_1, \dots, s_H, u_H)$: a trajectory consisting of a sequence of states $s_t$ and actions $u_t$,
- $P_\theta(\tau)$: the probability of a trajectory under the policy $\pi_\theta$,
- $R(\tau) = \sum_{t=0}^{H} R(s_t, u_t)$: the total reward for trajectory $\tau$.
The Likelihood Ratio Policy Gradient focuses on modifying Pθ(τ) to increase the likelihood of high-reward trajectories.
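As a concrete illustration of the objective, U(θ) can be estimated by averaging trajectory returns under the current policy. The sketch below is a minimal example, assuming a hypothetical `sample_trajectory(policy)` helper that rolls out πθ once and returns a list of (state, action, reward) tuples; these names are illustrative, not part of the formulation above.

```python
import numpy as np

def trajectory_return(trajectory):
    # R(tau) = sum_t R(s_t, u_t): total reward collected along one trajectory.
    return sum(reward for _, _, reward in trajectory)

def estimate_expected_reward(policy, sample_trajectory, n_samples=1000):
    # Monte Carlo estimate of U(theta) = E_{tau ~ P_theta}[R(tau)], obtained by
    # averaging the returns of trajectories sampled under the current policy.
    returns = [trajectory_return(sample_trajectory(policy)) for _ in range(n_samples)]
    return float(np.mean(returns))
```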
2b. Intuition
The core idea is to compute the gradient of U(θ) with respect to θ and use it to iteratively update the policy πθ to favor better trajectories. The policy gradient adjusts the probabilities of trajectories based on their rewards:
- High-Reward Trajectories: the gradient term R(τ)∇θlogPθ(τ) is weighted heavily, increasing the likelihood of similar trajectories.
- Low-Reward Trajectories: the same term receives little (or negative) weight, discouraging unproductive behaviors.
This probabilistic balancing ensures that the agent learns to favor trajectories that maximize rewards while discarding less effective ones.
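Concretely, once an estimate of this gradient is available, the policy parameters are updated by gradient ascent on U(θ), where α below is a step-size hyperparameter:

$$
\theta \leftarrow \theta + \alpha \, \nabla_\theta U(\theta)
$$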
3. Derivation of the Policy Gradient
To derive the policy gradient, start with the expected reward definition U(θ):
$$U(\theta) = \sum_{\tau} P_\theta(\tau) R(\tau)$$
where Pθ(τ) is the probability of a trajectory τ under the policy πθ, and R(τ) is the cumulative reward for the trajectory.
The gradient of U(θ):
$$\nabla_\theta U(\theta) = \nabla_\theta \sum_{\tau} P_\theta(\tau) R(\tau)$$
Using the log-likelihood trick, we rewrite ∇θPθ(τ) as:
$$\nabla_\theta P_\theta(\tau) = P_\theta(\tau) \nabla_\theta \log P_\theta(\tau)$$
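This identity is just the chain rule applied to the logarithm; dividing ∇θPθ(τ) by Pθ(τ) and rearranging gives the form above:

$$
\nabla_\theta \log P_\theta(\tau) = \frac{\nabla_\theta P_\theta(\tau)}{P_\theta(\tau)}
$$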
Moving the gradient inside the sum and substituting this identity into the gradient of U(θ), we get:
$$\nabla_\theta U(\theta) = \sum_{\tau} P_\theta(\tau) \nabla_\theta \log P_\theta(\tau) R(\tau)$$
Rewriting this as an expectation:
Each term in the sum is weighted by the trajectory probability Pθ(τ), so the sum is exactly an expected value over trajectories drawn from Pθ, written $\mathbb{E}_{\tau \sim P_\theta}[\cdot]$. This gives:
$$\nabla_\theta U(\theta) = \mathbb{E}_{\tau \sim P_\theta} \left[ R(\tau) \nabla_\theta \log P_\theta(\tau) \right]$$
3a. The Log-Likelihood Trick
The log-likelihood trick enables efficient computation of the policy gradient. Instead of summing over all possible trajectories (which is computationally intractable), we:
- Estimate the gradient using sampled trajectories from the current policy πθ.
- Compute the gradient as the expected value of R(τ)∇θlogPθ(τ), effectively weighting the log-probability gradient of each sampled trajectory by its reward.
This trick makes policy gradient methods practical for high-dimensional and continuous environments, as it avoids the need to compute the full trajectory distribution.
3b. Practical Implementation
To make this computation feasible in practice, the policy gradient is estimated from sampled trajectories:
$$\nabla_\theta U(\theta) \approx \frac{1}{N} \sum_{i=1}^N R(\tau^{(i)}) \nabla_\theta \log P_\theta(\tau^{(i)}), $$
where N is the number of sampled trajectories τ(i). This approach avoids computing the full trajectory distribution Pθ(τ), which is intractable in large environments.
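To make the estimator concrete, here is a minimal sketch of how it is commonly computed with automatic differentiation. It assumes PyTorch, a classic Gym-style environment interface (`reset()` returning a state, `step(action)` returning `(state, reward, done, info)`), and a small categorical policy network; the class and function names and the architecture are illustrative choices, not part of the derivation above. The trajectory log-probability is accumulated from the per-step policy log-probabilities πθ(ut|st), which, as Section 4 shows, is all that is needed because the dynamics terms do not depend on θ.

```python
import torch
import torch.nn as nn

class CategoricalPolicy(nn.Module):
    # A small policy network pi_theta(u | s) over a discrete action space.
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

def policy_gradient_surrogate(env, policy, n_trajectories=16, horizon=200):
    # Builds a surrogate loss whose gradient equals the LRPG estimate
    #   (1/N) * sum_i R(tau_i) * grad_theta log P_theta(tau_i),
    # negated so that minimizing the loss performs gradient ascent on U(theta).
    losses = []
    for _ in range(n_trajectories):
        obs = env.reset()
        log_probs, rewards = [], []
        for _ in range(horizon):
            dist = policy(torch.as_tensor(obs, dtype=torch.float32))
            action = dist.sample()
            log_probs.append(dist.log_prob(action))
            obs, reward, done, _ = env.step(action.item())
            rewards.append(reward)
            if done:
                break
        total_reward = sum(rewards)  # R(tau_i)
        losses.append(-total_reward * torch.stack(log_probs).sum())
    return torch.stack(losses).mean()
```

A training loop would call this function, run `loss.backward()`, and take an optimizer step, which applies the sampled-gradient update to θ.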
4. Temporal Decomposition
In many environments, rewards are localized rather than spanning an entire trajectory. Temporal decomposition breaks the trajectory into individual steps to focus on local actions and rewards.
4a. Gradient of the Trajectory
The probability of a trajectory $\tau = (s_0, u_0, s_1, u_1, \dots, s_H, u_H)$ under the policy $\pi_\theta$ is given by:
$$
P_\theta(\tau) = P(s_0) \prod_{t=0}^{H-1} P(s_{t+1} \mid s_t, u_t)\, \pi_\theta(u_t \mid s_t),
$$
where P(st+1 | st, ut) is the environment dynamics, which do not depend on θ. Taking the logarithm turns the product into a sum of log-terms, and the terms involving P(s0) and the dynamics vanish under ∇θ, so the gradient of the log-probability simplifies to:
$$
\nabla_\theta \log P_\theta(\tau) = \sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta(u_t | s_t).
$$
This decomposition isolates the gradient to the policy, avoiding dependence on the environment's dynamics.
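Combining this decomposition with the sampled-trajectory estimate from Section 3b gives the form that is typically implemented in practice, where $s_t^{(i)}$ and $u_t^{(i)}$ denote the states and actions of the i-th sampled trajectory:

$$
\nabla_\theta U(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} R(\tau^{(i)}) \sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta\left(u_t^{(i)} \mid s_t^{(i)}\right)
$$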
5. Key Takeaways
The Likelihood Ratio Policy Gradient (LRPG) is a foundational framework in reinforcement learning for optimizing policies by directly computing gradients of the objective function. This approach emphasizes rewarding trajectories, making it adaptable and efficient for a wide range of tasks.
Key Strengths:
- Independence from Reward Function Differentiability: LRPG does not require the reward function to be differentiable. The gradient involves only ∇θlogPθ(τ), with the reward entering as a scalar weight, making this method ideal for environments with sparse or discontinuous rewards.
- Flexibility Across Diverse Tasks: The method handles arbitrary reward structures, whether continuous, discontinuous, or sparse, enabling its application across a broad spectrum of reinforcement learning problems.
Core Insight: Policy-Centric Learning
The Likelihood Ratio Policy Gradient focuses solely on improving the policy by increasing the likelihood of desirable trajectories. Since the environment’s dynamics are independent of θ, there is no need to model or differentiate the environment. Instead, all updates are directed at refining the policy.
This decomposition simplifies policy gradient algorithms, making them scalable and effective for tasks where modeling the environment is impractical or unnecessary.
6. Applications
Policy gradient methods form the backbone of many reinforcement learning algorithms and are widely applicable:
- Core of Policy-Based Algorithms: Foundational to Vanilla Policy Gradient (VPG), REINFORCE, A2C, PPO, and others.
- High-Dimensional Action Spaces: Ideal for tasks like robotics or self-driving cars.
- Built-in Exploration: Stochastic policies enable better exploration compared to deterministic approaches.
7. Summary
The Likelihood Ratio Policy Gradient offers a mathematically sound and practical approach to reinforcement learning. By computing gradients directly, it enables iterative policy updates to prioritize rewarding trajectories without relying on environment dynamics.
This framework underpins key algorithms like VPG and REINFORCE, making it a cornerstone of modern reinforcement learning.