Policy Gradient Estimation with Likelihood Ratios
Last updated: November 25, 2024
1. Introduction
In this lesson, we delve into the Likelihood Ratio Gradient Estimate, a pivotal concept in reinforcement learning's policy gradient methods. This approach allows us to optimize policies by computing gradients of the expected reward with respect to policy parameters, all without requiring knowledge of the environment's dynamics. By leveraging the log-likelihood of actions and reward-weighted gradients, this method provides a robust framework for model-free policy optimization in complex settings.
2a. Objective
Our goal is to maximize the expected return by adjusting the policy parameters θ. This is achieved by computing the gradient of the expected reward with respect to θ, expressed as:
$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \nabla_\theta \log P(\tau; \theta) R(\tau) \right]
$$
Where:
- P(τ;θ): The probability of a trajectory.
- R(τ): The total reward along that trajectory.
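This expression can be derived in one step via the likelihood ratio (score function) trick: assuming we may exchange differentiation and integration, and using ∇_θ P(τ; θ) = P(τ; θ) ∇_θ log P(τ; θ),
$$
\nabla_\theta J(\theta) = \nabla_\theta \int P(\tau; \theta) \, R(\tau) \, d\tau = \int P(\tau; \theta) \, \nabla_\theta \log P(\tau; \theta) \, R(\tau) \, d\tau = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \nabla_\theta \log P(\tau; \theta) \, R(\tau) \right]
$$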
2b. Decomposing the Gradient
- The trajectory probability factors into environment dynamics terms and policy terms; the dynamics terms do not depend on the policy parameters θ, so their gradients vanish.
- This reduces the gradient of the log-trajectory probability to:
$$
\nabla_\theta \log P(\tau; \theta) = \sum_{t=0}^H \nabla_\theta \log \pi_\theta(u_t | s_t)
$$
Thus, the gradient can be estimated by summing the gradients of the log-probabilities of actions at each timestep t, conditioned on the corresponding states s_t.
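To make the cancellation explicit, the trajectory probability factors under the usual Markov assumption (writing μ for the initial-state distribution and p for the transition model, notation introduced here only for this step):
$$
\log P(\tau; \theta) = \log \mu(s_0) + \sum_{t=0}^{H} \log p(s_{t+1} \mid s_t, u_t) + \sum_{t=0}^{H} \log \pi_\theta(u_t \mid s_t)
$$
Only the last sum depends on θ, so differentiating with respect to θ removes the μ and p terms entirely.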
2c. Reward-weighted Gradient
The policy gradient estimate is computed as the product of the total reward R(τ) and the gradient of the log-policy for each action along the trajectory. The following expression provides us with an unbiased estimate of the gradient, and we can compute it without access to a dynamics model:
$$
\hat{g} = \frac{1}{m} \sum_{i=1}^m \nabla_\theta \log P(\tau^{(i)}; \theta) R(\tau^{(i)})
$$
Where:
- R(τ^(i)): The total reward for the i-th sampled trajectory τ^(i).
- m: The number of sampled trajectories.
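As a concrete illustration, the estimator above can be computed in a few lines. The sketch below assumes the per-trajectory quantities ∇_θ log P(τ^(i); θ) (the summed log-policy gradients) and the returns R(τ^(i)) are already available as arrays; the function and argument names are placeholders, not part of any particular library.
```python
import numpy as np

def likelihood_ratio_gradient(logp_grads, returns):
    """g_hat = (1/m) * sum_i grad_theta log P(tau_i; theta) * R(tau_i).

    logp_grads: shape (m, d) -- summed grad-log-policy for each of m trajectories
    returns:    shape (m,)   -- total reward R(tau_i) for each trajectory
    """
    logp_grads = np.asarray(logp_grads, dtype=float)
    returns = np.asarray(returns, dtype=float)
    m = len(returns)
    # Weight each trajectory's log-probability gradient by its total return,
    # then average over the m sampled trajectories.
    return (logp_grads * returns[:, None]).sum(axis=0) / m
```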
2d. Neural Network-based Policy
In practice, the policy π_θ is parameterized by a neural network. The gradient ∇_θ log π_θ(u_t | s_t) can then be computed via backpropagation through the network.
For each trajectory:
- Roll out the policy (i.e., sample a trajectory based on the current policy).
- For each state-action pair (s_t, u_t) in the trajectory:
- Compute ∇_θ log π_θ(u_t | s_t).
- Scale this gradient by the reward for that trajectory.
- Accumulate gradients across all trajectories to form the policy gradient estimate.
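The following PyTorch sketch shows one way the per-trajectory computation just described could look for a small categorical policy; the network shape, dimensions, and function name are illustrative assumptions, not a prescribed implementation.
```python
import torch
import torch.nn as nn

# Illustrative categorical policy: a small MLP mapping 4-dimensional states to 2 action logits.
policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))

def accumulate_trajectory_gradient(states, actions, total_return):
    """Accumulate R(tau) * grad_theta sum_t log pi_theta(u_t | s_t) into the policy's .grad fields.

    states: float tensor (T, 4); actions: long tensor (T,); total_return: float.
    """
    logits = policy(states)                              # (T, num_actions)
    dist = torch.distributions.Categorical(logits=logits)
    log_probs = dist.log_prob(actions)                   # log pi_theta(u_t | s_t), shape (T,)
    # Negate because optimizers minimize; .grad accumulates across trajectories
    # until an optimizer step and zero_grad() are applied.
    loss = -total_return * log_probs.sum()
    loss.backward()
```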
2e. Likelihood Ratio Policy Gradient
This method is often referred to as the likelihood ratio policy gradient because it relies on the log-likelihood of actions under the policy:
$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ R(\tau) \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta(u_t \mid s_t) \right]
$$
The key advantage is that this approach avoids modeling the environment’s dynamics, focusing only on the policy's behavior.
2f. Practical Steps for Gradient Computation
- Roll out the policy: Sample trajectories using the current policy π_θ.
- Collect data: Record the states s_t, actions u_t, and rewards that make up the total return R(τ) for each trajectory.
- Compute gradients: For each trajectory, compute ∇_θ log π_θ(u_t | s_t) at every timestep.
- Scale by rewards: Multiply each gradient by the corresponding total reward R(τ).
- Update policy: Accumulate the gradients across all trajectories and use the result to update the policy parameters θ.
This framework ensures an efficient and model-free gradient estimation process, enabling robust policy optimization.
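Putting these steps together, here is an illustrative end-to-end sketch of the loop. It is written against an assumed environment interface (reset() returning a state, step(action) returning (state, reward, done)); the network size, learning rate, and trajectory counts are placeholder choices, not recommendations.
```python
import torch
import torch.nn as nn

# Assumed interface: env.reset() -> state, env.step(a) -> (state, reward, done).
obs_dim, n_actions = 4, 2                      # illustrative dimensions
policy = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

def policy_gradient_update(env, num_trajectories=16, horizon=200):
    losses = []
    for _ in range(num_trajectories):
        state, log_probs, total_reward = env.reset(), [], 0.0
        for _ in range(horizon):
            s = torch.as_tensor(state, dtype=torch.float32)
            dist = torch.distributions.Categorical(logits=policy(s))
            action = dist.sample()                    # roll out the current policy
            log_probs.append(dist.log_prob(action))   # log pi_theta(u_t | s_t)
            state, reward, done = env.step(action.item())
            total_reward += reward
            if done:
                break
        # Reward-weighted log-likelihood for this trajectory (negated for minimization).
        losses.append(-total_reward * torch.stack(log_probs).sum())
    optimizer.zero_grad()
    torch.stack(losses).mean().backward()             # averages g_hat over the sampled trajectories
    optimizer.step()
```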
2g. Conclusion
- No dynamics model required: We only need the policy (neural network) to compute the gradients.
- Gradient is weighted by rewards: Higher reward trajectories contribute more to the policy update.
- Backpropagation: The neural network's parameters θ are adjusted through standard backpropagation based on the gradients calculated for the log-probabilities of actions.
By following this method, we can iteratively improve the policy to maximize the expected reward over time, without needing explicit knowledge of the environment's dynamics.
3. Limitations and Improvements
3a. Limitations
While the policy gradient method is effective, it has significant challenges:
- Unbiased but Noisy: The gradient estimate is unbiased but suffers from high variance due to its sample-based nature. This can slow learning and lead to suboptimal updates.
- Real-World Fixes: To improve its practicality, several techniques are introduced:
  - Baseline Subtraction: Reducing variance without affecting bias.
  - Temporal Structure: Leveraging the sequential nature of the problem for better estimates.
  - Advanced Techniques: Trust region methods and natural gradients, which will be covered in the next lecture, ensure more stable updates.
3b. Improving Policy Gradient Practicality
Although unbiased, policy gradient methods can be noisy and imprecise with limited samples. To enhance their real-world applicability:
- Baselines:
  - A baseline is subtracted from the reward to measure the relative advantage of actions.
  - This reduces variance while maintaining the unbiased nature of the gradient estimate. For example, using the expected value of the state as a baseline helps improve learning efficiency. A minimal sketch of this idea follows this list.
- Temporal Structure:
  - By leveraging the environment's temporal dynamics, we can refine the gradient estimate further. Temporal dependencies provide additional signals to optimize the policy more effectively.
- Next Steps:
  - Future lectures will explore trust region methods and natural gradients, advanced techniques designed to stabilize and accelerate policy updates in complex environments.
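To make the baseline idea concrete, here is a minimal sketch (under the same illustrative conventions as the earlier snippets) that uses the average sampled return as a constant baseline; this is one simple choice of baseline, and the function name and shapes are assumptions.
```python
import torch

def baselined_policy_loss(log_prob_sums, returns):
    """Reward-weighted log-likelihood loss with a constant (mean-return) baseline.

    log_prob_sums: tensor (m,) -- sum_t log pi_theta(u_t | s_t) per trajectory
    returns:       tensor (m,) -- total reward R(tau_i) per trajectory
    """
    baseline = returns.mean()              # b = average return over the sampled batch
    advantages = returns - baseline        # how much better each trajectory did than average
    # Detach so gradients flow only through the log-probabilities, not the baseline.
    return -(advantages.detach() * log_prob_sums).mean()
```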
4. Summary
The Likelihood Ratio Gradient Estimate focuses on maximizing the expected return by iteratively adjusting policy parameters. The key insights include:
- Gradient Computation: The gradient is computed as the sum of log-probability gradients for each action, scaled by the total reward. This provides an unbiased estimate of the gradient.
- Neural Network-based Policies: Policies are often parameterized as neural networks, where gradients are calculated using backpropagation. These gradients are then aggregated to update the policy parameters.
- Model-Free Optimization: This method requires no environment dynamics model, relying solely on sampled trajectories and the policy's behavior.
- Challenges and Improvements: Despite its advantages, policy gradient methods suffer from high variance, which can hinder efficient learning. Techniques like baseline subtraction and leveraging temporal structure can mitigate these issues. Advanced methods such as trust region and natural gradients further enhance stability and convergence, topics that will be explored in subsequent lessons.
This lesson establishes the theoretical foundation and practical workflow for applying policy gradients effectively, setting the stage for advanced optimization techniques in reinforcement learning.