Policy Gradient Estimation with Likelihood Ratios

Reinforcement Learning

Last updated: November 25, 2024

1. Introduction

In this lesson, we delve into the Likelihood Ratio Gradient Estimate, a pivotal concept in reinforcement learning's policy gradient methods. This approach allows us to optimize policies by computing gradients of the expected reward with respect to policy parameters, all without requiring knowledge of the environment's dynamics. By leveraging the log-likelihood of actions and reward-weighted gradients, this method provides a robust framework for model-free policy optimization in complex settings.

2a. Objective

Our goal is to maximize the expected return J(θ), the expected total reward over trajectories sampled from the policy πθ, by adjusting the policy parameters θ. This is achieved by computing the gradient of the expected return with respect to θ, which can be expressed as:

$$
 \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \nabla_\theta \log P(\tau; \theta) R(\tau) \right] 
$$

Where:

  • τ is a trajectory (a sequence of states and actions) obtained by executing the policy,
  • πθ is the policy parameterized by θ,
  • P(τ; θ) is the probability of that trajectory under the policy parameters θ,
  • R(τ) is the total reward accumulated along the trajectory.
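
This identity is the likelihood ratio (score function) trick: since ∇θ P(τ; θ) = P(τ; θ) ∇θ log P(τ; θ), the gradient can be moved inside the expectation:

$$
 \nabla_\theta J(\theta) = \nabla_\theta \int P(\tau; \theta) R(\tau) \, d\tau = \int P(\tau; \theta) \, \nabla_\theta \log P(\tau; \theta) \, R(\tau) \, d\tau = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \nabla_\theta \log P(\tau; \theta) R(\tau) \right]
$$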

2b. Decomposing the Gradient

$$
 \nabla_\theta \log P(\tau; \theta) = \sum_{t=0}^H \nabla_\theta \log \pi_\theta(u_t | s_t) 
$$

Thus, ∇θ log P(τ; θ) reduces to the sum of the gradients of the log-probabilities of the actions ut taken at each timestep t, conditioned on the corresponding states st.
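
This decomposition holds because the trajectory probability factors into the initial-state distribution, the policy, and the transition dynamics; only the policy terms depend on θ, so the dynamics vanish when the gradient of the log is taken:

$$
 P(\tau; \theta) = P(s_0) \prod_{t=0}^{H} \pi_\theta(u_t | s_t) \, P(s_{t+1} | s_t, u_t) \quad \Rightarrow \quad \nabla_\theta \log P(\tau; \theta) = \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta(u_t | s_t)
$$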

2c. Reward-weighted Gradient

The policy gradient is estimated by weighting each trajectory's log-probability gradient (the sum of the per-action log-policy gradients) by its total reward R(τ) and averaging over m sampled trajectories. This yields an unbiased estimate of the gradient that can be computed without access to a dynamics model:

$$
 \hat{g} = \frac{1}{m} \sum_{i=1}^m \nabla_\theta \log P(\tau^{(i)}; \theta) R(\tau^{(i)}) 
$$

Where:

  • m is the number of trajectories sampled under the current policy,
  • τ^(i) is the i-th sampled trajectory,
  • R(τ^(i)) is the total reward collected along trajectory i.
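
As a minimal sketch of this estimator (assuming the per-trajectory quantities ∇θ log P(τ^(i); θ) and R(τ^(i)) have already been computed, for example via backpropagation as described in the next subsection), the estimate is just an average of reward-weighted gradients:

```python
import numpy as np

def likelihood_ratio_gradient(grad_log_probs, returns):
    """Sample-based likelihood ratio gradient estimate g_hat.

    grad_log_probs: (m, d) array; row i holds grad_theta log P(tau_i; theta),
        i.e. the summed per-action log-policy gradients for trajectory i.
    returns: (m,) array of total rewards R(tau_i).
    """
    grad_log_probs = np.asarray(grad_log_probs)
    returns = np.asarray(returns)
    # g_hat = (1/m) * sum_i grad_theta log P(tau_i; theta) * R(tau_i)
    return (grad_log_probs * returns[:, None]).mean(axis=0)
```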

2d. Neural Network-based Policy

In practice, the policy πθ is parameterized by a neural network, and the gradient ∇θ log πθ(ut | st) can be computed via backpropagation through the network.

For each trajectory:

  1. Roll out the policy (i.e., sample a trajectory based on the current policy).
  2. For each state-action pair (st, ut) in the trajectory:
    • Compute ∇θ log πθ(ut | st).
    • Scale this gradient by the total reward of that trajectory.
  3. Accumulate gradients across all trajectories to form the policy gradient estimate (a minimal sketch of this computation follows the list).
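
A minimal sketch of the per-trajectory computation, assuming a small categorical policy network in PyTorch (the architecture, state dimension, and action count below are illustrative placeholders, not part of the lesson):

```python
import torch
import torch.nn as nn

# Illustrative categorical policy: maps a 4-dimensional state to logits over 2 actions.
policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))

def trajectory_gradient(states, actions, total_reward):
    """grad_theta [ R(tau) * sum_t log pi_theta(u_t | s_t) ] for a single trajectory.

    states: (T, 4) float tensor, actions: (T,) long tensor, total_reward: float R(tau).
    """
    log_probs = torch.log_softmax(policy(states), dim=-1)          # log pi_theta(. | s_t)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)  # log pi_theta(u_t | s_t)
    weighted = total_reward * chosen.sum()                         # scale by R(tau)
    return torch.autograd.grad(weighted, list(policy.parameters()))
```

Averaging these per-trajectory gradients over m rollouts gives the sample estimate ĝ from Section 2c.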

2e. Likelihood Ratio Policy Gradient

This method is often referred to as the likelihood ratio policy gradient because it relies on the log-likelihood of actions under the policy:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ R(\tau) \sum_{t=0}^H \nabla_\theta \log \pi_\theta(u_t | s_t) \right] 
$$

The key advantage is that this approach avoids modeling the environment’s dynamics, focusing only on the policy's behavior.

2f. Practical Steps for Gradient Computation

  1. Roll out the policy: Sample trajectories using the current policy πθ.
  2. Collect data: Record the states st, actions ut, and total reward R(τ) for each trajectory.
  3. Compute gradients: For each trajectory, compute ∇θ log πθ(ut | st) at every timestep.
  4. Scale by rewards: Multiply each gradient by the corresponding total reward R(τ).
  5. Update policy: Accumulate the gradients across all trajectories and use the result to update the policy parameters θ (a compact end-to-end sketch follows this list).
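
Putting the five steps together, here is a compact sketch of a single update, assuming the same illustrative categorical policy as in the Section 2d sketch (redefined so the snippet stands alone) and plain stochastic gradient descent on the negated objective, which is equivalent to gradient ascent on J(θ); the rollout code that collects the trajectories is omitted:

```python
import torch
import torch.nn as nn

# Illustrative categorical policy and a simple SGD optimizer.
policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))
optimizer = torch.optim.SGD(policy.parameters(), lr=1e-2)

def policy_gradient_update(trajectories):
    """One likelihood ratio policy gradient step.

    trajectories: list of (states, actions, total_reward) tuples collected
    by rolling out the current policy (step 1) and recording data (step 2).
    """
    optimizer.zero_grad()
    loss = torch.zeros(())
    for states, actions, total_reward in trajectories:
        log_probs = torch.log_softmax(policy(states), dim=-1)          # step 3
        chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
        loss = loss - total_reward * chosen.sum()                      # step 4, negated for descent
    (loss / len(trajectories)).backward()                              # average over trajectories
    optimizer.step()                                                   # step 5: update theta
```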

This framework ensures an efficient and model-free gradient estimation process, enabling robust policy optimization.

2g. Conclusion

By following this method, we can iteratively improve the policy to maximize the expected reward over time, without needing explicit knowledge of the environment's dynamics.

3. Limitations and Improvements

3a. Limitations

While the policy gradient method is effective, it has significant challenges:

  1. Unbiased but Noisy:
    The gradient estimate is unbiased but suffers from high variance due to its sample-based nature. This can slow learning and lead to suboptimal updates.

  2. Real-World Fixes:
    To improve its practicality, several techniques are introduced:

    • Baseline Subtraction: Reducing variance without affecting bias.
    • Temporal Structure: Leveraging the sequential nature of the problem for better estimates.
    • Advanced Techniques: Trust region methods and natural gradients, which will be covered in the next lecture, ensure more stable updates.

3b. Improving Policy Gradient Practicality

Although unbiased, policy gradient methods can be noisy and imprecise with limited samples. To enhance their real-world applicability:

  1. Baselines:

    • A baseline is subtracted from the reward to measure the relative advantage of actions.
    • This reduces variance while keeping the gradient estimate unbiased. For example, using an estimate of the expected return (such as the average return across sampled trajectories, or a learned value function) as the baseline improves learning efficiency (see the sketch after this list).
  2. Temporal Structure:

    • By exploiting the temporal structure of the problem (an action taken at time t can only affect rewards received from time t onward), each action's log-probability gradient can be weighted by the reward-to-go rather than the full trajectory return, further reducing variance.
  3. Next Steps:

    • Future lectures will explore trust region methods and natural gradients, advanced techniques designed to stabilize and accelerate policy updates in complex environments.
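
As a minimal illustration of baseline subtraction, assuming the simplest possible choice of baseline, the average return across the sampled batch (a learned value-function baseline would follow the same pattern):

```python
import numpy as np

def baselined_gradient(grad_log_probs, returns):
    """Likelihood ratio gradient with a constant baseline b = mean return.

    Subtracting a baseline that does not depend on the actions leaves the
    estimate unbiased but typically reduces its variance.
    """
    grad_log_probs = np.asarray(grad_log_probs)   # (m, d), as in Section 2c
    returns = np.asarray(returns)                 # (m,)
    advantages = returns - returns.mean()         # R(tau_i) - b
    return (grad_log_probs * advantages[:, None]).mean(axis=0)
```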

4. Summary

The Likelihood Ratio Gradient Estimate focuses on maximizing the expected return by iteratively adjusting policy parameters. The key insights include:

  1. Gradient Computation:
    The gradient is computed as the sum of log-probability gradients for each action, scaled by the total reward. This provides an unbiased estimate of the gradient.

  2. Neural Network-based Policies:
    Policies are often parameterized as neural networks, where gradients are calculated using backpropagation. These gradients are then aggregated to update the policy parameters.

  3. Model-Free Optimization:
    This method requires no environment dynamics model, relying solely on the sampled trajectories and policy behavior.

  4. Challenges and Improvements:
    Despite its advantages, policy gradient methods suffer from high variance, which can hinder efficient learning. Techniques like baseline subtraction and leveraging temporal structure can mitigate these issues. Advanced methods such as trust region and natural gradients further enhance stability and convergence, topics that will be explored in subsequent lessons.

This lesson establishes the theoretical foundation and practical workflow for applying policy gradients effectively, setting the stage for advanced optimization techniques in reinforcement learning.
