The Likelihood Ratio Gradient in Practice

Deep Reinforcement Learning

Last updated: December 31, 2024

1. Introduction

We now zoom in on practical aspects of implementing the Likelihood Ratio Gradient Estimate with neural networks. We’ll see how to:

  • Sample trajectories in an environment like Gymnasium’s Lunar Lander (a minimal rollout loop is sketched after this list).
  • Compute log-probabilities of actions via the policy network in PyTorch.
  • Multiply these by the return (or advantage) to get the gradient.
  • Update $\theta$ via backpropagation.
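
As a warm-up for the first bullet, here is a minimal rollout sketch. Actions are sampled at random purely to illustrate the Gymnasium stepping API; the environment id (`LunarLander-v3`), the seed, and the trajectory storage format are illustrative assumptions and may need adjusting to your installed Gymnasium/Box2D versions.

```python
import gymnasium as gym

# Minimal rollout loop; random actions stand in for the policy network
# introduced in the next section.
env = gym.make("LunarLander-v3")   # may require: pip install "gymnasium[box2d]"

obs, info = env.reset(seed=0)
trajectory = []                    # list of (state, action, reward) tuples
done = False
while not done:
    action = env.action_space.sample()          # placeholder for pi_theta(a | s)
    next_obs, reward, terminated, truncated, info = env.step(action)
    trajectory.append((obs, action, reward))
    obs = next_obs
    done = terminated or truncated

env.close()
print(f"Episode length: {len(trajectory)}, return: {sum(r for _, _, r in trajectory):.1f}")
```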

2. Neural Network-Based Policy

  1. Inputs: State $s_t$ (e.g., position, velocity, angles in Lunar Lander).
  2. Outputs: Probability distribution over actions (discrete or continuous).
    • Discrete: Softmax over actions (in Lunar Lander: do nothing, fire the left engine, fire the main engine, or fire the right engine); a minimal PyTorch sketch follows this list.
    • Continuous: Mean and variance parameters for a Gaussian (common in advanced robotics tasks).
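
For the discrete case, a small PyTorch policy network might look like the following sketch. The layer widths, `Tanh` activations, and the 8-dimensional observation / 4-action sizes (matching Lunar Lander) are illustrative choices, not requirements.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class DiscretePolicy(nn.Module):
    """Maps a state vector to a categorical distribution over discrete actions."""

    def __init__(self, obs_dim: int = 8, n_actions: int = 4, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, hidden),
            nn.Tanh(),
            nn.Linear(hidden, n_actions),   # raw logits; Categorical normalizes them
        )

    def forward(self, obs: torch.Tensor) -> Categorical:
        return Categorical(logits=self.net(obs))

policy = DiscretePolicy()              # Lunar Lander: 8-dim state, 4 discrete actions
dist = policy(torch.zeros(8))          # action distribution for a dummy state
action = dist.sample()
log_prob = dist.log_prob(action)       # log pi_theta(a | s), used in the gradient below
```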

3. Computing the Policy Gradient

For each sampled trajectory $\tau^{(i)}$:

  1. Roll Out: Step the environment using the current policy $\pi_\theta$ until done.
  2. Accumulate Rewards: $R(\tau^{(i)}) = \sum_{t=0}^{H-1} r_t$.
  3. Compute Gradients: $$\nabla_\theta \sum_{t=0}^{H-1}\log\pi_\theta(a_t\mid s_t)\; \times\; R(\tau^{(i)}).$$
  4. Update $\theta$: Combine gradients across all sampled trajectories and take a gradient step with an optimizer (e.g., Adam), as sketched below.
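
Putting the four steps together, one REINFORCE-style update might look like the sketch below. It assumes trajectories stored as `(state, action, reward)` tuples with integer action indices (as in the rollout loop above) and the `DiscretePolicy` from Section 2; the function name and the learning rate are arbitrary choices.

```python
import numpy as np
import torch

optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)   # lr is an arbitrary choice

def reinforce_update(trajectories):
    """One gradient step from a batch of trajectories, each a list of (state, action, reward)."""
    losses = []
    for traj in trajectories:
        states = torch.as_tensor(np.stack([s for s, _, _ in traj]), dtype=torch.float32)
        actions = torch.as_tensor([a for _, a, _ in traj])
        total_return = sum(r for _, _, r in traj)         # R(tau): sum of rewards along the rollout

        log_probs = policy(states).log_prob(actions)      # log pi_theta(a_t | s_t), one per step
        # Minimizing -(sum_t log pi) * R(tau) makes autograd produce the
        # likelihood ratio gradient estimate for this trajectory.
        losses.append(-log_probs.sum() * total_return)

    loss = torch.stack(losses).mean()                     # average over the sampled trajectories
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```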

4. Addressing High Variance

  1. Reward Baselines: Subtract a baseline $b$ from the returns, so updates focus on advantage over $b$.
  2. Advantage Estimation (Actor-Critic): Replace $R(\tau)$ with $\hat{A}_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$.
  3. Entropy Regularization: Encourage exploration by penalizing low-entropy policies.
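
As a sketch of the first and third ideas, the loss below subtracts the batch-mean return as a crude baseline and adds an entropy bonus. The helper name, argument layout, and the `entropy_coef` value are hypothetical; an actor-critic variant would replace the mean-return baseline with a learned $V_\phi$.

```python
import torch

def pg_loss_with_baseline(log_prob_sums, returns, entropies, entropy_coef=0.01):
    """Policy-gradient loss over a batch of trajectories.

    log_prob_sums: sum_t log pi_theta(a_t | s_t) per trajectory
    returns:       R(tau) per trajectory
    entropies:     mean policy entropy per trajectory
    """
    # Subtracting a baseline leaves the gradient unbiased but can shrink its variance.
    advantages = returns - returns.mean()
    pg_term = -(log_prob_sums * advantages).mean()
    # Penalizing low entropy keeps the policy stochastic enough to explore.
    entropy_term = -entropy_coef * entropies.mean()
    return pg_term + entropy_term
```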

5. Limitations and Next Steps

  1. High Variance: Pure policy gradient can be unstable, especially in large environments.
  2. Sample Inefficiency: On-policy methods may require many rollouts.
  3. Advanced Solutions:
    • A2C reduces variance with a learned critic baseline and can improve sample efficiency.
    • TRPO / PPO ensure stable updates with trust-region-like constraints.

6. Summary

Implementing the Likelihood Ratio Gradient is straightforward in PyTorch:

  1. Build a policy network.
  2. Roll out trajectories.
  3. Compute returns (or advantages).
  4. Take gradient steps.

We’ll expand on these ideas in code, applying them to tasks like Gymnasium’s Lunar Lander to see how an on-policy RL agent learns to land safely.
