The Likelihood Ratio Gradient in Practice
Last updated: December 31, 2024
1. Introduction
We now zoom in on practical aspects of implementing the Likelihood Ratio Gradient Estimate with neural networks. We’ll see how to:
- Sample trajectories in an environment like Gymnasium’s Lunar Lander.
- Compute log-probabilities of actions via the policy network in PyTorch.
- Weight these log-probabilities by the return (or advantage) to form the gradient estimate.
- Update $\theta$ via backpropagation.
2. Neural Network-Based Policy
- Inputs: State $s_t$ (e.g., position, velocity, angles in Lunar Lander).
- Outputs: Probability distribution over actions (discrete or continuous).
- Discrete: Softmax over actions (Lunar Lander's do-nothing, left-engine, main-engine, and right-engine actions); see the sketch after this list.
- Continuous: Mean and variance parameters for a Gaussian (common in advanced robotics tasks).
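For the discrete case, a minimal sketch of such a policy network in PyTorch might look as follows. It assumes Lunar Lander's 8-dimensional observation and 4 discrete actions; the class name, hidden width, and activation are illustrative choices, not requirements.

```python
import torch
import torch.nn as nn

# Minimal sketch of a discrete policy network (names and sizes are illustrative).
class PolicyNetwork(nn.Module):
    def __init__(self, obs_dim=8, n_actions=4, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):
        logits = self.net(obs)  # unnormalized score for each action
        # Categorical applies the softmax internally and exposes sample()/log_prob().
        return torch.distributions.Categorical(logits=logits)
```

Returning a `Categorical` distribution rather than raw probabilities keeps sampling and log-probability computation in one place, which the gradient computation in the next section relies on.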
3. Computing the Policy Gradient
For each sampled trajectory $\tau^{(i)}$:
- Roll Out: Step the environment using the current policy $\pi_\theta$ until done.
- Accumulate Rewards: $R(\tau^{(i)}) = \sum_{t=0}^{H-1} r_t$.
- Compute Gradients: $$\left[\nabla_\theta \sum_{t=0}^{H-1}\log\pi_\theta(a_t\mid s_t)\right] R(\tau^{(i)}).$$
- Update $\theta$: Combine gradients across all sampled trajectories and take a gradient step with an optimizer (e.g., Adam); a single-trajectory version is sketched below.
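Putting these steps together, here is a simplified single-trajectory sketch. It assumes a Gymnasium `env`, the `PolicyNetwork` above as `policy`, and an `optimizer` such as `torch.optim.Adam(policy.parameters(), lr=1e-3)`; in practice you would average the loss over several sampled trajectories before stepping.

```python
import torch

# Hypothetical single-trajectory version of the four steps above.
def reinforce_update(env, policy, optimizer):
    obs, _ = env.reset()
    log_probs, rewards = [], []
    done = False
    while not done:                                   # 1. roll out pi_theta until done
        dist = policy(torch.as_tensor(obs, dtype=torch.float32))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))       # store log pi_theta(a_t | s_t)
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated

    trajectory_return = sum(rewards)                  # 2. R(tau) = sum of rewards
    # 3. surrogate loss whose gradient matches the likelihood ratio estimate;
    #    the minus sign turns gradient ascent into minimization for the optimizer.
    loss = -torch.stack(log_probs).sum() * trajectory_return
    optimizer.zero_grad()
    loss.backward()                                   # backpropagate through log-probs
    optimizer.step()                                  # 4. update theta
    return trajectory_return
```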
4. Addressing High Variance
- Reward Baselines: Subtract a baseline $b$ from the returns, so updates focus on advantage over $b$.
- Advantage Estimation (Actor-Critic): Replace $R(\tau)$ with $\hat{A}_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$.
- Entropy Regularization: Encourage exploration by penalizing low-entropy policies (see the combined sketch after this list).
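As a rough illustration of how these pieces combine, the sketch below mixes a one-step advantage (via a small value network $V_\phi$), an entropy bonus, and a critic loss. The tensor names (`log_probs`, `entropies`, `rewards`, `states`, `next_states`, `dones`), the network sizes, and the coefficients are assumptions for illustration only.

```python
import torch
import torch.nn as nn

# Illustrative value network V_phi for an 8-dimensional observation (assumption).
value_net = nn.Sequential(nn.Linear(8, 128), nn.Tanh(), nn.Linear(128, 1))

def actor_critic_loss(log_probs, entropies, rewards, states, next_states, dones,
                      gamma=0.99, entropy_coef=0.01):
    values = value_net(states).squeeze(-1)
    with torch.no_grad():                      # bootstrap target is treated as fixed
        next_values = value_net(next_states).squeeze(-1)
    # One-step advantage estimate: A_t = r_t + gamma * V(s_{t+1}) - V(s_t),
    # with the bootstrap term masked out at terminal states.
    advantages = rewards + gamma * next_values * (1.0 - dones) - values
    # Policy term: the advantage acts as a fixed (detached) weight on log pi_theta.
    policy_loss = -(log_probs * advantages.detach()).mean()
    # Entropy bonus: penalize low-entropy (near-deterministic) policies.
    entropy_loss = -entropy_coef * entropies.mean()
    # Critic term: train V_phi to shrink the one-step TD error.
    value_loss = advantages.pow(2).mean()
    return policy_loss + entropy_loss + value_loss
```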
5. Limitations and Next Steps
- High Variance: Pure policy gradient can be unstable, especially in large environments.
- Sample Inefficiency: On-policy methods may require many rollouts.
- Advanced Solutions:
  - A2C / PPO reduce variance and improve sample efficiency.
  - TRPO / PPO ensure stable updates with trust-region-like constraints.
6. Summary
Implementing the Likelihood Ratio Gradient is straightforward in PyTorch:
- Build a policy network.
- Roll out trajectories.
- Compute returns (or advantages).
- Take gradient steps.
We’ll expand on these ideas in code, applying them to tasks like Gymnasium’s Lunar Lander to see how an on-policy RL agent learns to land safely.