Likelihood Ratio Policy Gradient

Reinforcement Learning

Last updated: November 25, 2024

1. Introduction

Policy gradient methods aim to optimize the parameters θ of a policy πθ (often represented by a neural network) to maximize the expected cumulative reward. Unlike value-based methods, which rely on estimating value functions, policy gradient methods directly adjust the policy parameters to favor rewarding trajectories.

The Likelihood Ratio Policy Gradient (LRPG) is a key foundation of policy gradient algorithms. By leveraging the log-likelihood trick, LRPG enables efficient computation of gradients, making it practical for optimizing policies in complex environments.

2. Objective

The primary goal of policy gradient methods is to find the optimal parameters θ that maximize the expected reward U(θ):

$$\theta^* = \arg\max_\theta U(\theta), $$

where the expected reward U(θ) is defined as:

$$U(\theta) = \sum_{\tau} P_\theta(\tau) R(\tau), $$

with:

  • Pθ(τ): the probability of trajectory τ under the policy πθ,
  • R(τ): the cumulative reward collected along τ.

The Likelihood Ratio Policy Gradient focuses on modifying Pθ(τ) to increase the likelihood of high-reward trajectories.

2b. Intuition

The core idea is to compute the gradient of U(θ) with respect to θ and use it to iteratively update the policy πθ to favor better trajectories. The policy gradient adjusts the probabilities of trajectories based on their rewards:

  • trajectories with high rewards become more likely,
  • trajectories with low rewards become less likely.

This probabilistic balancing ensures that the agent learns to favor trajectories that maximize rewards while de-emphasizing less effective ones.

3. Derivation of the Policy Gradient 

To derive the policy gradient, start with the expected reward definition U(θ):

$$U(\theta) = \sum_{\tau} P_\theta(\tau) R(\tau)$$

where Pθ(τ) is the probability of a trajectory τ under the policy πθ, and R(τ) is the cumulative reward for the trajectory.

The gradient of U(θ) is:

$$\nabla_\theta U(\theta) = \nabla_\theta \sum_{\tau} P_\theta(\tau) R(\tau)$$

Using the log-likelihood trick, we rewrite ∇θ Pθ(τ) as:

$$\nabla_\theta P_\theta(\tau) = P_\theta(\tau) \nabla_\theta \log P_\theta(\tau)$$
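This identity is simply the chain rule for the logarithm, rearranged:

$$\nabla_\theta \log P_\theta(\tau) = \frac{\nabla_\theta P_\theta(\tau)}{P_\theta(\tau)} \quad\Longrightarrow\quad P_\theta(\tau)\, \nabla_\theta \log P_\theta(\tau) = \nabla_\theta P_\theta(\tau)$$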

Substituting this into the gradient of U(θ), we get:

$$\nabla_\theta U(\theta) = \sum_{\tau} P_\theta(\tau) \nabla_\theta \log P_\theta(\tau) R(\tau)$$

Rewriting this as an expectation:

Since the sum weights each term by Pθ(τ), it is exactly an expectation over trajectories τ ∼ Pθ, written $$\mathbb{E}_{\tau \sim P_\theta}[\cdot]$$, and the gradient simplifies to:

$$\nabla_\theta U(\theta) = \mathbb{E}_{\tau \sim P_\theta} \left[ R(\tau) \nabla_\theta \log P_\theta(\tau) \right]$$

3a. The Log-Likelihood Trick

The log-likelihood trick enables efficient computation of the policy gradient. Instead of summing over all possible trajectories (which is computationally intractable), we:

  1. Estimate the gradient using sampled trajectories from the current policy πθ.
  2. Compute the gradient as the expected value of R(τ) ∇θ log Pθ(τ), effectively weighting the log-probability gradient of each sampled trajectory by its reward.

This trick makes policy gradient methods practical for high-dimensional and continuous environments, as it avoids the need to compute the full trajectory distribution.
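As a quick numerical sanity check of the identity behind the trick, the following sketch (a toy example with a one-parameter Bernoulli "policy", not part of the lesson) compares a finite-difference estimate of ∇θ Pθ(x) against Pθ(x) ∇θ log Pθ(x):

```python
import numpy as np

# Toy check of the identity  grad P_theta(x) = P_theta(x) * grad log P_theta(x)
# for a one-parameter Bernoulli "policy": P(x = 1) = sigmoid(theta).

def prob(theta, x):
    p1 = 1.0 / (1.0 + np.exp(-theta))          # P_theta(x = 1)
    return p1 if x == 1 else 1.0 - p1

def grad_log_prob(theta, x):
    p1 = 1.0 / (1.0 + np.exp(-theta))
    return (1.0 - p1) if x == 1 else -p1       # d/dtheta log P_theta(x)

theta, eps = 0.3, 1e-6
for x in (0, 1):
    finite_diff = (prob(theta + eps, x) - prob(theta - eps, x)) / (2 * eps)
    identity = prob(theta, x) * grad_log_prob(theta, x)
    print(f"x={x}: finite-difference {finite_diff:.6f}  vs  P * grad log P {identity:.6f}")
```

The two columns agree up to finite-difference error, which is exactly what lets us trade the intractable sum over trajectories for an average over samples.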

3b. Practical Implementation

In practice, the policy gradient is estimated from trajectories sampled under the current policy:

$$\nabla_\theta U(\theta) \approx \frac{1}{N} \sum_{i=1}^N R(\tau^{(i)}) \nabla_\theta \log P_\theta(\tau^{(i)}), $$

where N is the number of sampled trajectories τ(i). This approach avoids computing the full trajectory distribution Pθ(τ), which is intractable in large environments.
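A minimal sketch of this estimator in PyTorch, assuming a small categorical policy network and Gymnasium's CartPole-v1 environment (the network architecture, environment, and hyperparameters below are illustrative choices, not part of the lesson):

```python
import gymnasium as gym
import torch

env = gym.make("CartPole-v1")
obs_dim, n_actions = env.observation_space.shape[0], env.action_space.n

# Policy network pi_theta(u | s): maps a state to logits over discrete actions.
policy = torch.nn.Sequential(
    torch.nn.Linear(obs_dim, 64),
    torch.nn.Tanh(),
    torch.nn.Linear(64, n_actions),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

def sample_trajectory(horizon=500):
    """Roll out the current policy; return R(tau) and the theta-dependent part of log P_theta(tau)."""
    obs, _ = env.reset()
    log_prob_sum = torch.zeros(())
    total_reward = 0.0
    for _ in range(horizon):
        dist = torch.distributions.Categorical(
            logits=policy(torch.as_tensor(obs, dtype=torch.float32))
        )
        action = dist.sample()
        log_prob_sum = log_prob_sum + dist.log_prob(action)  # accumulate log pi_theta(u_t | s_t)
        obs, reward, terminated, truncated, _ = env.step(action.item())
        total_reward += reward
        if terminated or truncated:
            break
    return total_reward, log_prob_sum

# One gradient step: (1/N) * sum_i R(tau_i) * grad log P_theta(tau_i).
N = 16
returns, log_probs = zip(*(sample_trajectory() for _ in range(N)))
loss = -(torch.tensor(returns) * torch.stack(log_probs)).mean()  # minus sign: ascent on U(theta)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The negative of the reward-weighted log-probability average is used as a loss so that a standard optimizer performing gradient descent implements gradient ascent on U(θ).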

4. Temporal Decomposition

In many environments, rewards are localized rather than spanning an entire trajectory. Temporal decomposition breaks the trajectory into individual steps to focus on local actions and rewards.

4a. Gradient of the Trajectory

The probability of a trajectory τ = {s0, u0, s1, u1, …, sH} under the policy πθ is:

$$
P_\theta(\tau) = P(s_0) \prod_{t=0}^{H-1} P(s_{t+1} \mid s_t, u_t)\, \pi_\theta(u_t \mid s_t)
$$

where P(st+1 | st, ut) describes the environment dynamics, which, like the initial-state distribution P(s0), does not depend on θ. The gradient of the log-probability therefore simplifies to:

$$
 \nabla_\theta \log P_\theta(\tau) = \sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta(u_t | s_t). 
$$

This decomposition isolates the gradient to the policy, avoiding dependence on the environment's dynamics.
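To see why the dynamics terms drop out, take the log of the trajectory probability before differentiating:

$$\log P_\theta(\tau) = \log P(s_0) + \sum_{t=0}^{H-1} \log P(s_{t+1} \mid s_t, u_t) + \sum_{t=0}^{H-1} \log \pi_\theta(u_t \mid s_t)$$

Only the last sum depends on θ, so the first two groups of terms vanish under ∇θ, leaving the per-step policy gradients above.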

5. Key Takeaways

The Likelihood Ratio Policy Gradient (LRPG) is a foundational framework in reinforcement learning for optimizing policies by directly computing gradients of the objective function. This approach emphasizes rewarding trajectories, making it adaptable and efficient for a wide range of tasks.

Key Strengths:

  1. Independence from Reward Function Differentiability:

    • LRPG does not require the reward function to be differentiable: the gradient only involves ∇θ log πθ, with R(τ) entering as a scalar weight. This makes the method well suited to environments with sparse or discontinuous rewards.
  2. Flexibility Across Diverse Tasks:

    • The method handles arbitrary reward structures, whether they are continuous, discontinuous, or sparse, enabling its application across a broad spectrum of reinforcement learning problems.

Core Insight: Policy-Centric Learning

The Likelihood Ratio Policy Gradient focuses solely on improving the policy by increasing the likelihood of desirable trajectories. Since the environment’s dynamics are independent of θ, there is no need to model or differentiate the environment. Instead, all updates are directed at refining the policy.

This decomposition simplifies policy gradient algorithms, making them scalable and effective for tasks where modeling the environment is impractical or unnecessary.

6. Applications

Policy gradient methods form the backbone of many reinforcement learning algorithms and are widely applicable:

  1. Core of Policy-Based Algorithms: Foundational to Vanilla Policy Gradient (VPG), REINFORCE, A2C, PPO, and others.
  2. High-Dimensional Action Spaces: Ideal for tasks like robotics or self-driving cars.
  3. Built-in Exploration: Stochastic policies enable better exploration compared to deterministic approaches.

7. Summary

The Likelihood Ratio Policy Gradient offers a mathematically sound and practical approach to reinforcement learning. By computing gradients directly, it enables iterative policy updates to prioritize rewarding trajectories without relying on environment dynamics.

This framework underpins key algorithms like VPG and REINFORCE, making it a cornerstone of modern reinforcement learning.
