Vanilla Policy Gradient Algorithm

Reinforcement Learning

Last updated: December 15, 2024

1. Introduction

In this lesson, we explore the Vanilla Policy Gradient (VPG) Algorithm, one of the foundational approaches in reinforcement learning for directly optimizing policies. Unlike value-based methods that learn value functions to guide action selection, policy gradient methods like VPG aim to optimize the policy itself by estimating the gradient of the expected reward with respect to the policy parameters.

We will dive into the key mathematical components of the algorithm, understanding how the policy gradient is derived and computed. This includes the use of trajectories to estimate the policy gradient, the role of the log-likelihood function, and the importance of the expected cumulative reward.
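
To preview the key identity, the policy gradient can be written, using the log-likelihood (score function) trick, as an expectation over trajectories sampled from the current policy:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, R(\tau) \right]
$$

where R(τ) is the cumulative reward of trajectory τ. The algorithm described below estimates this expectation from sampled rollouts and replaces R(τ) with an advantage estimate to reduce variance.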

By the end of this lesson, you will have a clear understanding of how the Vanilla Policy Gradient Algorithm works, the underlying mathematical principles, and its application to reinforcement learning tasks.

2. What Does This Algorithm Offer?

The Vanilla Policy Gradient (VPG) Algorithm provides a systematic way to iteratively improve a policy by performing stochastic gradient ascent on the expected cumulative reward with respect to the policy parameters. It requires only trajectories sampled from the current policy, which keeps the method conceptually simple and broadly applicable. The next section walks through the algorithm step by step.

3. Full Algorithm Flow

  1. Initialization:

    • Initialize the policy parameters θ, which define the policy πθ(a∣s).
    • Initialize a baseline b, which is often a value function or some approximation that helps reduce variance in the gradient estimate.
  2. Iteration (repeat until convergence):

    • Collect Trajectories:
      • Generate multiple trajectories τ={(s0,a0,r0),…,(sT,aT,rT)} by rolling out the current policy πθ in the environment. Each trajectory is a sequence of states, actions, and rewards obtained by interacting with the environment.
    • Compute Returns and Advantages:
      • Compute the return Rt for each time step t which is the discounted sum of future rewards from time step t onwards:
        $$
         R_t = \sum_{k=t}^T \gamma^{k-t} r_k 
        $$

        where T is the final time step of the trajectory, γ is the discount factor, and rk is the reward at time step k.

      • Compute the advantage estimate At, where b(st) is the baseline:
        $$
         A_t = R_t - b(s_t) 
        $$

        The advantage measures how much better the return obtained after taking the action in state st is than the baseline's estimate of the expected return from that state.

    • Refit the Baseline: Refit the baseline b(st) (often a neural network) by minimizing the squared error between its predictions and the actual returns across all trajectories and time steps. This is a standard supervised regression problem in which the baseline is trained to match the empirical returns from the rollouts:
      $$
       \min_{b} \sum_t \| b(s_t) - R_t \|^2 
      $$
    • Update the Policy: Adjust the policy parameters θ using the policy gradient estimate. The gradient is computed as the sum over time steps of the gradient of the log-probability of the action taken at time t, weighted by the advantage estimate at that step:
      $$
       \nabla_\theta J(\theta) = \sum_t \nabla_\theta \log \pi(a_t \mid s_t, \theta) \, \hat{A}_t 
      $$
      The parameters are then updated with a stochastic gradient ascent step,
      $$
       \theta \leftarrow \theta + \alpha \nabla_\theta J(\theta),
      $$
      where α is the learning rate. This adjusts the policy to increase the likelihood of actions that lead to better-than-average outcomes (as measured by the advantage). A minimal code sketch of one such iteration appears after this list.
  3. Repeat: Continue the process until the policy converges or achieves a satisfactory performance.
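
Below is a minimal NumPy sketch of one VPG iteration, assuming a discrete action space, a linear softmax policy, and a linear baseline b(s) = w·s. The function names (`compute_returns`, `vpg_update`) and the toy data at the end are illustrative, not part of any particular library; in practice the policy and baseline would usually be neural networks trained with an automatic differentiation framework.

```python
import numpy as np

def compute_returns(rewards, gamma=0.99):
    """Discounted return R_t = sum_{k=t}^{T} gamma^(k-t) r_k for one trajectory."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def policy_probs(theta, state):
    """Linear softmax policy: pi_theta(a | s) over all discrete actions."""
    return softmax(theta @ state)  # theta has shape (n_actions, state_dim)

def vpg_update(theta, w, trajectories, gamma=0.99, alpha=1e-2, beta=1e-2):
    """One Vanilla Policy Gradient iteration with a linear baseline b(s) = w @ s.

    trajectories: list of (states, actions, rewards) tuples, one per rollout,
    where states has shape (T, state_dim) and actions, rewards have shape (T,).
    """
    grad_theta = np.zeros_like(theta)
    grad_w = np.zeros_like(w)
    n_steps = 0

    for states, actions, rewards in trajectories:
        returns = compute_returns(rewards, gamma)   # R_t
        baselines = states @ w                      # b(s_t)
        advantages = returns - baselines            # A_t = R_t - b(s_t)

        for s, a, adv, ret, base in zip(states, actions, advantages, returns, baselines):
            probs = policy_probs(theta, s)
            one_hot = np.zeros_like(probs)
            one_hot[a] = 1.0
            # grad_theta log pi(a|s) = outer(one_hot(a) - probs, s) for a linear softmax policy
            grad_theta += np.outer(one_hot - probs, s) * adv
            # negative gradient of (b(s_t) - R_t)^2 with respect to w (up to a factor of 2)
            grad_w += (ret - base) * s
            n_steps += 1

    theta = theta + alpha * grad_theta / n_steps  # gradient ascent on the policy objective
    w = w + beta * grad_w / n_steps               # gradient descent on the baseline loss
    return theta, w

# Toy usage with synthetic rollout data: 2-dimensional states, 3 actions.
rng = np.random.default_rng(0)
theta = np.zeros((3, 2))
w = np.zeros(2)
fake_trajectory = (
    rng.normal(size=(5, 2)),        # states
    rng.integers(0, 3, size=5),     # actions
    rng.normal(size=5),             # rewards
)
theta, w = vpg_update(theta, w, [fake_trajectory])
```

The baseline is refit by moving w along the negative gradient of the squared error between b(st) and Rt, while the policy parameters move in the direction of the advantage-weighted score function, matching the update rule above.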

4. Summary

In this lesson, we introduced the Vanilla Policy Gradient (VPG) Algorithm, a fundamental method in reinforcement learning for directly optimizing policies. The method performs stochastic gradient ascent on the expected cumulative reward, using a baseline to reduce the variance of the gradient estimate. In the following lessons, we’ll explore the code implementation of the VPG algorithm, covering trajectory generation, advantage computation, and policy optimization.
