1. Introduction
In this lesson, we explore the Vanilla Policy Gradient (VPG) Algorithm, one of the foundational approaches in reinforcement learning for directly optimizing policies. Unlike value-based methods that learn value functions to guide action selection, policy gradient methods like VPG aim to optimize the policy itself by estimating the gradient of the expected reward with respect to the policy parameters.
We will dive into the key mathematical components of the algorithm, understanding how the policy gradient is derived and computed. This includes the use of trajectories to estimate the policy gradient, the role of the log-likelihood function, and the importance of the expected cumulative reward.
By the end of this lesson, you will have a clear understanding of how the Vanilla Policy Gradient Algorithm works, the underlying mathematical principles, and its application to reinforcement learning tasks.
2. What Does This Algorithm Offer?
The Vanilla Policy Gradient (VPG) Algorithm provides a systematic way to iteratively improve the policy by directly optimizing the expected reward. Here's how it works step by step:
- Estimating the Return: The algorithm computes the return, i.e., the cumulative reward, for each time step in a trajectory. This quantifies the total reward achieved starting from a specific point in the episode.
- Refitting a Baseline: To reduce the variance in the gradient estimates, a baseline (typically a value function) is fitted to approximate the expected returns. This makes learning more stable and efficient.
- Computing the Advantage: The advantage function is calculated as the difference between the observed return and the baseline. This focuses learning on actions that performed better than expected, ensuring that the policy improves in meaningful ways.
- Updating the Policy: The policy is updated using the advantage-weighted gradient estimate. This guides the policy towards selecting actions that increase future rewards, effectively improving its performance. A minimal code sketch of this loop follows the list.
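The four steps above correspond to one iteration of a training loop. The Python sketch below shows that structure at a high level; the helper functions (collect_trajectories, compute_returns, fit_baseline, policy_gradient_step) and the trajectory attributes are hypothetical placeholders rather than calls to any particular library, and concrete versions of some of them are sketched at the end of the next section.

```python
# High-level VPG training loop. All helper functions and trajectory attributes
# are hypothetical placeholders used to illustrate the structure of one iteration.
def train_vpg(env, policy, baseline, num_iterations=100, gamma=0.99):
    for _ in range(num_iterations):
        # 1. Roll out the current policy to collect trajectories.
        trajectories = collect_trajectories(env, policy)

        for traj in trajectories:
            # 2. Estimate the return R_t at every time step.
            traj.returns = compute_returns(traj.rewards, gamma)

        # 3. Refit the baseline (e.g., a value network) to the observed returns.
        fit_baseline(baseline, trajectories)

        for traj in trajectories:
            # 4. Advantage = observed return minus baseline prediction.
            traj.advantages = traj.returns - baseline(traj.states)

        # 5. Advantage-weighted policy gradient update.
        policy_gradient_step(policy, trajectories)
    return policy
```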
3. Full Algorithm Flow
- Initialization:
- Initialize the policy parameters θ, which define the policy πθ(u∣s).
- Initialize a baseline b, which is often a value function or some approximation that helps reduce variance in the gradient estimate.
- Iteration (repeat until convergence):
- Collect Trajectories:
- Generate multiple trajectories τ={(s0,u0,r0),…,(sT,uT,rT)} by rolling out the current policy πθ in the environment. These are sequences of states, actions, and rewards obtained by interacting with the environment.
- Compute Returns and Advantages:
- Compute the return Rt for each time step t, which is the discounted sum of future rewards from time step t onwards:
$$
R_t = \sum_{k=t}^T \gamma^{k-t} r_k
$$
where T is the total number of time steps in the trajectory, γ is the discount factor, and rk is the reward received at time step k. A short helper for this computation is sketched after the algorithm outline below.
- Compute the advantage estimate At, where b(st) is the baseline:
$$
A_t = R_t - b(s_t)
$$
The advantage tells us how much better the action taken at state st is compared to the expected return (baseline).
- Refit the Baseline: Refit the baseline b(st) (often a neural network) by minimizing the difference between its predictions and the actual returns across all trajectories and time steps. This is typically a supervised learning problem where the baseline is trained to match the empirical returns from the rollouts; a short sketch of this step is given after the algorithm outline below.
$$
\min_b \sum_t \| b(s_t) - R_t \|^2
$$
- Update the Policy: Adjust the policy parameters θ using the policy gradient estimate. This gradient is computed as the sum over time steps of the gradient of the log-probability of the action taken at time t, weighted by the advantage at that step:
$$
\nabla_\theta J(\theta) = \sum_t \nabla_\theta \log \pi_\theta(u_t \mid s_t) \, A_t
$$
The parameters are then updated with a stochastic gradient ascent step,
$$
\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)
$$
where α is the learning rate. This adjusts the policy to increase the likelihood of actions that lead to better-than-average outcomes (as measured by the advantage). A minimal sketch of this update is given after the algorithm outline below.
- Repeat: Continue the process until the policy converges or achieves satisfactory performance.
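As referenced in the return-computation step above, here is a minimal NumPy helper that computes Rt = ∑ γ^(k−t) rk for a single trajectory by scanning the rewards backwards. The function name and signature are illustrative, not taken from a specific library.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Compute R_t = sum_{k=t}^{T} gamma^(k-t) * r_k for each time step t."""
    returns = np.zeros(len(rewards))
    running = 0.0
    # Scan backwards so that R_t = r_t + gamma * R_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: rewards [1, 1, 1] with gamma = 0.9 give returns [2.71, 1.9, 1.0].
print(discounted_returns([1.0, 1.0, 1.0], gamma=0.9))
```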
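The baseline refit referenced above is an ordinary regression problem. The sketch below assumes a small PyTorch value network (the network size, learning rate, and names are illustrative choices, not prescribed by the algorithm); it minimizes the squared error between b(st) and the empirical returns, then forms the advantages At = Rt − b(st).

```python
import torch
import torch.nn as nn

# Illustrative baseline: a small value network b(s) -> scalar.
value_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))
value_optim = torch.optim.Adam(value_net.parameters(), lr=1e-3)

def refit_baseline_and_advantages(states, returns, epochs=5):
    """states: (N, obs_dim) float tensor; returns: (N,) tensor of empirical R_t."""
    # Supervised regression: min || b(s_t) - R_t ||^2 over all collected steps.
    for _ in range(epochs):
        value_optim.zero_grad()
        pred = value_net(states).squeeze(-1)
        loss = ((pred - returns) ** 2).mean()
        loss.backward()
        value_optim.step()
    # Advantage A_t = R_t - b(s_t), with the refit baseline held fixed.
    with torch.no_grad():
        advantages = returns - value_net(states).squeeze(-1)
    return advantages
```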
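Finally, the policy update performs gradient ascent on the advantage-weighted log-likelihood. The PyTorch sketch below, again with illustrative network sizes and names, builds a surrogate loss whose gradient matches the estimator ∇θJ(θ) above and takes one optimizer step; the minus sign converts the ascent step into a minimization that the optimizer can perform.

```python
import torch
import torch.nn as nn

# Illustrative discrete-action policy network pi_theta(u | s).
policy_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
policy_optim = torch.optim.Adam(policy_net.parameters(), lr=3e-4)

def policy_gradient_step(states, actions, advantages):
    """One VPG update: ascend sum_t log pi_theta(u_t | s_t) * A_t."""
    policy_optim.zero_grad()
    logits = policy_net(states)                       # (N, num_actions)
    dist = torch.distributions.Categorical(logits=logits)
    log_probs = dist.log_prob(actions)                # log pi_theta(u_t | s_t)
    # Negate: minimizing -E[log pi * A] is gradient ascent on E[log pi * A].
    loss = -(log_probs * advantages.detach()).mean()
    loss.backward()
    policy_optim.step()
```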
4. Summary
In this lesson, we introduced the Vanilla Policy Gradient (VPG) Algorithm, a fundamental method in reinforcement learning for directly optimizing policies. It improves the policy by stochastic gradient ascent on the expected return, using a baseline and advantage estimates to reduce the variance of the gradient and keep learning stable. In the following lessons, we'll explore the code implementation of the VPG algorithm, covering trajectory generation, advantage computation, and policy optimization.