Introduction to On-Policy Reinforcement Learning
Last updated: December 31, 2024
1. Motivation and Overview
In previous lessons, we explored value-based methods (e.g., Q-Learning, DQN), which are often off-policy because they can learn from experiences not generated by the current policy. Now we shift to on-policy methods, where all learning is directly tied to the current policy’s behavior.
- On-Policy: The agent collects data using the policy it is updating.
- Off-Policy: The agent can learn from experiences generated by any policy (including replay buffers or older policies).
When is on-policy RL preferred?
- Non-stationary or rapidly changing environments: The policy must stay closely aligned with fresh data.
- Policy stability: We want the agent’s behavior to remain consistent with its training strategy (fewer wild updates).
- Direct feedback: The agent only updates based on the trajectory distribution it currently samples.
2. On-Policy vs. Off-Policy: Sample Efficiency
2a. On-Policy Methods
- Pros:
  - The policy always updates based on its *most recent* behavior.
  - Naturally better at handling non-stationary tasks.
- Cons:
  - Sample-inefficient: each experience is typically used only once, because the policy changes after every update.
  - Potentially slower to converge in complex or large environments.
2b. Off-Policy Methods
- Pros:
  - Experience can be reused multiple times (e.g., via replay buffers).
  - Generally more sample-efficient.
- Cons:
  - Training can diverge if updates are not carefully managed.
  - The learned policy might differ substantially from the data-collection policy.
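To make the sample-efficiency contrast concrete, here is a deliberately simplified sketch. The `collect_transition` and `gradient_step` stubs are hypothetical placeholders, not a real library API: on-policy training discards each batch after a single update, while off-policy training keeps a replay buffer and reuses old transitions.

```python
import random
from collections import deque

# Hypothetical stand-ins for an environment rollout and a gradient step;
# a real implementation would use Gymnasium and PyTorch here.
def collect_transition(policy_version):
    return {"obs": 0.0, "action": 0, "reward": 1.0, "policy": policy_version}

def gradient_step(batch):
    pass  # placeholder for a policy/value update

# On-policy: gather fresh data, update once, then discard the batch,
# because it no longer matches the updated policy's distribution.
policy_version = 0
for iteration in range(3):
    batch = [collect_transition(policy_version) for _ in range(64)]
    gradient_step(batch)   # each transition is used exactly once
    policy_version += 1    # the batch is now stale; throw it away

# Off-policy: store everything in a replay buffer and reuse it many times.
buffer = deque(maxlen=10_000)
for step in range(3 * 64):
    buffer.append(collect_transition(policy_version))
    if len(buffer) >= 64:
        minibatch = random.sample(list(buffer), 64)  # old data reused freely
        gradient_step(minibatch)
```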
3. Key On-Policy Algorithms
Over the years, several on-policy algorithms have been developed. Notably:
- REINFORCE (Monte Carlo Policy Gradient)
  - Pure, straightforward policy gradient using full-episode returns (see the sketch after this list).
  - High-variance updates, but conceptually simple.
- Advantage Actor-Critic (A2C)
  - Introduces a critic (value function) to reduce variance, stabilizing training.
  - Frequently uses multiple parallel environments to speed up learning.
- Trust Region Policy Optimization (TRPO)
  - Restricts policy updates to a “trust region” to avoid large, destabilizing steps.
  - Often yields stable improvements but can be computationally expensive.
- Proximal Policy Optimization (PPO)
  - A practical alternative to TRPO with a clipped objective that keeps updates within a sensible range (see the sketch after this list).
  - Strikes a balance between performance and computational simplicity.
  - One of the most popular on-policy algorithms for both discrete and continuous control (like Lunar Lander).
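As a preview of the PyTorch implementations to come, here is a minimal sketch of the REINFORCE loss for one complete episode. The tensors are hypothetical stand-ins for data a real rollout would produce:

```python
import torch

# `log_probs[t]` is log pi(a_t | s_t) under the current policy,
# `rewards[t]` is the reward at step t (hypothetical values below).
log_probs = torch.tensor([-0.7, -1.1, -0.4], requires_grad=True)
rewards = [1.0, 0.0, 1.0]
gamma = 0.99

# Compute discounted returns G_t for each timestep, working backwards.
returns, g = [], 0.0
for r in reversed(rewards):
    g = r + gamma * g
    returns.insert(0, g)
returns = torch.tensor(returns)

# Policy gradient: maximize E[G_t * log pi(a_t | s_t)], so minimize the
# negative. Using full-episode returns is what makes this high-variance.
loss = -(returns * log_probs).sum()
loss.backward()  # gradients flow into the policy parameters
```

And a sketch of PPO's clipped surrogate objective, again with hypothetical batch values in place of real network outputs:

```python
import torch

# `ratio` is pi_new(a|s) / pi_old(a|s); `advantages` are advantage estimates.
new_log_probs = torch.tensor([-0.6, -1.0, -0.5], requires_grad=True)
old_log_probs = torch.tensor([-0.7, -1.1, -0.4])
advantages = torch.tensor([0.8, -0.3, 1.2])
clip_eps = 0.2

ratio = torch.exp(new_log_probs - old_log_probs)
clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)

# Take the pessimistic (minimum) of the clipped and unclipped objectives,
# removing any incentive to move the policy too far in a single update.
loss = -torch.min(ratio * advantages, clipped * advantages).mean()
loss.backward()
```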
4. Why Choose On-Policy?
- Stable, Direct Policy Improvement: You learn exactly from what the policy is doing now.
- Seamless Adaptation to Changes: If the environment changes, the policy updates reflect that change immediately.
- Natural Exploration: Typically, policies are stochastic in on-policy RL, which encourages exploration (the agent tries different actions in proportion to their probabilities).
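For instance, a stochastic policy over discrete actions can be represented as a categorical distribution, so sampling gives exploration for free. A minimal sketch with hypothetical logits:

```python
import torch
from torch.distributions import Categorical

# Hypothetical action logits produced by a policy network for one state.
logits = torch.tensor([2.0, 0.5, 0.1, -1.0])

dist = Categorical(logits=logits)  # softmax over logits -> action probabilities
action = dist.sample()             # sampled in proportion to its probability
log_prob = dist.log_prob(action)   # needed later for the policy-gradient loss

print(action.item(), log_prob.item())
```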
5. Summary
On-policy methods are an excellent choice when stable and up-to-date policy improvements are paramount, even if it means sacrificing some sample efficiency. In the next lessons, we’ll dive deeper into stochastic policies, policy gradients, and eventually learn how to implement these methods in PyTorch, with applications to Gymnasium’s Lunar Lander.