Introduction to On-Policy Reinforcement Learning
Last updated: December 31, 2024
1. Motivation and Overview
In previous lessons, we explored value-based methods (e.g., Q-Learning, DQN), which are often off-policy because they can learn from experiences not generated by the current policy. Now we shift to on-policy methods, where all learning is directly tied to the current policy’s behavior.
- On-Policy: The agent collects data using the policy it is updating.
- Off-Policy: The agent can learn from experiences generated by any policy (including replay buffers or older policies).
When is on-policy RL preferred?
- Non-stationary or rapidly changing environments: The policy must stay closely aligned with fresh data.
- Policy stability: We want the agent’s behavior to remain consistent with its training strategy (fewer wild updates).
- Direct feedback: The agent only updates based on the trajectory distribution it currently samples.
2. On-Policy vs. Off-Policy: Sample Efficiency
2a. On-Policy Methods
- Pros:
  - The policy always updates based on its *most recent* behavior.
  - Naturally better at handling non-stationary tasks.
- Cons:
  - Sample-inefficient: each experience is typically used only once, because the policy changes after every update.
  - Potentially slower to converge in complex or large environments.
2b. Off-Policy Methods
- Pros:
  - Experience can be reused multiple times (e.g., via replay buffers).
  - Generally more sample-efficient.
- Cons:
  - Training can diverge if updates are not carefully managed.
  - The learned policy might differ substantially from the data-collection policy.
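To make the sample-efficiency contrast concrete, here is a deliberately simplified sketch. The `collect_transition` and `gradient_step` stubs are hypothetical placeholders, not a real library API: on-policy training discards each batch after a single update, while off-policy training keeps a replay buffer and reuses old transitions.

```python
import random
from collections import deque

# Hypothetical stand-ins for an environment rollout and a gradient step;
# a real implementation would use Gymnasium and PyTorch here.
def collect_transition(policy_version):
    return {"obs": 0.0, "action": 0, "reward": 1.0, "policy": policy_version}

def gradient_step(batch):
    pass  # placeholder for a policy/value update

# On-policy: gather fresh data, update once, then discard the batch,
# because it no longer matches the updated policy's distribution.
policy_version = 0
for iteration in range(3):
    batch = [collect_transition(policy_version) for _ in range(64)]
    gradient_step(batch)   # each transition is used exactly once
    policy_version += 1    # the batch is now stale; throw it away

# Off-policy: store everything in a replay buffer and reuse it many times.
buffer = deque(maxlen=10_000)
for step in range(3 * 64):
    buffer.append(collect_transition(policy_version))
    if len(buffer) >= 64:
        minibatch = random.sample(list(buffer), 64)  # old data reused freely
        gradient_step(minibatch)
```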
3. Key On-Policy Algorithms
Over the years, several on-policy algorithms have been developed. Notably:
- REINFORCE (Monte Carlo Policy Gradient)
  - Pure, straightforward policy gradient using full-episode returns (see the sketch after this list).
  - High-variance updates, but conceptually simple.
- Advantage Actor-Critic (A2C)
  - Introduces a critic (value function) to reduce variance, stabilizing training.
  - Frequently uses multiple parallel environments to speed up learning.
- Trust Region Policy Optimization (TRPO)
  - Restricts policy updates to a “trust region” to avoid large, destabilizing steps.
  - Often yields stable improvements but can be computationally expensive.
- Proximal Policy Optimization (PPO)
  - A practical alternative to TRPO with a clipped objective that keeps updates within a sensible range (see the sketch after this list).
  - Strikes a balance between performance and computational simplicity.
  - One of the most popular on-policy algorithms for both discrete and continuous control (like Lunar Lander).
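As a preview of the PyTorch implementations to come, here is a minimal sketch of the REINFORCE loss for one complete episode. The tensors are hypothetical stand-ins for data a real rollout would produce:

```python
import torch

# `log_probs[t]` is log pi(a_t | s_t) under the current policy,
# `rewards[t]` is the reward at step t (hypothetical values below).
log_probs = torch.tensor([-0.7, -1.1, -0.4], requires_grad=True)
rewards = [1.0, 0.0, 1.0]
gamma = 0.99

# Compute discounted returns G_t for each timestep, working backwards.
returns, g = [], 0.0
for r in reversed(rewards):
    g = r + gamma * g
    returns.insert(0, g)
returns = torch.tensor(returns)

# Policy gradient: maximize E[G_t * log pi(a_t | s_t)], so minimize the
# negative. Using full-episode returns is what makes this high-variance.
loss = -(returns * log_probs).sum()
loss.backward()  # gradients flow into the policy parameters
```

And a sketch of PPO's clipped surrogate objective, again with hypothetical batch values in place of real network outputs:

```python
import torch

# `ratio` is pi_new(a|s) / pi_old(a|s); `advantages` are advantage estimates.
new_log_probs = torch.tensor([-0.6, -1.0, -0.5], requires_grad=True)
old_log_probs = torch.tensor([-0.7, -1.1, -0.4])
advantages = torch.tensor([0.8, -0.3, 1.2])
clip_eps = 0.2

ratio = torch.exp(new_log_probs - old_log_probs)
clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)

# Take the pessimistic (minimum) of the clipped and unclipped objectives,
# removing any incentive to move the policy too far in a single update.
loss = -torch.min(ratio * advantages, clipped * advantages).mean()
loss.backward()
```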
4. Why Choose On-Policy?
- Stable, Direct Policy Improvement: You learn exactly from what the policy is doing now.
- Seamless Adaptation to Changes: If the environment changes, the policy updates reflect that change immediately.
- Natural Exploration: Typically, policies are stochastic in on-policy RL, which encourages exploration (the agent tries different actions in proportion to their probabilities).
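For instance, a stochastic policy over discrete actions can be represented as a categorical distribution, so sampling gives exploration for free. A minimal sketch with hypothetical logits:

```python
import torch
from torch.distributions import Categorical

# Hypothetical action logits produced by a policy network for one state.
logits = torch.tensor([2.0, 0.5, 0.1, -1.0])

dist = Categorical(logits=logits)  # softmax over logits -> action probabilities
action = dist.sample()             # sampled in proportion to its probability
log_prob = dist.log_prob(action)   # needed later for the policy-gradient loss

print(action.item(), log_prob.item())
```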
5. Summary
On-policy methods are an excellent choice when stable and up-to-date policy improvements are paramount, even if it means sacrificing some sample efficiency. In the next lessons, we’ll dive deeper into stochastic policies, policy gradients, and eventually learn how to implement these methods in PyTorch, with applications to Gymnasium’s Lunar Lander.