Introduction to On-Policy Reinforcement Learning

Deep Reinforcement Learning

Last updated: December 31, 2024

1. Motivation and Overview

In previous lessons, we explored value-based methods (e.g., Q-Learning, DQN), which are off-policy: they can learn from experiences generated by a policy other than the one being improved. Now we shift to on-policy methods, where all learning is tied directly to the current policy's behavior.

  • On-Policy: The agent collects data using the policy it is updating.
  • Off-Policy: The agent can learn from experiences generated by any policy (including replay buffers or older policies).

When is on-policy RL preferred?

  • Non-stationary or rapidly changing environments: The policy must stay closely aligned with fresh data.
  • Policy stability: We want the agent’s behavior to remain consistent with its training strategy (fewer wild updates).
  • Direct feedback: The agent only updates based on the trajectory distribution it currently samples.
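
To make this concrete, here is a minimal sketch of the on-policy loop in Python, assuming Gymnasium's CartPole-v1 environment; the policy and update_policy functions are placeholders standing in for a real network and gradient step, not part of any particular library.

    # Minimal on-policy loop: every update uses data generated by the
    # policy being updated, and that data is discarded afterwards.
    import gymnasium as gym

    env = gym.make("CartPole-v1")

    def policy(obs):
        # Placeholder for the current (trainable) policy.
        return env.action_space.sample()

    def update_policy(batch):
        # Placeholder for a policy-gradient update step.
        pass

    for iteration in range(10):
        batch = []
        obs, _ = env.reset()
        # Collect a fresh batch with the *current* policy...
        for _ in range(128):
            action = policy(obs)
            next_obs, reward, terminated, truncated, _ = env.step(action)
            batch.append((obs, action, reward))
            obs = env.reset()[0] if terminated or truncated else next_obs
        # ...use it for one update, then throw it away.
        update_policy(batch)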

2. On-Policy vs. Off-Policy: Sample Efficiency

2a. On-Policy Methods

  • Pros

    • The policy always updates based on its *most recent* behavior.
    • Naturally better at handling non-stationary tasks.
  • Cons

    • Sample-inefficient: each experience is typically used only once, because the policy changes after every update.
    • Potentially slower to converge in complex or large environments.

2b. Off-Policy Methods

  • Pros

    • Experience can be reused multiple times (e.g., replay buffers).
    • Generally more sample-efficient.
  • Cons

    • Training can diverge if updates are not carefully managed.
    • The learned policy might differ substantially from the data-collection policy.
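
For contrast, a minimal replay-buffer sketch illustrates how off-policy methods reuse experience; the buffer and batch sizes below are illustrative choices, not values from any particular algorithm.

    # Minimal replay buffer: transitions collected by *any* past policy
    # accumulate here and can be sampled for many gradient updates.
    import random
    from collections import deque

    replay_buffer = deque(maxlen=100_000)

    def store(transition):
        replay_buffer.append(transition)

    def sample_batch(batch_size=64):
        # The same transition can appear in many sampled batches,
        # which is where the sample-efficiency advantage comes from.
        return random.sample(list(replay_buffer), min(batch_size, len(replay_buffer)))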

3. Key On-Policy Algorithms

Over the years, several on-policy algorithms have been developed. Notably:

  1. REINFORCE (Monte Carlo Policy Gradient)

    • Pure, straightforward policy gradient using full-episode returns.
    • High-variance updates, but conceptually simple.
  2. Advantage Actor-Critic (A2C)

    • Introduces a critic (value function) to reduce variance, stabilizing training.
    • Frequently uses multiple parallel environments to speed up learning.
  3. Trust Region Policy Optimization (TRPO)

    • Restricts policy updates to a “trust region” to avoid large, destabilizing steps.
    • Often yields stable improvements but can be computationally expensive.
  4. Proximal Policy Optimization (PPO)

    • A practical alternative to TRPO, using a clipped objective that keeps updates within a sensible range (see the loss sketch after this list).
    • Strikes a balance between performance and computational simplicity.
    • One of the most popular on-policy algorithms for continuous control (like Lunar Lander).
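
To make the objectives behind REINFORCE and PPO concrete, the sketch below computes both losses on dummy PyTorch tensors; the batch size, the clipping coefficient of 0.2, and the random inputs are illustrative assumptions, not a full implementation.

    import torch

    # Dummy batch of 32 transitions: log-probabilities of the taken actions
    # under the current and the data-collecting (old) policy, plus
    # Monte Carlo returns and advantage estimates.
    log_probs_new = torch.randn(32, requires_grad=True)
    log_probs_old = torch.randn(32)
    returns = torch.randn(32)
    advantages = torch.randn(32)

    # REINFORCE: push up log pi(a|s) in proportion to the observed return
    # (we minimize the negative of the objective).
    reinforce_loss = -(log_probs_new * returns).mean()

    # PPO: clip the probability ratio so a single update cannot move the
    # policy too far from the policy that collected the data.
    clip_eps = 0.2  # illustrative value
    ratio = torch.exp(log_probs_new - log_probs_old)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    ppo_loss = -torch.min(ratio * advantages, clipped * advantages).mean()

    # In a real training loop, either loss is backpropagated through the
    # policy network and an optimizer step is taken.
    ppo_loss.backward()

Note that when the probability ratio stays inside the clipping range, the PPO loss reduces to the plain (unclipped) surrogate objective; the clipping only kicks in to limit overly large policy changes.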

4. Why Choose On-Policy?

  • Stable, Direct Policy Improvement: You learn exactly from what the policy is doing now.
  • Seamless Adaptation to Changes: If the environment changes, the policy updates reflect that change immediately.
  • Natural Exploration: Typically, policies are stochastic in on-policy RL, which encourages exploration (the agent tries different actions in proportion to their probabilities).
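
As a tiny illustration of the last point, a stochastic policy in PyTorch typically defines a distribution over actions and samples from it, so more probable actions are tried more often (the logits below are dummy values):

    import torch
    from torch.distributions import Categorical

    logits = torch.tensor([2.0, 0.5, 0.1])  # dummy policy outputs for 3 actions
    dist = Categorical(logits=logits)
    action = dist.sample()                  # likelier actions are sampled more often
    log_prob = dist.log_prob(action)        # kept for the policy-gradient loss later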

5. Summary

On-policy methods are an excellent choice when stable and up-to-date policy improvements are paramount, even if it means sacrificing some sample efficiency. In the next lessons, we'll dive deeper into stochastic policies and policy gradients, and eventually implement these methods in PyTorch, with applications to Gymnasium's Lunar Lander.
