Introduction to On-Policy Methods

Reinforcement Learning

Last updated: November 16, 2024

1. Introduction 

On-policy reinforcement learning (RL) methods are fundamental in environments where continuous feedback from the agent's current policy is crucial. These methods train the agent on data generated by the very policy being improved, so each update directly reflects the agent's current behavior. On-policy methods are less sample-efficient than off-policy techniques, since data collected under an old policy must be discarded once the policy changes, but they are valued for their stability and for adapting readily to changes in the environment.

This lesson will introduce the principles of on-policy RL methods, focusing on their strengths and trade-offs. We will discuss key algorithms such as REINFORCE, Advantage Actor-Critic (A2C), Proximal Policy Optimization (PPO), and Trust Region Policy Optimization (TRPO), emphasizing their design and application in various scenarios.

2. Compute and Sample Efficiency: On-Policy vs. Off-Policy Methods 

2a. On-Policy Methods:

On-policy algorithms learn exclusively from trajectories collected by the current policy. Every gradient step changes the policy, which makes previously collected data stale; in practice it is discarded after each update. This costs environment interactions but keeps the learning signal consistent with the agent's actual behavior, which tends to make training more stable.

2b. Comparison with Off-Policy Methods:

Off-policy methods (e.g., Q-learning, DQN, SAC) can learn from experience generated by older versions of the policy, or even by a different policy entirely, typically stored in a replay buffer. Reusing data makes them more sample-efficient, but learning from stale experience can introduce bias and instability, often requiring extra machinery (such as target networks) to keep training stable. On-policy methods spend more environment interactions in exchange for fresher, lower-bias updates.
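This difference shows up directly in the structure of the training loop. The sketch below is schematic Python: collect_step and update are illustrative placeholders, not functions from any particular library.

    from collections import deque
    import random

    def collect_step():  # toy stand-in for one environment transition (illustrative)
        return {"obs": 0, "action": 0, "reward": random.random()}

    def update(data):    # placeholder for a learning update (illustrative)
        pass

    # On-policy data flow: fresh rollouts only.
    def train_on_policy(num_iters=3, rollout_len=5):
        for _ in range(num_iters):
            rollout = [collect_step() for _ in range(rollout_len)]  # current policy only
            update(rollout)  # gradient step(s) on fresh data
            # rollout is discarded here: it no longer matches the updated policy

    # Off-policy data flow: a replay buffer reuses old experience.
    def train_off_policy(num_steps=15, batch_size=4):
        buffer = deque(maxlen=10_000)
        for _ in range(num_steps):
            buffer.append(collect_step())                  # data from many past policies
            if len(buffer) >= batch_size:
                update(random.sample(buffer, batch_size))  # reuse stale experience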

3. Key On-Policy Algorithms

On-policy methods are widely used in reinforcement learning because they are conceptually simple and improve the policy directly from its own experience. Below are the most notable algorithms:

3a. REINFORCE (Monte Carlo Policy Gradient)

The simplest policy-gradient algorithm. REINFORCE runs complete episodes with the current policy, computes the discounted return that followed each action, and increases the log-probability of actions in proportion to those returns. Its gradient estimates are unbiased but, because they rely on full Monte Carlo returns, high-variance.
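A minimal sketch of one REINFORCE update, assuming PyTorch; the network shape, learning rate, and the 4-dimensional observation / 2-action setting are illustrative choices, not prescribed by the algorithm.

    import torch
    import torch.nn as nn

    # Toy policy: maps a 4-dim observation to logits over 2 actions.
    policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))
    optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

    def reinforce_update(observations, actions, rewards, gamma=0.99):
        """One Monte Carlo policy-gradient step on a complete episode."""
        # Discounted return G_t, computed from the end of the episode backwards.
        returns, g = [], 0.0
        for r in reversed(rewards):
            g = r + gamma * g
            returns.insert(0, g)
        returns = torch.tensor(returns)
        # Normalizing returns is a common variance-reduction trick.
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)

        logits = policy(torch.stack(observations))
        log_probs = torch.distributions.Categorical(logits=logits).log_prob(
            torch.tensor(actions))
        # Ascend E[log pi(a|s) * G_t] by descending its negative.
        loss = -(log_probs * returns).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()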

3b. Advantage Actor-Critic (A2C)

Pairs the policy (the actor) with a learned state-value function (the critic). Weighting the policy gradient by the advantage A(s, a) = G - V(s), rather than by the raw return, substantially reduces variance and allows updates from short, partial trajectories instead of complete episodes.
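A minimal sketch of an A2C update, again assuming PyTorch with an illustrative 4-dim observation, 2-action setting; the loss coefficients are typical defaults, not canonical values.

    import torch
    import torch.nn as nn

    actor = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))
    critic = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 1))
    optimizer = torch.optim.Adam([*actor.parameters(), *critic.parameters()], lr=7e-4)

    def a2c_update(obs, actions, returns, value_coef=0.5, entropy_coef=0.01):
        """One A2C step: the policy gradient is weighted by A = G - V(s)."""
        values = critic(obs).squeeze(-1)
        advantages = returns - values.detach()  # baseline cuts gradient variance

        dist = torch.distributions.Categorical(logits=actor(obs))
        policy_loss = -(dist.log_prob(actions) * advantages).mean()
        value_loss = (returns - values).pow(2).mean()  # critic regression target
        entropy_bonus = dist.entropy().mean()          # encourages exploration

        loss = policy_loss + value_coef * value_loss - entropy_coef * entropy_bonus
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()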

3c. Trust Region Policy Optimization (TRPO)

Constrains each update so that the KL divergence between the old and new policies stays below a small threshold, which yields approximately monotonic policy improvement. The constrained problem is solved with conjugate gradients plus a backtracking line search, making TRPO robust but comparatively expensive and complex to implement.
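A full TRPO implementation (conjugate gradients, line search) is beyond a short sketch, but the two quantities it trades off are easy to show. Assuming PyTorch, with old_dist and new_dist as the action distributions before and after a candidate update:

    import torch

    def surrogate_and_kl(new_dist, old_dist, actions, advantages):
        """The surrogate objective TRPO maximizes and the KL term it constrains.

        Full TRPO maximizes the surrogate subject to KL <= delta; this sketch
        only computes the two quantities.
        """
        # Importance ratio pi_new(a|s) / pi_old(a|s), computed in log space.
        ratio = torch.exp(new_dist.log_prob(actions) - old_dist.log_prob(actions))
        surrogate = (ratio * advantages).mean()  # proxy for expected improvement
        kl = torch.distributions.kl_divergence(old_dist, new_dist).mean()
        return surrogate, kl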

3d. Proximal Policy Optimization (PPO)

A simpler, first-order successor to TRPO. Rather than enforcing an explicit KL constraint, PPO clips the probability ratio between the new and old policies, which removes the incentive for destructively large updates while remaining easy to implement and parallelize. It is one of the most widely used on-policy algorithms today.
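PPO's clipped surrogate loss is compact enough to show in full. A minimal sketch, assuming PyTorch; clip_eps=0.2 is the value suggested in the original PPO paper.

    import torch

    def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
        """PPO's clipped surrogate: a first-order stand-in for TRPO's constraint."""
        ratio = torch.exp(new_log_probs - old_log_probs)  # pi_new / pi_old
        unclipped = ratio * advantages
        # Clipping removes any gain from pushing the ratio outside [1-eps, 1+eps].
        clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
        return -torch.min(unclipped, clipped).mean()  # minimize the negative surrogate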

4. Why Choose On-Policy Methods?

On-policy methods are best suited for environments where:

  1. Policy Stability is critical: The agent's behavior must remain stable and consistent with the policy it is learning.
  2. Non-Stationarity is a concern: The environment changes over time, so the agent must keep learning from its current interactions rather than from stale experience.
  3. Exploration must balance exploitation: On-policy policies are typically stochastic, so sampling actions from the policy itself naturally encourages exploration while still optimizing expected reward (see the sketch after this list).
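The third point is worth making concrete. In the sketch below (assuming PyTorch; the logits are arbitrary illustrative numbers), actions are sampled from a softmax distribution rather than taken greedily, so any action with non-zero probability can be explored:

    import torch

    logits = torch.tensor([2.0, 0.5, -1.0])  # illustrative action preferences
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()   # stochastic draw, not argmax: exploration is built in
    print(dist.probs)        # ~[0.79, 0.18, 0.04] -- no action has probability 0
    print(dist.entropy())    # higher entropy means more exploration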

5. Summary

On-policy methods prioritize stable and direct policy improvement by using experiences generated from the current policy. Algorithms like REINFORCE, A2C, TRPO, and PPO are widely used in diverse RL applications, from robotics to gaming, where continuous feedback and stable learning are critical.

In the next lessons, we will delve deeper into these algorithms, starting with REINFORCE, to understand their mechanics and how they can be effectively implemented in real-world scenarios.
