A3C & GAE: Improved On-Policy Techniques

Deep Reinforcement Learning

Last updated: January 01, 2025

1. Introduction

While Vanilla Policy Gradient is easy to grasp, it can be inefficient and unstable for large or complex environments (like high-dimensional continuous control). A3C (Asynchronous Advantage Actor-Critic) and GAE (Generalized Advantage Estimation) address some of these limitations: A3C collects decorrelated experience with many parallel workers, and GAE trades off bias against variance when estimating advantages.

2. Challenges with Vanilla Policy Gradient

3. Advantage Estimation Refresher

Recall that the advantage $A(s_t, a_t)$ tells us how much better an action $a_t$ was than the baseline $V(s_t)$. This helps direct gradient updates more precisely.
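
As a minimal sketch, the one-step version of this estimate, $A(s_t, a_t) \approx r_t + \gamma\,V(s_{t+1}) - V(s_t)$, can be computed directly from rollout data; the `rewards` and `values` arrays below are hypothetical, with one extra bootstrap value appended to `values`:

```python
import numpy as np

def one_step_advantages(rewards, values, gamma=0.99):
    # rewards: r_0 .. r_{T-1}; values: V(s_0) .. V(s_T) (one bootstrap value extra)
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    # A(s_t, a_t) ≈ r_t + gamma * V(s_{t+1}) - V(s_t)
    return rewards + gamma * values[1:] - values[:-1]

# Example with made-up numbers: a 3-step rollout.
print(one_step_advantages([1.0, 0.0, -1.0], [0.5, 0.4, 0.2, 0.0]))
```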

4. Asynchronous Advantage Actor-Critic (A3C)

4a. Key Ideas

  1. Multiple Workers: Instead of a single agent collecting data, multiple agents run in parallel, each with a copy of the environment (like multiple Lunar Landers).
  2. Actor-Critic: Each worker has an actor (policy) and a critic (value function).
  3. Asynchronous Updates: Each worker periodically pushes its gradients to the shared global parameters and then refreshes its local copy, without waiting for the other workers. This decorrelates the training data and often speeds up convergence.
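
The sketch below shows only the structure of these three ideas, not a working agent: environment interaction and the actual actor-critic gradient are stubbed out (`fake_gradient` is a placeholder), and the shared parameters are simplified to a lock-protected NumPy array:

```python
import threading
import numpy as np

class SharedParams:
    def __init__(self, size):
        self.theta = np.zeros(size)          # global actor-critic parameters
        self.lock = threading.Lock()

    def apply(self, grad, lr):
        with self.lock:                      # brief, race-free update to shared params
            self.theta -= lr * grad

def fake_gradient(local_theta, rng):
    # Placeholder for the policy-gradient + value-loss gradient
    # computed from this worker's latest transitions.
    return rng.normal(size=local_theta.shape) * 0.01

def worker(shared, worker_id, n_updates=5, lr=0.1):
    rng = np.random.default_rng(worker_id)   # each worker sees different data
    for _ in range(n_updates):
        local_theta = shared.theta.copy()        # 1) sync a local copy
        grad = fake_gradient(local_theta, rng)   # 2) rollout + gradient (stubbed)
        shared.apply(grad, lr)                   # 3) push update to global parameters

shared = SharedParams(size=4)
threads = [threading.Thread(target=worker, args=(shared, i)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared.theta)
```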

4b. Benefits

5. Generalized Advantage Estimation (GAE)

5a. Motivation

GAE refines advantage estimates by blending multi-step returns, trading off bias against variance. It starts from the one-step TD residual

$$\delta_t = r_t + \gamma\,V(s_{t+1}) - V(s_t),$$

and forms an exponentially weighted sum of these residuals:

$$A_t^{\text{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l\,\delta_{t+l}.$$
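
In practice the sum is computed over a finite rollout with the backward recursion $A_t = \delta_t + \gamma\lambda\,A_{t+1}$, typically masked at episode boundaries. A minimal sketch, assuming hypothetical `rewards`, `values` (with a bootstrap value appended), and `dones` arrays from a worker's rollout:

```python
import numpy as np

def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)   # length len(rewards) + 1
    dones = np.asarray(dones, dtype=np.float64)     # 1.0 where the episode ended
    advantages = np.zeros_like(rewards)
    gae = 0.0
    # Backward recursion: A_t = delta_t + gamma * lambda * A_{t+1},
    # zeroing the bootstrap and the carried sum across episode ends.
    for t in reversed(range(len(rewards))):
        not_done = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    return advantages

# Example with made-up numbers: a 3-step rollout that does not terminate.
print(gae_advantages([1.0, 0.0, -1.0], [0.5, 0.4, 0.2, 0.1], [0.0, 0.0, 0.0]))
```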

5b. Benefits

6. Putting It All Together

A3C and GAE combine naturally into one training loop:

  1. Multiple parallel workers each gather transitions.
  2. Advantages are computed with GAE for each worker’s data.
  3. Updates to the global actor-critic parameters happen asynchronously (A3C) or synchronously (the A2C variant).
  4. The result: a faster, more stable training pipeline that handles large, complex tasks well.
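
A structural sketch of this pipeline in its synchronous (A2C-style) form, assuming the `gae_advantages` helper sketched above is in scope; rollout collection and the gradient step are stubbed placeholders, and the names are illustrative only:

```python
import numpy as np

def collect_rollout(env, policy_params, n_steps):
    # Placeholder: run the current policy for n_steps in this worker's
    # environment copy and return what GAE needs. Real code would also
    # store observations and actions for the gradient step.
    rewards = np.zeros(n_steps)
    values = np.zeros(n_steps + 1)   # includes the bootstrap value V(s_T)
    dones = np.zeros(n_steps)
    return rewards, values, dones

def train_step(envs, policy_params, n_steps=20, gamma=0.99, lam=0.95):
    all_advantages = []
    for env in envs:                                   # 1) each worker gathers transitions
        rewards, values, dones = collect_rollout(env, policy_params, n_steps)
        adv = gae_advantages(rewards, values, dones, gamma, lam)  # 2) GAE per worker
        all_advantages.append(adv)
    # 3) one shared actor-critic update from the batched advantages
    #    (the actual gradient computation is omitted in this sketch).
    return np.concatenate(all_advantages)

# Example call with two dummy "environments"; real code would pass env objects.
print(train_step(envs=[None, None], policy_params=None).shape)
```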

7. Summary
