1. Introduction
While Vanilla Policy Gradient is easy to grasp, it can be inefficient and unstable for large or complex environments (like high-dimensional continuous control). A3C (Asynchronous Advantage Actor-Critic) and GAE (Generalized Advantage Estimation) address some of these limitations by:
- Parallelizing data collection (A3C).
- Refining advantage estimates to lower variance (GAE).
2. Challenges with Vanilla Policy Gradient
- Sample Inefficiency: Each iteration requires fresh rollouts, which can be slow.
- High Variance: Return-based gradients can fluctuate wildly across episodes.
- Exploration Difficulties: A single policy might not explore enough in complex tasks.
3. Advantage Estimation Refresher
Recall that the advantage $A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$ tells us how much better taking action $a_t$ in state $s_t$ was than the baseline value $V(s_t)$ predicts. This directs gradient updates more precisely: actions with positive advantage are reinforced, while actions with negative advantage are discouraged.
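To make this concrete, here is a minimal sketch (assuming NumPy and a toy, hand-made rollout) of the simplest advantage estimate: the discounted Monte Carlo return minus the critic's value baseline. The function name and sample arrays are invented for illustration.

```python
import numpy as np

def mc_advantages(rewards, values, gamma=0.99):
    """Monte Carlo advantage estimate: discounted return minus the value baseline."""
    returns = np.zeros_like(rewards, dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running   # G_t = r_t + gamma * G_{t+1}
        returns[t] = running
    return returns - values                      # A(s_t, a_t) ≈ G_t - V(s_t)

# Toy rollout: rewards from the environment, values from the critic.
rewards = np.array([0.0, 0.0, 1.0])
values  = np.array([0.3, 0.5, 0.8])
print(mc_advantages(rewards, values))
```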
4. Asynchronous Advantage Actor-Critic (A3C)
4a. Key Ideas
- Multiple Workers: Instead of a single agent collecting data, multiple agents run in parallel, each with a copy of the environment (like multiple Lunar Landers).
- Actor-Critic: Each worker has an actor (policy) and a critic (value function).
- Asynchronous Updates: Workers periodically push their gradients to shared global parameters and pull the latest copy back. This decorrelates data and often speeds up convergence (see the worker sketch after this list).
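Below is a minimal structural sketch of the asynchronous worker loop, assuming plain Python threads and NumPy. The environment interaction and the gradient computation are stubbed out with random numbers, and the names (`worker`, `global_params`) are invented for illustration; a real A3C implementation would compute actor-critic gradients from actual rollouts and typically apply them Hogwild-style without a lock.

```python
import threading
import numpy as np

# Shared "global" parameters; in a real A3C these would be the weights of a
# shared actor-critic network.
global_params = np.zeros(4)
lock = threading.Lock()

def worker(worker_id, n_updates=5, lr=0.1, seed=0):
    rng = np.random.default_rng(seed + worker_id)
    for _ in range(n_updates):
        # 1. Sync: copy the latest global parameters into the local model.
        local_params = global_params.copy()
        # 2. Act: collect a short rollout with the local policy
        #    (stubbed here with random numbers standing in for a real gradient).
        fake_gradient = rng.normal(size=local_params.shape)
        # 3. Update: push the local gradient into the global parameters.
        with lock:
            global_params[:] = global_params - lr * fake_gradient

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("global parameters after asynchronous updates:", global_params)
```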
4b. Benefits
- Faster Convergence: Parallel data collection means you get more experiences in the same wall-clock time.
- Better Exploration: Different workers may discover different strategies.
- Reduced Variance: The critic helps stabilize policy gradient updates.
5. Generalized Advantage Estimation (GAE)
5a. Motivation
GAE refines advantage estimates by blending multi-step returns, trading off bias and variance. Define the TD residual
$$\delta_t = r_t + \gamma\,V(s_{t+1}) - V(s_t),$$
then the GAE advantage is the exponentially weighted sum
$$A_t^{\text{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l\,\delta_{t+l}.$$
- $\lambda \in [0, 1]$: hyperparameter controlling the bias-variance trade-off. $\lambda = 0$ reduces to the one-step TD residual (low variance, higher bias), while $\lambda = 1$ recovers the Monte Carlo advantage (low bias, high variance).
- This approach often yields smoother advantage estimates, improving policy gradient performance (see the sketch after this list).
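Here is a minimal implementation sketch of the sum above in its recursive form, assuming NumPy, a single finite rollout, and no episode-termination masking; the function name and toy arrays are invented for illustration.

```python
import numpy as np

def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Compute GAE(gamma, lambda) advantages for one finite rollout."""
    advantages = np.zeros_like(rewards, dtype=np.float64)
    running = 0.0
    next_value = last_value                # bootstrap V(s_T) for the final step
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]   # TD residual
        running = delta + gamma * lam * running               # recursive GAE sum
        advantages[t] = running
        next_value = values[t]
    return advantages

# Toy rollout: rewards from the environment, values from the critic.
rewards = np.array([0.0, 0.0, 1.0])
values  = np.array([0.3, 0.5, 0.8])
print(gae_advantages(rewards, values, last_value=0.0))
```

The recursion exploits the identity $A_t = \delta_t + \gamma\lambda\,A_{t+1}$, so the entire rollout is processed in a single backward pass.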
5b. Benefits
- Lower Variance: More stable updates than raw MC returns.
- Faster Learning: More reliable advantage signals can accelerate convergence.
- Flexibility: Tuning $\lambda$ helps adapt to different tasks or reward structures.
6. Putting It All Together
A3C + GAE can be combined:
- Multiple parallel workers each gather transitions.
- Advantages are computed with GAE for each worker’s data.
- Updates to the global actor-critic parameters happen asynchronously or synchronously (A2C variant).
- The result: a faster, more stable training pipeline that handles large, complex tasks well (a combined sketch follows this list).
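As a rough end-to-end illustration, the following sketch stitches the two ideas together in the synchronous (A2C-style) variant, assuming NumPy only: the rollouts, the critic values, and the final gradient are all stubbed placeholders, so this shows the structure of the pipeline rather than a working agent.

```python
import numpy as np

def collect_rollout(rng, length=5):
    """Stand-in for one worker's environment interaction (toy random data)."""
    rewards = rng.uniform(0.0, 1.0, size=length)
    values = rng.uniform(0.0, 1.0, size=length)   # pretend critic estimates
    return rewards, values

def gae(rewards, values, gamma=0.99, lam=0.95):
    """GAE(gamma, lambda) for one finite rollout, bootstrapping with V = 0."""
    advantages = np.zeros_like(rewards)
    running, next_value = 0.0, 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages

rng = np.random.default_rng(0)
global_params = np.zeros(4)
lr = 0.05

for iteration in range(3):
    all_advantages = []
    for worker_id in range(4):                       # runs in parallel in real A3C/A2C
        rewards, values = collect_rollout(rng)
        all_advantages.append(gae(rewards, values))  # per-worker GAE
    # Synchronous (A2C-style) step: one update from all workers' data.
    # The gradient is a placeholder; a real implementation backpropagates the
    # advantage-weighted policy-gradient loss plus the critic's value loss.
    placeholder_grad = rng.normal(size=global_params.shape) * np.mean(
        np.concatenate(all_advantages)
    )
    global_params -= lr * placeholder_grad

print("parameters after 3 synchronous updates:", global_params)
```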
7. Summary
- Vanilla Policy Gradient is a great starting point but struggles with sample efficiency and variance.
- A3C addresses these by parallelizing actor-critic agents, providing speed and exploration boosts.
- GAE refines advantage calculations for lower variance and better performance.
- In practice, you’ll find many advanced RL algorithms (like PPO) building on these ideas to create stable, scalable on-policy solutions.