Soft Actor-Critic Algorithm Explained with Lunar Lander

Deep Reinforcement Learning

Last updated: December 24, 2024

1. Introduction

Soft Actor-Critic (SAC) is an off-policy actor-critic algorithm built on the maximum entropy reinforcement learning framework. Unlike traditional methods, which purely maximize expected return, SAC maximizes the expected discounted sum of rewards plus a per-step entropy bonus that encourages exploration. Formally, we want to maximize:

$$\mathbb{E}_{\tau \sim \pi} \Bigg[ \sum_{t=0}^{\infty} \gamma^t \bigl( r(s_t, a_t) \;+\; \alpha \,\mathcal{H}(\pi(\cdot \mid s_t)) \bigr) \Bigg],$$

where $\mathcal{H}$ is the policy entropy, $\alpha$ is a temperature parameter, and $\gamma$ is the discount factor.
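
To make this concrete, here is a tiny sketch that computes the entropy-augmented return for a hand-written list of rewards and per-step entropies; the values of $\gamma$, $\alpha$, and the numbers themselves are purely illustrative.

```python
def soft_return(rewards, entropies, gamma=0.99, alpha=0.2):
    """Discounted return with the per-step entropy bonus added in.

    rewards[t] plays the role of r(s_t, a_t) and entropies[t] of
    H(pi(.|s_t)); gamma and alpha are illustrative, not tuned, values.
    """
    total = 0.0
    for t, (r, h) in enumerate(zip(rewards, entropies)):
        total += (gamma ** t) * (r + alpha * h)
    return total

# Toy numbers standing in for a few steps of a Lunar Lander episode.
print(soft_return(rewards=[-0.3, 1.2, 100.0], entropies=[1.4, 1.1, 0.9]))
```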

1a. Why Maximize Entropy?

2. Critic (Soft Q-Function) Update

SAC typically maintains two Q-functions $Q_{\theta_1}$ and $Q_{\theta_2}$ to reduce overestimation bias, but we’ll illustrate the update for a single Q-function $Q_{\theta}$. The target we want $Q_{\theta}$ to match is a “soft” version of the Bellman backup, computed with a target network $Q_{\theta_{\text{targ}}}$ whose weights are a slowly updated (Polyak-averaged) copy of $\theta$:

$$
J_Q(\theta) \;=\; \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim D}\!
\Biggl[\;
\tfrac{1}{2}\,\bigl(\,
Q_{\theta}(s_t,\, a_t)
\;-\;\bigl[\,
r_t \;+\;\gamma\,\mathbb{E}_{a_{t+1} \sim \pi_{\phi}(\cdot \mid s_{t+1})}\!
\bigl[\,
Q_{\theta_{\text{targ}}}(s_{t+1},\, a_{t+1})
\;-\;\alpha \,\log \pi_{\phi}(a_{t+1} \mid s_{t+1})
\bigr]
\bigr]
\bigr)^{2}
\Biggr].
$$
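
In code, this is just a mean-squared error against a target that is held fixed during the update. Below is a minimal PyTorch sketch for a single Q-network; `q_net`, `q_target_net`, and `policy` are placeholder modules with the interfaces noted in the comments, and the `(1 - done)` mask, which zeroes out the bootstrap term at terminal states, is an implementation detail not shown in the equation above.

```python
import torch
import torch.nn.functional as F

def critic_loss(q_net, q_target_net, policy, batch, gamma=0.99, alpha=0.2):
    """One-sample estimate of J_Q(theta) for a single Q-network.

    Assumed interfaces: q_net(s, a) -> Q-value tensor,
    policy.rsample(s) -> (action, log_prob).
    """
    s, a, r, s_next, done = batch  # tensors sampled from the replay buffer D

    with torch.no_grad():
        # Sample a_{t+1} ~ pi_phi(.|s_{t+1}) and form the soft Bellman target.
        a_next, logp_next = policy.rsample(s_next)
        q_next = q_target_net(s_next, a_next)
        target = r + gamma * (1.0 - done) * (q_next - alpha * logp_next)

    # Squared error between Q_theta(s_t, a_t) and the fixed target.
    return F.mse_loss(q_net(s, a), target)
```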

2a. Intuition

2b. Lunar Lander Angle

3. Actor (Policy) Update

The policy $\pi_\phi(a\mid s)$ is typically parameterized by a neural network that outputs the mean and (log) standard deviation of a Gaussian distribution; sampled actions are then squashed through a $\tanh$ so they respect the environment's action bounds. We optimize:

$$
J_\pi(\phi)
\;=\;
\mathbb{E}_{s_t \sim \mathcal{D}}
\biggl[
\mathbb{E}_{a_t \sim \pi_\phi(\cdot \mid s_t)}
\biggl[
\alpha \,\log \pi_\phi(a_t \mid s_t)
\;-\;
Q_{\theta}(s_t, a_t)
\biggr]
\biggr]. 
$$
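
Below is a minimal sketch of both pieces: a tanh-squashed Gaussian policy head (sized for the continuous Lunar Lander's 8-dimensional observations and 2-dimensional actions) and the corresponding loss, which uses the reparameterization trick so gradients flow through the sampled action into $\phi$. The class and method names are illustrative, not any particular library's API.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Tanh-squashed Gaussian policy (layer sizes are illustrative)."""
    def __init__(self, obs_dim=8, act_dim=2, hidden=256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.mean = nn.Linear(hidden, act_dim)
        self.log_std = nn.Linear(hidden, act_dim)

    def rsample(self, obs):
        """Reparameterized action and its log-probability."""
        h = self.body(obs)
        mean, log_std = self.mean(h), self.log_std(h).clamp(-20, 2)
        dist = torch.distributions.Normal(mean, log_std.exp())
        pre_tanh = dist.rsample()            # differentiable sample
        action = torch.tanh(pre_tanh)        # squash into [-1, 1]
        # Change-of-variables correction for the tanh squashing.
        log_prob = dist.log_prob(pre_tanh) - torch.log(1 - action.pow(2) + 1e-6)
        return action, log_prob.sum(dim=-1, keepdim=True)

def actor_loss(q_net, policy, states, alpha=0.2):
    """One-sample estimate of J_pi(phi); the critic is treated as fixed here."""
    actions, log_probs = policy.rsample(states)
    # Minimize  alpha * log pi(a|s) - Q(s, a),  averaged over the batch.
    return (alpha * log_probs - q_net(states, actions)).mean()
```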

3a. Intuition

3b. Lunar Lander Angle

4. Temperature $\alpha$ Update

Instead of treating $\alpha$ as a fixed constant, we can let the algorithm learn it to keep the policy’s entropy near a desired target $\bar{\mathcal{H}}$. We minimize:

$$
J(\alpha)
\;=\;
\mathbb{E}_{s_t \sim \mathcal{D},\; a_t \sim \pi_\phi}
\biggl[
-\alpha \log \pi_\phi(a_t \mid s_t)
\;-\;
\alpha \,\bar{\mathcal{H}}
\biggr]. 
$$

Taking gradients with respect to $\alpha$ and updating it accordingly keeps the actual policy entropy close to $\bar{\mathcal{H}}$: when the policy's entropy falls below the target, the update pushes $\alpha$ up and strengthens the entropy bonus; when the entropy exceeds the target, $\alpha$ shrinks.
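
Here is a minimal PyTorch sketch of that update. We optimize $\log \alpha$ rather than $\alpha$ itself so the temperature stays positive; the target entropy of $-2$ follows the common $-\dim(\mathcal{A})$ heuristic for Lunar Lander's two continuous actions, and the learning rate is illustrative.

```python
import torch

# Optimize log(alpha) so that alpha = exp(log_alpha) stays positive.
log_alpha = torch.zeros(1, requires_grad=True)
alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)

def temperature_update(log_probs, target_entropy=-2.0):
    """One gradient step on J(alpha).

    log_probs are log pi_phi(a_t|s_t) for actions freshly sampled from the
    current policy; they are detached because this step only adapts alpha.
    """
    alpha_loss = -(log_alpha.exp() * (log_probs.detach() + target_entropy)).mean()
    alpha_optimizer.zero_grad()
    alpha_loss.backward()
    alpha_optimizer.step()
    return log_alpha.exp().detach()  # current alpha for the other losses
```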

4a. Lunar Lander Angle

5. Putting It All Together

5a. Experience Collection

5b. Critic Update

5c. Actor Update

5d. Temperature Update

5e. Repeat

  • As you cycle through these steps (see the sketch below), the Q-functions become more accurate, the policy becomes more effective, and the balance between exploration and exploitation is tuned automatically.
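
The sketch below ties the earlier pieces together into one training iteration, reusing `critic_loss`, `actor_loss`, `temperature_update`, and `log_alpha` from the snippets above along with a placeholder replay buffer; the batch size, $\tau$, and the single-critic simplification are all illustrative choices.

```python
import torch

def sac_iteration(replay_buffer, policy, q_net, q_target_net,
                  q_optimizer, policy_optimizer, tau=0.005):
    """One SAC iteration, reusing the loss sketches from earlier sections."""
    # 5a. Experience collection happens outside this function: the agent acts
    #     with a ~ pi_phi(.|s) and stores (s, a, r, s', done) in the buffer.
    batch = replay_buffer.sample(256)   # (s, a, r, s_next, done) tensors
    states = batch[0]
    alpha = log_alpha.exp().detach()

    # 5b. Critic update: regress Q_theta toward the soft Bellman target.
    q_optimizer.zero_grad()
    critic_loss(q_net, q_target_net, policy, batch, alpha=alpha).backward()
    q_optimizer.step()

    # 5c. Actor update: prefer actions that are high-value and high-entropy.
    policy_optimizer.zero_grad()
    actor_loss(q_net, policy, states, alpha=alpha).backward()
    policy_optimizer.step()

    # 5d. Temperature update: nudge alpha so entropy tracks the target H-bar.
    _, log_probs = policy.rsample(states)
    temperature_update(log_probs)

    # Target-network update: let Q_targ slowly track the online critic
    # (Polyak averaging), keeping the critic's regression target stable.
    with torch.no_grad():
        for p, p_targ in zip(q_net.parameters(), q_target_net.parameters()):
            p_targ.mul_(1.0 - tau).add_(tau * p)
    # 5e. Repeat: call this function once (or a few times) per environment step.
```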

6. Example Trajectory in Lunar Lander

7. Summary

Overall, Soft Actor-Critic is a powerful method for continuous control tasks, jointly optimizing for high reward and sustained exploration. In the Lunar Lander setting, it strikes a balance between carefully aiming for soft, stable touchdowns and staying open to new thruster combinations that might lead to better landings.
