The Soft Actor-Critic Algorithm, Explained with Lunar Lander
Last updated: December 24, 2024
1. Introduction
Soft Actor-Critic (SAC) is an off-policy actor-critic method built on maximum entropy reinforcement learning. Unlike traditional methods (which purely maximize expected return), SAC maximizes the sum of rewards plus an entropy bonus to encourage exploration. Formally, we want to maximize:
$$
J(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t} \gamma^{t}\,\Bigl(r(s_t, a_t) \;+\; \alpha\,\mathcal{H}\bigl(\pi(\cdot \mid s_t)\bigr)\Bigr)\right],
$$
where $\mathcal{H}$ is the policy entropy, $\alpha$ is a temperature parameter that weights the entropy bonus, and $\gamma$ is the discount factor.
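To make the objective concrete, here is a tiny numerical sketch (plain Python with NumPy; the reward and log-probability values are made up purely for illustration) showing how the entropy bonus changes the discounted return a trajectory earns:

```python
import numpy as np

# Hypothetical per-step rewards and log-probabilities of the actions the
# policy actually took (values are made up purely for illustration).
rewards   = np.array([ 1.0, -0.5,  2.0, 10.0])
log_probs = np.array([-1.2, -0.3, -2.0, -0.8])

gamma = 0.99   # discount factor
alpha = 0.2    # temperature: weight on the entropy bonus

discounts = gamma ** np.arange(len(rewards))

# Standard discounted return: sum_t gamma^t * r_t
plain_return = np.sum(discounts * rewards)

# Maximum-entropy return: each step also earns alpha * (-log pi(a_t | s_t)),
# i.e. a bonus for taking less probable (more exploratory) actions.
soft_return = np.sum(discounts * (rewards + alpha * (-log_probs)))

print(f"plain return: {plain_return:.3f}")
print(f"soft (entropy-regularized) return: {soft_return:.3f}")
```

The more improbable the chosen actions (the more negative their log-probabilities), the larger the entropy bonus, which is exactly the pressure toward exploration discussed next.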
1a. Why Maximize Entropy?
- Exploration: A higher-entropy policy is more stochastic, leading to broader exploration of the action space.
- Robustness: If multiple actions yield similar value, a higher-entropy policy avoids prematurely committing to just one.
2. Critic (Soft Q-Function) Update
SAC typically maintains two Q-functions $Q_{\theta_1}$ and $Q_{\theta_2}$ to reduce overestimation bias, but we’ll illustrate the update for a single Q-function $Q_{\theta}$. The target we want $Q_{\theta}$ to match is a “soft” version of the Bellman backup:
$$
J_Q(\theta) \;=\; \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim D}\!
\Biggl[\;
\tfrac{1}{2}\,\bigl(\,
Q_{\theta}(s_t,\, a_t)
\;-\;\bigl[\,
r_t \;+\;\gamma\,\mathbb{E}_{a_{t+1} \sim \pi_{\phi}(\cdot \mid s_{t+1})}\!
\bigl[\,
Q_{\theta_{\text{targ}}}(s_{t+1},\, a_{t+1})
\;-\;\alpha \,\log \pi_{\phi}(a_{t+1} \mid s_{t+1})
\bigr]
\bigr]
\bigr)^{2}
\Biggr].
$$
- $\theta$ are the parameters of the current Q-function.
- $Q_{\theta_{\text{targ}}}$ is a slowly updated or periodically copied target network.
- $\alpha\,\log \pi_\phi$ is the entropy penalty term.
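Below is a minimal PyTorch sketch of this critic loss. The network sizes, the `dummy_policy` stand-in, and the random batch of transitions are placeholder assumptions for illustration, not a reference implementation; the real SAC policy is covered in Section 3.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, act_dim = 8, 2        # continuous Lunar Lander: 8-dim state, 2-dim action
gamma, alpha = 0.99, 0.2       # discount and (fixed, for this sketch) temperature

def make_q_net():
    # Q(s, a): concatenate state and action, output a single scalar value.
    return nn.Sequential(nn.Linear(obs_dim + act_dim, 256), nn.ReLU(),
                         nn.Linear(256, 1))

q_net = make_q_net()                         # Q_theta
q_target_net = make_q_net()                  # Q_theta_targ, a slowly updated copy
q_target_net.load_state_dict(q_net.state_dict())

def dummy_policy(s):
    # Stand-in for pi_phi: a unit Gaussian over actions, returning a sample
    # and its log-probability. The real SAC policy is sketched in Section 3.
    dist = torch.distributions.Normal(torch.zeros(s.shape[0], act_dim), 1.0)
    a = dist.sample()
    logp = dist.log_prob(a).sum(dim=-1, keepdim=True)
    return a, logp

def critic_loss(s, a, r, s_next, done, policy=dummy_policy):
    with torch.no_grad():
        a_next, logp_next = policy(s_next)                    # a' ~ pi_phi(.|s')
        q_next = q_target_net(torch.cat([s_next, a_next], dim=-1))
        # Soft Bellman target: r + gamma * (Q_targ(s', a') - alpha * log pi(a'|s'))
        target = r + gamma * (1.0 - done) * (q_next - alpha * logp_next)
    q = q_net(torch.cat([s, a], dim=-1))
    # Mean-squared error to the target (the 1/2 factor only rescales gradients).
    return F.mse_loss(q, target)

# Example on a random batch of 4 transitions:
B = 4
loss = critic_loss(torch.randn(B, obs_dim), torch.randn(B, act_dim),
                   torch.randn(B, 1), torch.randn(B, obs_dim), torch.zeros(B, 1))
```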
2a. Intuition
- The target includes $-\alpha \log \pi$ for the next action, which acts as an entropy bonus: a highly certain (low-entropy) policy earns a smaller bonus and thus lower Q-targets, which encourages exploration.
- The critic’s job is to estimate how good an action is (Q-value) by accounting for immediate reward and future possibilities in a “soft” sense.
2b. Lunar Lander Angle
- If the Q-function sees that small thrust corrections at certain angles lead to consistently safer landings, it will assign higher Q-values to those actions.
- However, the “soft” nature (subtracting $\alpha \,\log\pi$) ensures we don’t overconfidently converge on a single thrust approach—there’s still some preference for exploring alternative thrust maneuvers.
3. Actor (Policy) Update
The policy $\pi_\phi(a\mid s)$ is often parameterized by a neural network outputting the mean and variance for a Gaussian distribution. We optimize:
$$
J_\pi(\phi)
\;=\;
\mathbb{E}_{s_t \sim \mathcal{D}}
\biggl[
\mathbb{E}_{a_t \sim \pi_\phi(\cdot \mid s_t)}
\biggl[
\alpha \,\log \pi_\phi(a_t \mid s_t)
\;-\;
Q_{\theta}(s_t, a_t)
\biggr]
\biggr].
$$
- The $\alpha \,\log \pi_\phi$ term increases entropy, pushing the policy to stay stochastic.
- The $-Q_{\theta}$ term steers the policy toward actions with high Q-values.
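A minimal PyTorch sketch of such a policy and the actor loss might look as follows. The network outputs a mean and log standard deviation (an equivalent parameterization of the variance); the tanh squashing into $[-1, 1]$, the layer sizes, and the reuse of `q_net` from the critic sketch are assumptions that mirror common SAC implementations rather than a specific codebase:

```python
import torch
import torch.nn as nn

obs_dim, act_dim, alpha = 8, 2, 0.2

class GaussianPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU())
        self.mean_head = nn.Linear(256, act_dim)
        self.log_std_head = nn.Linear(256, act_dim)

    def forward(self, s):
        h = self.body(s)
        mean = self.mean_head(h)
        std = self.log_std_head(h).clamp(-20, 2).exp()
        dist = torch.distributions.Normal(mean, std)
        u = dist.rsample()                 # reparameterized sample, keeps gradients
        a = torch.tanh(u)                  # squash actions into [-1, 1] thrust range
        # log pi(a|s) with the change-of-variables correction for the tanh squash.
        logp = dist.log_prob(u).sum(dim=-1, keepdim=True)
        logp -= torch.log(1.0 - a.pow(2) + 1e-6).sum(dim=-1, keepdim=True)
        return a, logp

policy = GaussianPolicy()

def actor_loss(s, q_net):
    a, logp = policy(s)
    q = q_net(torch.cat([s, a], dim=-1))
    # Minimize alpha * log pi(a|s) - Q(s, a), i.e. maximize Q - alpha * log pi.
    return (alpha * logp - q).mean()

# Example, reusing q_net from the critic sketch above:
# pi_loss = actor_loss(torch.randn(4, obs_dim), q_net)
```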
3a. Intuition
- Minimizing $\alpha \log \pi - Q$ is equivalent to maximizing $Q - \alpha \log \pi$.
- Hence, the policy tries to pick actions that yield high expected return while remaining randomized.
3b. Lunar Lander Angle
- If firing the side thruster at certain angles results in a good predicted Q-value, the policy becomes more likely to choose that thruster firing.
- The randomness is still maintained, so the agent might discover a new thrust pattern that lands even more smoothly.
4. Temperature $\alpha$ Update
Instead of treating $\alpha$ as a fixed constant, we can let the algorithm learn it to keep the policy’s entropy near a desired target $\bar{\mathcal{H}}$. We minimize:
$$
J(\alpha)
\;=\;
\mathbb{E}_{s_t \sim \mathcal{D},\; a_t \sim \pi_\phi}
\biggl[
-\alpha \log \pi_\phi(a_t \mid s_t)
\;-\;
\alpha \,\overline{\mathcal{H}}
\biggr].
$$
Taking gradients and updating $\alpha$ accordingly keeps the actual policy entropy close to $\bar{\mathcal{H}}$.
- If the policy entropy is below the target, the algorithm increases $\alpha$ (encouraging more exploration).
- If the policy is too random, $\alpha$ decreases.
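A minimal PyTorch sketch of this temperature update is shown below. Optimizing $\log \alpha$ (to keep $\alpha$ positive) and setting the target entropy to $-|\mathcal{A}|$ are common conventions assumed here, not requirements of the method:

```python
import torch

act_dim = 2
target_entropy = -float(act_dim)        # common heuristic: H_bar = -|A|

# Learn log(alpha) rather than alpha itself, so alpha stays positive.
log_alpha = torch.zeros(1, requires_grad=True)
alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)

def temperature_update(logp):
    # logp: log pi_phi(a_t | s_t) for actions freshly sampled from the current
    # policy; detached so that only alpha receives gradients here.
    alpha = log_alpha.exp()
    loss = (-alpha * (logp.detach() + target_entropy)).mean()
    alpha_optimizer.zero_grad()
    loss.backward()
    alpha_optimizer.step()
    return log_alpha.exp().item()

# If the batch entropy (-logp on average) drops below the target, the gradient
# pushes alpha up; if the policy is more random than the target, alpha shrinks.
```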
4a. Lunar Lander Angle
- Early training might see large, random thruster firings, so the algorithm might reduce $\alpha$ if it sees that such extreme randomness is unproductive.
- Later, if the policy starts repeating a single approach, $\alpha$ might increase so that the pilot tries a few new maneuvers in case they yield even better landings.
5. Putting It All Together
5a. Experience Collection
- You have your Lunar Lander pilot (the actor) pick an action (firing a thruster) based on the current state (the lander's position, velocity, angle, etc.).
- You observe the new state and the reward (how good or bad the landing attempt was), and you store these transitions in a replay buffer $\mathcal{D}$.
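A minimal collection loop might look like the sketch below, using Gymnasium's continuous Lunar Lander and a plain `deque` as the replay buffer. The environment id, the random behaviour policy, and the buffer size are assumptions for illustration (a Box2D-enabled Gymnasium install is required):

```python
import random
from collections import deque

import gymnasium as gym
import numpy as np

# Environment id is version-dependent: "LunarLander-v2" on older Gymnasium
# releases, "LunarLander-v3" on newer ones.
env = gym.make("LunarLander-v3", continuous=True)

replay_buffer = deque(maxlen=100_000)       # D: a bounded FIFO of transitions

state, _ = env.reset(seed=0)
for _ in range(1000):
    # Placeholder behaviour policy: random thrust commands from the action space.
    action = env.action_space.sample()
    next_state, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
    replay_buffer.append((state, action, reward, next_state, float(done)))
    state = env.reset()[0] if done else next_state

def sample_batch(batch_size=256):
    batch = random.sample(replay_buffer, batch_size)
    s, a, r, s2, d = map(np.array, zip(*batch))
    return s, a, r, s2, d
```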
5b. Critic Update
- Sample a batch from $\mathcal{D}$.
- For each transition, compute the soft Bellman target, adding the entropy penalty.
- Update the Q-network parameters to reduce the mean-squared error to these targets.
5c. Actor Update
- Using the updated Q-function, adjust the policy parameters $\phi$.
- This step tries to make the policy pick higher-Q actions while preserving randomness.
5d. Temperature Update
- Adjust $\alpha$ based on whether the policy's entropy is higher or lower than the target.
- This keeps the pilot from becoming too timid (no exploration) or too reckless (excessive exploration).
5e. Repeat
- As you cycle through these steps, the Q-functions become more accurate, the policy becomes more effective, and the balance between exploration and exploitation is tuned automatically.
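Tying the earlier sketches together, one iteration of this loop could look like the following. It reuses the hypothetical helpers from the previous sections (`sample_batch`, `critic_loss`, `q_net`, `q_target_net`, `policy`, `actor_loss`, `temperature_update`); the optimizers and the Polyak coefficient `tau` are additional assumptions:

```python
import torch

# Glue for one SAC iteration, reusing the hypothetical pieces sketched above:
# sample_batch (5a), q_net / q_target_net / critic_loss (Section 2),
# policy / actor_loss (Section 3), temperature_update (Section 4).
q_optimizer = torch.optim.Adam(q_net.parameters(), lr=3e-4)
pi_optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)
tau = 0.005                                  # Polyak coefficient (assumed)

def to_tensor(x):
    return torch.as_tensor(x, dtype=torch.float32).reshape(len(x), -1)

def sac_update_step():
    # 5b. Critic update: fit Q_theta to the soft Bellman targets.
    s, a, r, s2, d = map(to_tensor, sample_batch())
    q_loss = critic_loss(s, a, r, s2, d, policy)
    q_optimizer.zero_grad(); q_loss.backward(); q_optimizer.step()

    # 5c. Actor update: prefer high-Q actions while staying stochastic.
    pi_loss = actor_loss(s, q_net)
    pi_optimizer.zero_grad(); pi_loss.backward(); pi_optimizer.step()

    # 5d. Temperature update: nudge alpha toward the target entropy.
    _, logp = policy(s)
    temperature_update(logp)

    # Slowly track the online critic with the target network (Polyak averaging).
    with torch.no_grad():
        for p, p_targ in zip(q_net.parameters(), q_target_net.parameters()):
            p_targ.mul_(1.0 - tau).add_(tau * p)
```

For a complete training run you would also loop over environment steps, feed the learned $\alpha$ back into the critic and actor losses, and use two Q-networks as noted in Section 2; this sketch omits those details for brevity.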
6. Example Trajectory in Lunar Lander
- Initially, the lander might randomly fire thrusters, leading to chaotic behavior.
- The Q-networks observe each outcome and learn to associate certain thrust decisions with better future rewards (e.g., lower crash risk, less fuel penalty).
- The policy gradually shifts to produce thrust patterns that stabilize landing. It remains partially stochastic, so it might "experiment" with alternative angles and thrust intensities.
- Over many episodes, the temperature $\alpha$ typically decreases (the policy focuses more on the best strategies it has found) while still preserving a bit of randomness.
7. Summary
- Entropy Maximization: SAC ensures the policy stays exploratory by adding an entropy term to the objective.
- Off-Policy: By using a replay buffer, SAC can learn from experience sampled arbitrarily, making it sample-efficient.
- Automatic Entropy Tuning: $\alpha$ is adjusted in real time based on the policy's current entropy, removing the need for manual balancing of exploration vs. exploitation.
- Practical Performance: In environments like Lunar Lander, SAC learns robust thrust commands that handle various initial conditions, eventually achieving stable landings with minimal crashes.
Overall, Soft Actor-Critic is a powerful method for continuous control tasks, ensuring both high reward and ample exploration. In the Lunar Lander setting, it strikes a balance between carefully aiming for soft, stable touchdowns and staying open to new thruster combinations that might lead to better landings.