Soft Actor-Critic Method

Reinforcement Learning

Last updated: November 25, 2024

1. Introduction

The Soft Actor-Critic (SAC) method can be viewed as a maximum entropy version of the Deep Deterministic Policy Gradient (DDPG) algorithm. By incorporating an entropy term into the optimization objective, SAC encourages exploration, prevents premature convergence, and yields a more robust learning strategy.
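Concretely, instead of maximizing expected return alone, SAC maximizes an objective that adds an entropy bonus weighted by a temperature coefficient α:

$$J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \left[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]$$

Here $$\mathcal{H}$$ denotes the entropy of the policy, and α controls the trade-off between reward and entropy.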

This structured blend of reward maximization and entropy optimization makes SAC particularly effective in continuous action spaces, where balancing exploration and exploitation is crucial for long-term success.

2. Soft Actor-Critic (SAC) Algorithm Steps

Here's a step-by-step breakdown of the Soft Actor-Critic (SAC) algorithm, along with the mathematical equations and textual explanations for each step:

2a. Initialize Parameters

Initialize the policy parameters θ, the Q-function parameters φ1 and φ2, and an empty replay buffer D. Set the target Q-function parameters equal to the main parameters: $$\phi_{\text{targ},1} = \phi_1, \quad \phi_{\text{targ},2} = \phi_2$$
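As a concrete illustration (not part of the algorithm specification), the following PyTorch sketch sets up the policy network, the two Q-networks, their target copies, and the replay buffer. The network sizes and the two-headed Gaussian policy output are illustrative assumptions.

import copy
import torch.nn as nn
from collections import deque

obs_dim, act_dim, hidden = 3, 1, 256   # example sizes; set these to match your environment

def mlp(in_dim, out_dim):
    # Small fully connected network used for both the policy and the Q-functions.
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

policy = mlp(obs_dim, 2 * act_dim)     # theta: outputs mean and log-std of a Gaussian
q1 = mlp(obs_dim + act_dim, 1)         # phi_1
q2 = mlp(obs_dim + act_dim, 1)         # phi_2

# Target networks start as exact copies: phi_targ,i = phi_i.
q1_targ = copy.deepcopy(q1)
q2_targ = copy.deepcopy(q2)

# Empty replay buffer D.
replay_buffer = deque(maxlen=1_000_000)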

2b. Repeat for Each Episode or Until Convergence

  • Observation and Action Selection
    • Observe the current state s.
    • Select an action a by sampling from the current stochastic policy: $$a \sim \pi_\theta(\cdot|s)$$
  • Execute Action in Environment
    • Execute action a in the environment, transitioning to the next state s', receiving a reward r, and a done signal d (indicating whether s' is terminal).
  • Store Transition in Replay Buffer
    • Store the transition (s, a, r, s', d) in the replay buffer D.
    • If s' is a terminal state, reset the environment state. (A minimal code sketch of this interaction loop is given after this list.)
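The sketch below illustrates this interaction loop. It assumes the Gymnasium API and uses a placeholder random policy standing in for π_θ; the environment name and buffer size are arbitrary choices.

import gymnasium as gym
from collections import deque

# Replay buffer D: a bounded FIFO of (s, a, r, s', d) tuples.
replay_buffer = deque(maxlen=1_000_000)

env = gym.make("Pendulum-v1")          # any continuous-action environment
state, _ = env.reset()

for step in range(10_000):
    # Placeholder for a ~ pi_theta(.|s): here we simply sample a random action.
    action = env.action_space.sample()

    # Execute the action; d indicates whether s' is terminal.
    next_state, reward, terminated, truncated, _ = env.step(action)
    done = terminated

    # Store the transition (s, a, r, s', d) in D.
    replay_buffer.append((state, action, reward, next_state, done))

    # Reset the environment if s' is terminal (or the episode was truncated).
    if terminated or truncated:
        state, _ = env.reset()
    else:
        state = next_state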

2c. Update Networks

If it’s time to update, repeat for a predefined number of update steps:

  • Sample a Batch of Transitions
    • Randomly sample a batch of transitions B = {(s, a, r, s', d)} from the replay buffer D.
  • Compute the Q-Function Targets
    • Compute the target for both Q-functions: $$y(r, s', d) = r + \gamma (1 - d) \left( \min_{i=1,2} Q_{\phi_{\text{targ},i}}(s', a') - \alpha \log \pi_\theta(a' | s') \right)$$ where, $$a' \sim \pi_\theta(\cdot | s')$$
    • This target represents the expected return, taking the minimum value from the two target Q-functions to reduce overestimation and incorporating entropy (via α) for exploration.
  • Update the Q-Functions
    • Update each Q-function by one step of gradient descent on the mean squared error against the target: $$\nabla_{\phi_i} \frac{1}{|B|} \sum_{(s, a, r, s', d) \in B} \left( Q_{\phi_i}(s, a) - y(r, s', d) \right)^2 \quad \text{for } i = 1, 2$$
  • Update the Policy
    • Update the policy by one step of gradient ascent on $$\frac{1}{|B|} \sum_{s \in B} \left( \min_{i=1,2} Q_{\phi_i}(s, \tilde{a}_\theta(s)) - \alpha \log \pi_\theta(\tilde{a}_\theta(s) | s) \right)$$ where $$\tilde{a}_\theta(s)$$ is an action sampled from $$\pi_\theta(\cdot | s)$$ via the reparameterization trick, so that it is differentiable with respect to θ.
  • Update the Target Networks
    • Softly update the target networks with Polyak averaging: $$\phi_{\text{targ},i} \leftarrow \rho\, \phi_{\text{targ},i} + (1 - \rho)\, \phi_i \quad \text{for } i = 1, 2$$
    • This soft update keeps the target networks close to the Q-networks while reducing instability in training.
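As one possible realization of these update equations, the PyTorch sketch below performs a single update step. It assumes the policy, q1, q2, q1_targ, and q2_targ networks from the step 2a sketch, a tanh-squashed Gaussian policy head, and a batch of transitions already converted to tensors; it is a sketch under those assumptions, not a complete implementation.

import torch
import torch.nn.functional as F
from torch.distributions import Normal

gamma, alpha, rho = 0.99, 0.2, 0.995          # discount, temperature, Polyak coefficient
q_params = list(q1.parameters()) + list(q2.parameters())
q_optim = torch.optim.Adam(q_params, lr=3e-4)
pi_optim = torch.optim.Adam(policy.parameters(), lr=3e-4)

def sample_action(s):
    # Reparameterized sample a = tanh(mu + sigma * eps) and its log-probability.
    mu, log_std = policy(s).chunk(2, dim=-1)
    dist = Normal(mu, log_std.clamp(-20, 2).exp())
    u = dist.rsample()
    a = torch.tanh(u)
    log_p = (dist.log_prob(u) - torch.log(1 - a.pow(2) + 1e-6)).sum(-1, keepdim=True)
    return a, log_p

def update(s, a, r, s2, d):
    # s, a, r, s2, d: tensors sampled from D; r and d have shape [batch, 1].

    # Targets: y = r + gamma * (1 - d) * (min_i Q_targ,i(s', a') - alpha * log pi(a'|s')).
    with torch.no_grad():
        a2, log_p2 = sample_action(s2)
        q_targ = torch.min(q1_targ(torch.cat([s2, a2], dim=-1)),
                           q2_targ(torch.cat([s2, a2], dim=-1)))
        y = r + gamma * (1 - d) * (q_targ - alpha * log_p2)

    # Q-function update: gradient descent on the mean squared error against y.
    sa = torch.cat([s, a], dim=-1)
    q_loss = F.mse_loss(q1(sa), y) + F.mse_loss(q2(sa), y)
    q_optim.zero_grad(); q_loss.backward(); q_optim.step()

    # Policy update: ascent on min_i Q_i(s, a~) - alpha * log pi(a~|s),
    # implemented as descent on the negated objective.
    a_new, log_p_new = sample_action(s)
    q_new = torch.min(q1(torch.cat([s, a_new], dim=-1)),
                      q2(torch.cat([s, a_new], dim=-1)))
    pi_loss = (alpha * log_p_new - q_new).mean()
    pi_optim.zero_grad(); pi_loss.backward(); pi_optim.step()

    # Polyak averaging: phi_targ <- rho * phi_targ + (1 - rho) * phi.
    with torch.no_grad():
        targ_params = list(q1_targ.parameters()) + list(q2_targ.parameters())
        for p_targ, p in zip(targ_params, q_params):
            p_targ.mul_(rho).add_((1 - rho) * p)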

2d. Repeat Until Convergence

Continue these steps until the SAC algorithm converges, meaning that the policy and Q-functions stabilize.

3. Summary

This structured process allows SAC to maintain stable and sample-efficient learning by balancing exploration and exploitation with entropy maximization.
