1. Introduction
The Soft Actor-Critic (SAC) method can be viewed as a maximum entropy counterpart of Deep Deterministic Policy Gradient (DDPG)-style algorithms: like DDPG, it is an off-policy actor-critic method for continuous control, but it learns a stochastic policy and adds an entropy term to the objective. By incorporating entropy into the optimization process, SAC encourages exploration, prevents premature convergence, and results in a more robust learning strategy. Here’s how SAC achieves these goals:
- Balancing Reward and Entropy: SAC maximizes not only the reward but also entropy, which acts as a measure of exploration. This combination enhances the agent’s ability to explore the environment thoroughly, avoiding local optima and gaining a deeper understanding of the reward landscape.
- Entropy-Enhanced Value Function: The (soft) value function in SAC includes an entropy term, so the value of a state reflects not only the expected reward but also how much randomness the policy retains there. By accounting for both reward and exploration (entropy), the value function provides a richer, more balanced evaluation of the agent’s expected return.
- Q-Target with Entropy: The target for the Q-value network is computed as the reward plus the expected value at the next time step, with entropy influencing this target. This ensures that the Q-value network captures both immediate reward and the incentive for exploratory behavior, guiding the agent toward both efficient and exploratory actions.
- Policy Update via KL Divergence: The policy is updated by minimizing the Kullback-Leibler (KL) divergence between the current policy and a target policy defined by the exponentiated Q-values. This approach optimizes the policy to achieve high Q-values while sustaining the exploratory behavior promoted by the entropy term.
- Iterative Refinement: SAC iteratively refines both policy and value estimates, reinforcing an exploration-driven learning process. This approach leads to a more adaptable and stable policy, especially beneficial in environments with complex dynamics.
This structured blend of reward maximization and entropy optimization makes SAC particularly effective in continuous action spaces, where balancing exploration and exploitation is crucial for long-term success.
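Concretely, the points above correspond to the maximum entropy objective that SAC optimizes. In one common formulation with a fixed temperature α, the policy is trained to maximize
$$J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \left[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]$$
where $\mathcal{H}$ denotes entropy and the temperature α controls the trade-off between reward and exploration (setting α = 0 recovers the standard expected-return objective).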
2. Soft Actor-Critic (SAC) Algorithm Steps
Here's a step-by-step breakdown of the Soft Actor-Critic (SAC) algorithm, along with the mathematical equations and textual explanations for each step:
2a. Initialize Parameters
Initialize the policy parameters θ, the Q-function parameters $\phi_1$ and $\phi_2$, and an empty replay buffer D. Set the target Q-function parameters equal to the main parameters: $$ \phi_{\text{targ},1} = \phi_1, \quad \phi_{\text{targ},2} = \phi_2 $$
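As an illustration of this step, here is a minimal PyTorch sketch. The `QNet` and `PolicyNet` classes, layer sizes, example dimensions, and buffer capacity are assumptions made for this example rather than prescribed by SAC itself; the later sketches in Section 2c reuse these objects.

```python
import copy
import math
from collections import deque

import torch
import torch.nn as nn
import torch.nn.functional as F


class QNet(nn.Module):
    """Q_phi(s, a): a simple MLP critic."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1)).squeeze(-1)


class PolicyNet(nn.Module):
    """pi_theta(a|s): a squashed-Gaussian (tanh) stochastic policy."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mu = nn.Linear(hidden, act_dim)
        self.log_std = nn.Linear(hidden, act_dim)

    def forward(self, s):
        h = self.body(s)
        mu, log_std = self.mu(h), self.log_std(h).clamp(-20, 2)
        dist = torch.distributions.Normal(mu, log_std.exp())
        u = dist.rsample()                 # reparameterized sample (differentiable in theta)
        a = torch.tanh(u)                  # squash the action into (-1, 1)
        # log pi(a|s), with the change-of-variables correction for tanh
        logp = dist.log_prob(u).sum(-1)
        logp -= (2 * (math.log(2) - u - F.softplus(-2 * u))).sum(-1)
        return a, logp


obs_dim, act_dim = 3, 1                                     # example dimensions
policy = PolicyNet(obs_dim, act_dim)                        # theta
q1, q2 = QNet(obs_dim, act_dim), QNet(obs_dim, act_dim)     # phi_1, phi_2

# Target Q-functions start as exact copies: phi_targ,i = phi_i
q1_targ, q2_targ = copy.deepcopy(q1), copy.deepcopy(q2)
for p in list(q1_targ.parameters()) + list(q2_targ.parameters()):
    p.requires_grad = False

replay_buffer = deque(maxlen=1_000_000)                     # empty replay buffer D
```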
2b. Repeat for Each Episode or Until Convergence
- Observation and Action Selection
- Observe the current state s.
- Select an action a according to the current policy (which is typically a stochastic policy): $$a \sim \pi_\theta(\cdot|s)$$
- Execute Action in Environment
- Execute action a in the environment, transitioning to the next state s', receiving a reward r, and a done signal d (indicating whether s' is terminal).
- Store Transition in Replay Buffer
- Store the transition (s, a, r, s', d) in the replay buffer D.
- If s' is a terminal state, reset the environment state (a code sketch of this interaction loop follows the list).
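A sketch of this interaction loop, assuming a Gymnasium-style environment and reusing the `policy` and `replay_buffer` objects from the initialization sketch above (the environment name and step count are placeholder choices):

```python
import gymnasium as gym
import torch

env = gym.make("Pendulum-v1")     # placeholder environment
s, _ = env.reset()

for t in range(1000):
    # Select a ~ pi_theta(.|s) from the current stochastic policy
    with torch.no_grad():
        a, _ = policy(torch.as_tensor(s, dtype=torch.float32))
    a = a.numpy()

    # Execute a, observe s', r, and the done signal d
    s_next, r, terminated, truncated, _ = env.step(a)
    d = float(terminated)         # d marks true terminal states only

    # Store the transition (s, a, r, s', d) in D
    replay_buffer.append((s, a, r, s_next, d))

    # Reset when the episode ends, otherwise continue from s'
    s = env.reset()[0] if (terminated or truncated) else s_next
```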
2c. Update Networks
If it’s time to update, repeat for a predefined number of update steps:
- Sample a Mini-Batch from Replay Buffer
- Randomly sample a batch of transitions B = {(s, a, r, s', d)} from the replay buffer D.
- Compute Target for Q-Functions
Compute the target value y(r, s', d) for the Q-functions:$$y(r, s', d) = r + \gamma (1 - d) \left( \min_{i=1,2} Q_{\phi_{\text{targ},i}}(s', a') - \alpha \log \pi_\theta(a'|s') \right)$$
where $$a' \sim \pi_\theta(\cdot | s')$$ is a fresh action sampled from the current policy at the next state s'.
This target combines the immediate reward with the discounted soft value of the next state: taking the minimum of the two target Q-functions reduces overestimation, and the α-weighted log-probability term rewards exploration.
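A sketch of this target computation, reusing the objects from the earlier snippets (the values of γ and α here are illustrative, not prescribed):

```python
import torch

gamma, alpha = 0.99, 0.2       # discount factor and (fixed) entropy temperature

def compute_target(r, s_next, d):
    """y(r, s', d): reward plus the discounted soft value of the next state."""
    with torch.no_grad():
        a_next, logp_next = policy(s_next)                  # a' ~ pi_theta(.|s')
        q_next = torch.min(q1_targ(s_next, a_next),         # min over the two
                           q2_targ(s_next, a_next))         # target critics
        return r + gamma * (1.0 - d) * (q_next - alpha * logp_next)
```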
- Update Q-Functions
- For each Q-function $$Q_{\phi_i}$$ (where i = 1, 2), update the parameters $\phi_i$ by one step of gradient descent on the mean-squared Bellman error, using the gradient:
$$\nabla_{\phi_i} \frac{1}{|B|} \sum_{(s, a, r, s', d) \in B} \left( Q_{\phi_i}(s, a) - y(r, s', d) \right)^2 $$
- This step minimizes the squared difference between the predicted Q-values and the targets y(r, s', d).
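Continuing the sketch, one critic update step might look like this (using `compute_target` from the previous snippet; the learning rate is an illustrative choice):

```python
import torch
import torch.nn.functional as F

q_optimizer = torch.optim.Adam(list(q1.parameters()) + list(q2.parameters()), lr=3e-4)

def update_q(s, a, r, s_next, d):
    """One gradient descent step on the mean-squared Bellman error for both critics."""
    y = compute_target(r, s_next, d)
    q_loss = F.mse_loss(q1(s, a), y) + F.mse_loss(q2(s, a), y)
    q_optimizer.zero_grad()
    q_loss.backward()
    q_optimizer.step()
    return q_loss.item()
```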
- Update Policy
- Update the policy parameters θ by one step of gradient ascent on the entropy-regularized objective, which favors actions with high Q-values while keeping the policy exploratory. The gradient for the policy update is:
$$\nabla_{\theta} \frac{1}{|B|} \sum_{s \in B} \left( \min_{i=1,2} Q_{\phi_i}(s, \tilde{a}_\theta(s)) - \alpha \log \pi_\theta(\tilde{a}_\theta(s)|s) \right) $$
where $\tilde{a}_\theta(s)$ is a sample from $\pi_\theta(\cdot|s)$, drawn via the reparameterization trick so that it is differentiable with respect to θ.
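A corresponding sketch of the policy step (again with an illustrative learning rate); note that minimizing `alpha * logp - q_min` is the same as ascending the objective above:

```python
import torch

pi_optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

def update_policy(s):
    """One gradient ascent step on  min_i Q_phi_i(s, a~) - alpha * log pi_theta(a~|s)."""
    a_tilde, logp = policy(s)                            # reparameterized sample a~_theta(s)
    q_min = torch.min(q1(s, a_tilde), q2(s, a_tilde))
    pi_loss = (alpha * logp - q_min).mean()              # negative of the objective
    pi_optimizer.zero_grad()
    pi_loss.backward()    # critic gradients produced here are discarded at the next critic update
    pi_optimizer.step()   # only the policy parameters theta are updated
    return pi_loss.item()
```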
- Update Target Networks
- Update the target Q-network parameters using a soft update with a rate ρ:
$$\phi_{\text{targ},i} \leftarrow \rho \phi_{\text{targ},i} + (1 - \rho) \phi_i $$
This soft update keeps the target networks close to the Q-networks while reducing instability in training.
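A sketch of this soft (Polyak) update, with ρ set to an illustrative value close to 1:

```python
import torch

rho = 0.995   # Polyak averaging coefficient

def update_targets():
    """phi_targ,i <- rho * phi_targ,i + (1 - rho) * phi_i"""
    with torch.no_grad():
        for q, q_targ in ((q1, q1_targ), (q2, q2_targ)):
            for p, p_targ in zip(q.parameters(), q_targ.parameters()):
                p_targ.mul_(rho).add_((1.0 - rho) * p)
```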
2d. Repeat Until Convergence
Continue these steps until the SAC algorithm converges, meaning that the policy and Q-functions stabilize.
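Putting the pieces together, the overall loop could be organized as below, reusing the objects and functions defined in the sketches above. The `sample_batch` helper and all hyperparameter values (batch size, warm-up length, update frequency, total steps) are assumptions for this sketch rather than fixed parts of the algorithm.

```python
import random

import numpy as np
import torch

def sample_batch(batch_size=256):
    """Sample a mini-batch B = {(s, a, r, s', d)} from D and stack it into tensors."""
    batch = random.sample(replay_buffer, batch_size)
    s, a, r, s2, d = (torch.as_tensor(np.array(x), dtype=torch.float32)
                      for x in zip(*batch))
    return s, a, r, s2, d

warmup_steps, update_every = 1_000, 50
s, _ = env.reset()

for step in range(100_000):
    # Outer loop: collect one transition, as in the interaction sketch of Section 2b
    with torch.no_grad():
        a, _ = policy(torch.as_tensor(s, dtype=torch.float32))
    s_next, r, terminated, truncated, _ = env.step(a.numpy())
    replay_buffer.append((s, a.numpy(), r, s_next, float(terminated)))
    s = env.reset()[0] if (terminated or truncated) else s_next

    # Inner loop: if it's time to update, perform a predefined number of update steps
    if step >= warmup_steps and step % update_every == 0:
        for _ in range(update_every):
            s_b, a_b, r_b, s2_b, d_b = sample_batch()
            update_q(s_b, a_b, r_b, s2_b, d_b)
            update_policy(s_b)
            update_targets()
```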
3. Summary
- Outer Loop: Repeat for each episode, sampling new transitions from the environment.
- Inner Loop (Update Step): If it’s time to update, sample a batch and perform updates to the Q-functions, policy, and target Q-networks.
- Key Components:
- Q-function target calculation using the minimum of two Q-functions.
- Policy update with entropy regularization to encourage exploration.
- Target Q-network soft updates for stable learning.
This structured process allows SAC to maintain stable and sample-efficient learning by balancing exploration and exploitation with entropy maximization.