Deep Q-Learning

Reinforcement Learning

Last updated: November 25, 2024

1. Introduction

In this lesson, we explore the transition from using a Q-table to employing a parameterized Q-function, typically represented by a neural network, to estimate Q-values. This shift is crucial for handling environments with continuous or high-dimensional state-action spaces, where storing explicit Q-values for every state-action pair is infeasible. We'll dive into how neural networks approximate Q-values, introducing key concepts like target networks and replay buffers, which are critical for stabilizing training and ensuring effective learning.

2. Learning Process

[Figure: fundamental diagram of Deep Reinforcement Learning]

2a. Generating a Target

When the agent transitions from state $s$ to $s'$ after taking action $a$, it receives a reward $R(s, a, s')$. The target value for the Q-function combines this immediate reward with the discounted future reward:
$$
\text{target}(s') = R(s, a, s') + \gamma \max_{a'} Q_{\theta_k}(s', a')
$$

where:

- $R(s, a, s')$ is the immediate reward received for taking action $a$ in state $s$ and landing in $s'$,
- $\gamma \in [0, 1]$ is the discount factor that weights future rewards,
- $Q_{\theta_k}(s', a')$ is the Q-value predicted by the current network with parameters $\theta_k$, and the maximum is taken over the actions $a'$ available in $s'$.

2b. Updating Parameters $\theta$

Unlike tabular Q-learning, where Q-values are directly updated, Approximate Q-Learning uses gradient descent to adjust the parameters $\theta$ of the neural network. The goal is to minimize the difference between the predicted Q-value $Q_{\theta}(s, a)$ and the target value.

2c. Objective Function

The loss function measures the error between the predicted Q-value and the target value:
$$
L(\theta) = \frac{1}{2} \left( Q_{\theta}(s, a) - \text{target}(s') \right)^2
$$

The network parameters are updated via gradient descent:
$$
\theta_{k+1} \leftarrow \theta_k - \alpha \nabla_\theta L(\theta) \bigg|_{\theta = \theta_k}
$$

where $\alpha$ is the learning rate.
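
As a concrete illustration, here is a minimal sketch of this update for a single transition, written in PyTorch. The tiny network, the state dimensions, and the transition values are placeholders chosen for the example, not part of the lesson.

```python
import torch
import torch.nn as nn

# Placeholder Q-network for a 4-dimensional state and 2 discrete actions.
q_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)  # alpha = learning rate
gamma = 0.99                                              # discount factor

# One observed transition (s, a, r, s'); the values are placeholders.
s, a, r, s_next = torch.randn(4), 0, 1.0, torch.randn(4)

# target(s') = R(s, a, s') + gamma * max_a' Q_theta_k(s', a'), held fixed (no gradient).
with torch.no_grad():
    target = r + gamma * q_net(s_next).max()

# L(theta) = 1/2 * (Q_theta(s, a) - target)^2
loss = 0.5 * (q_net(s)[a] - target) ** 2

# theta <- theta - alpha * grad_theta L(theta)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```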

2d. Iterative Learning

To prevent overfitting to individual experiences, the network is trained iteratively using a diverse set of experiences, allowing the Q-function $Q_{\theta}$ to improve incrementally as it encounters new states and actions.

3. Deep Q-Network (DQN) Training Algorithm

Deep Q-Networks (DQNs) are a specific implementation of Approximate Q-Learning, leveraging replay buffers and target networks to stabilize and enhance training. The core steps of the algorithm are as follows:

3a. Replay Buffer

A replay buffer $D$ stores past experiences $(s_t, a_t, r_t, s_{t+1})$, enabling the agent to learn from a diverse set of transitions rather than from consecutive, highly correlated experiences:
$$
 D = \{ (s_t, a_t, r_t, s_{t+1}) \} 
$$

The buffer has a fixed size, discarding the oldest experiences as new ones are added.
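
A minimal sketch of such a buffer, assuming transitions are stored as plain tuples and sampled uniformly at random; a `done` flag is included so terminal states can be handled later when computing targets (an assumption not spelled out in the formula above). The `deque` discards the oldest entries once the capacity is reached.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer D of transitions (s, a, r, s_next, done)."""

    def __init__(self, capacity):
        # maxlen makes the deque drop its oldest entries automatically.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks correlations between consecutive transitions.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```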

3b. Q-Function Parameterization

The Q-function is approximated by a neural network $Q(s, a; \theta)$, referred to as the main Q-network. To stabilize training, a separate target Q-network $\hat{Q}(s, a; \theta^-)$ is used, with parameters $\theta^-$ updated periodically to match $\theta$.

$$
 Q(s_t, a_t; \theta) \quad (\text{Main Q-network}) 
$$

$$
 \hat{Q}(s_t, a_t; \theta^-) \quad (\text{Target Q-network}) 
$$
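
One possible parameterization, sketched as a small fully connected network in PyTorch; the layer sizes and dimensions are arbitrary placeholders. The target network is simply a second copy whose parameters $\theta^-$ are only overwritten from $\theta$ periodically.

```python
import copy

import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per action: Q(s, .; theta)."""

    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return self.net(state)

# Main Q-network (parameters theta) and target Q-network (parameters theta^-), initially identical.
q_net = QNetwork(state_dim=4, n_actions=2)   # dimensions are placeholders
target_net = copy.deepcopy(q_net)
target_net.eval()  # the target network is never trained directly
```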

3c. Action Selection (Epsilon-Greedy Policy)

The agent selects actions using an epsilon-greedy policy:

- With probability $\epsilon$, a random action is selected (exploration).
- With probability $1 - \epsilon$, the greedy action $a_t = \arg\max_a Q(s_t, a; \theta)$ is selected (exploitation).

Typically, $\epsilon$ is annealed from a value near 1 toward a small constant as training progresses.
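
A minimal sketch of this rule, assuming a `q_net` like the one above and a state already converted to a tensor:

```python
import random

import torch

def select_action(q_net, state, epsilon, n_actions):
    """Epsilon-greedy: explore with probability epsilon, otherwise act greedily."""
    if random.random() < epsilon:
        return random.randrange(n_actions)            # exploration: random action
    with torch.no_grad():
        return int(q_net(state).argmax().item())      # exploitation: argmax_a Q(s, a; theta)
```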

3d. Storing Transitions in Replay Buffer

After executing action $a_t$, the agent observes the reward $r_t$ and next state $s_{t+1}$. This transition is stored in the replay buffer $D$:
$$
D \leftarrow D \cup (s_t, a_t, r_t, s_{t+1}).
$$
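
As a usage sketch only, one interaction step might look like the following; it builds on the `ReplayBuffer` and `select_action` sketches above and assumes a Gym-style environment, whose exact step signature depends on the library version.

```python
import torch

# `env`, `buffer`, `q_net`, `epsilon`, `n_actions`, and `state` are assumed to exist already.
action = select_action(q_net, torch.as_tensor(state, dtype=torch.float32), epsilon, n_actions)
next_state, reward, done, info = env.step(action)   # Gym-style step (signature varies by version)
buffer.push(state, action, reward, next_state, done)
state = next_state
```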

3e. Sampling from Replay Buffer

A mini-batch of transitions is randomly sampled from the replay buffer for training. Sampling randomly breaks correlations between consecutive experiences, improving training stability.
$$
 \{ (s_j, a_j, r_j, s_{j+1}) \}_{j=1}^N \sim D 
$$

3f. Computing Target Values

For each transition in the mini-batch, compute the target value $y_j$ using the target network:

$$
y_j =
\begin{cases}
r_j & \text{if } s_{j+1} \text{ is terminal}, \\
r_j + \gamma \max_{a'} \hat{Q}(s_{j+1}, a'; \theta^-) & \text{otherwise},
\end{cases}
$$

where $\gamma$ is the discount factor.
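
A vectorized sketch of this computation for a sampled mini-batch, reusing the `ReplayBuffer` and `target_net` from the earlier sketches; the `done` flag marks terminal transitions.

```python
import torch

gamma = 0.99  # discount factor

# batch = buffer.sample(N): a list of (s, a, r, s_next, done) tuples.
states, actions, rewards, next_states, dones = zip(*batch)
states      = torch.as_tensor(states, dtype=torch.float32)
actions     = torch.as_tensor(actions, dtype=torch.int64)
rewards     = torch.as_tensor(rewards, dtype=torch.float32)
next_states = torch.as_tensor(next_states, dtype=torch.float32)
dones       = torch.as_tensor(dones, dtype=torch.float32)

# y_j = r_j + gamma * max_a' Q_hat(s_{j+1}, a'; theta^-), with the bootstrap term
# zeroed out for terminal transitions. The target network is not differentiated through.
with torch.no_grad():
    next_q = target_net(next_states).max(dim=1).values
    targets = rewards + gamma * (1.0 - dones) * next_q
```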

3g. Loss Function

The objective is to minimize the difference between the predicted Q-value $Q(s_j, a_j; \theta)$ and the target value $y_j$ (computed in the previous step). The loss function is the mean squared error between these two quantities:
$$ L(\theta) = \frac{1}{N} \sum_{j=1}^{N} \left( y_j - Q(s_j, a_j; \theta) \right)^2. $$

3h. Gradient Descent

Perform a gradient descent step to update the Q-network parameters $\theta$ by minimizing the loss $L(\theta)$. The gradient of the loss with respect to $\theta$ is computed and used for updating:

$$
 \theta \leftarrow \theta - \alpha \nabla_\theta L(\theta) 
$$

where $\alpha$ is the learning rate.
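
Continuing the mini-batch sketch from step 3f, the loss and the parameter update can be written as follows; in a full training loop the optimizer would be created once, outside this step.

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)  # learning rate alpha (placeholder value)

# Q(s_j, a_j; theta): the predicted Q-values for the actions actually taken.
q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

# L(theta) = (1/N) * sum_j (y_j - Q(s_j, a_j; theta))^2
loss = F.mse_loss(q_values, targets)

# theta <- theta - alpha * grad_theta L(theta)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```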

3i. Updating the Target Network

Every $C$ steps, the target network’s parameters $\theta^-$ are updated to match the main Q-network’s parameters $\theta$. This periodic update helps stabilize learning by decoupling the target computation from the rapidly changing Q-network:

$$ \theta^- \leftarrow \theta. $$
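
In code, this is a single parameter copy performed every $C$ training steps; `step` is an assumed global step counter and the value of $C$ is a placeholder.

```python
C = 1000  # target-network update period (placeholder value)

if step % C == 0:
    target_net.load_state_dict(q_net.state_dict())  # theta^- <- theta
```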

This concludes the step-by-step breakdown of the Deep Q-Network (DQN) training algorithm.

4. Summary

This lesson introduced the mechanics of Approximate Q-Learning, with a focus on the Deep Q-Network (DQN) algorithm. Key innovations like replay buffers, target networks, and epsilon-greedy action selection address challenges of instability and inefficiency in training, enabling reinforcement learning agents to operate in complex, high-dimensional environments.

