Deep Q-Network (DQN) Training Algorithm

Deep Reinforcement Learning

Last updated: December 31, 2024

1. Introduction

Deep Q-Networks (DQNs) are a practical implementation of Approximate Q-Learning using deep neural networks. Originally introduced by DeepMind to play Atari games from raw pixels, DQNs incorporate key innovations to stabilize training and handle large-scale tasks.

2. Learning Process Overview

2a. Generating a Target

For each transition $(s,a,r,s')$:

$$\text{target}(s') =\begin{cases}r, & \text{if } s' \text{ is terminal},\\r + \gamma \max_{a'} \hat{Q}(s',a';\theta^-), & \text{otherwise}.\end{cases}$$
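As a minimal PyTorch sketch of this target computation (the function and tensor names here are illustrative, not from the original lesson), assuming a batch of `rewards`, `next_states`, and `dones` tensors and a target network `q_target_net`:

```python
import torch

# Minimal sketch (illustrative names and shapes):
# rewards: (N,), next_states: (N, obs_dim), dones: (N,) float in {0., 1.}
def compute_targets(rewards, next_states, dones, q_target_net, gamma=0.99):
    with torch.no_grad():  # targets are treated as constants, no gradient flows
        next_q = q_target_net(next_states).max(dim=1).values  # max over a' of Q-hat(s', a'; theta-)
        # r if s' is terminal (done == 1), else r + gamma * max_a' Q-hat(s', a')
        return rewards + gamma * (1.0 - dones) * next_q
```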

2b. Updating Parameters ($\theta$)

We compute a loss between the predicted Q-value $Q(s,a;\theta)$ and the target $\text{target}(s')$. We then apply gradient descent to update $\theta$.

$$L(\theta) = \mathbb{E}\Bigl[ \bigl(\text{target}(s') - Q(s,a;\theta)\bigr)^2 \Bigr].$$
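Continuing the sketch, one gradient step on $\theta$ might look like this (again with illustrative names; `compute_targets` is the helper defined above, and `actions` is a batch of integer action indices):

```python
import torch.nn.functional as F

# Minimal sketch of one update on theta (illustrative names):
# states: (N, obs_dim), actions: (N,) long tensor, targets: (N,) from compute_targets
def update_step(q_net, optimizer, states, actions, targets):
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)  # Q(s, a; theta)
    loss = F.mse_loss(q_sa, targets)  # mean of (target(s') - Q(s, a; theta))^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```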

3. Core DQN Components

  1. Replay Buffer ($D$)

    • Stores past experiences, from which we sample mini-batches randomly.
    • Breaks correlation between consecutive samples, enhancing stability (see the sketch after this list).
  2. Target Network ($\hat{Q}$)

    • Maintains a separate set of parameters $\theta^-$ that lag behind the main network ($\theta$) and are updated periodically.
    • Reduces instability caused by a constantly shifting Q-value target.
  3. Epsilon-Greedy Policy

    • Balances exploration ($\epsilon$ random actions) and exploitation ($\arg\max Q$).
    • Decay $\epsilon$ over time to pivot from exploration to exploitation.
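A minimal sketch of components 1 and 3, with illustrative names and a default capacity chosen purely for the example:

```python
import random
from collections import deque

import torch

class ReplayBuffer:
    """Fixed-size buffer D storing transitions; sampled as uniform random mini-batches."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # old experiences are evicted automatically

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)  # breaks temporal correlation
        s, a, r, s_next, done = map(list, zip(*batch))
        return s, a, r, s_next, done

    def __len__(self):
        return len(self.buffer)

def select_action(q_net, state, epsilon, num_actions):
    """Epsilon-greedy: random action w.p. epsilon, else argmax_a Q(s, a; theta)."""
    if random.random() < epsilon:
        return random.randrange(num_actions)
    with torch.no_grad():
        return int(q_net(state.unsqueeze(0)).argmax(dim=1).item())
```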

4. DQN Algorithm Steps

  1. Initialize Q-network parameters $\theta$, target network parameters $\theta^- = \theta$, and replay buffer $D$.
  2. For each episode:
    • Observe state $s_t$.
    • Select action $a_t$ via epsilon-greedy.
    • Execute action, observe reward $r_t$ and next state $s_{t+1}$.
    • Store transition $(s_t,a_t,r_t,s_{t+1})$ in $D$.
    • Sample a mini-batch of transitions from $D$.
    • Compute target values $y_j$ for each transition in the batch.
    • Minimize loss $\frac{1}{N}\sum_j \bigl(y_j - Q(s_j,a_j;\theta)\bigr)^2$ w.r.t. $\theta$.
    • Periodically update $\theta^- \gets \theta$.
  3. Repeat until convergence or for a fixed number of episodes.
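Tying these steps together, here is a compact sketch of the loop, reusing `ReplayBuffer`, `select_action`, `compute_targets`, and `update_step` from the earlier sketches. It assumes a Gymnasium-style environment API, and every hyperparameter default is illustrative rather than prescribed by the lesson:

```python
import copy

import torch

def train_dqn(env, q_net, num_episodes=500, batch_size=64,
              gamma=0.99, lr=1e-3, target_update_every=1_000,
              eps_start=1.0, eps_end=0.05, eps_decay=0.995):
    q_target = copy.deepcopy(q_net)  # step 1: theta- <- theta
    optimizer = torch.optim.Adam(q_net.parameters(), lr=lr)
    buffer = ReplayBuffer()
    epsilon, step_count = eps_start, 0
    num_actions = env.action_space.n

    for episode in range(num_episodes):
        obs, _ = env.reset()
        done = False
        while not done:
            state = torch.as_tensor(obs, dtype=torch.float32)
            action = select_action(q_net, state, epsilon, num_actions)
            obs_next, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            # bootstrap only through true terminations, not time-limit truncations
            buffer.push(obs, action, reward, obs_next, float(terminated))
            obs = obs_next
            step_count += 1

            if len(buffer) >= batch_size:
                s, a, r, s2, d = buffer.sample(batch_size)
                s = torch.as_tensor(s, dtype=torch.float32)
                a = torch.as_tensor(a, dtype=torch.long)
                r = torch.as_tensor(r, dtype=torch.float32)
                s2 = torch.as_tensor(s2, dtype=torch.float32)
                d = torch.as_tensor(d, dtype=torch.float32)
                y = compute_targets(r, s2, d, q_target, gamma)
                update_step(q_net, optimizer, s, a, y)

            if step_count % target_update_every == 0:
                q_target.load_state_dict(q_net.state_dict())  # theta- <- theta

        epsilon = max(eps_end, epsilon * eps_decay)  # decay exploration per episode
```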

5. Why DQN is Effective

  • Experience replay: sampling random mini-batches from $D$ decorrelates updates, making training behave more like supervised learning on independent data.
  • Target network: freezing $\theta^-$ between periodic syncs prevents the regression target from shifting at every step.
  • Function approximation: a deep network generalizes Q-values across large state spaces (e.g., raw Atari pixels) where a tabular approach is infeasible.

6. Summary

DQN is a landmark in deep reinforcement learning: scalable, powerful, and broadly applicable. With DQN, we can tackle tasks like Atari from raw pixels or even continuous control (with some modifications). In practice (e.g., with PyTorch), you'll set up:

  • a Q-network and a lagged target network,
  • a replay buffer for storing and sampling transitions,
  • an epsilon-greedy policy with a decay schedule,
  • an optimizer and a squared-error loss for the TD updates.

Armed with these tools, you can train RL agents in a variety of environments, including the popular Lunar Lander. This approach forms the foundation for even more advanced methods (e.g., Double DQN, Dueling DQN) that further improve stability and performance.
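As one possible illustration, a minimal Lunar Lander setup might look like the following. It assumes `gymnasium[box2d]` is installed; the env id is version-dependent (e.g., `LunarLander-v3` in recent Gymnasium releases, `LunarLander-v2` in older ones), and the network shape is an arbitrary choice:

```python
import gymnasium as gym
import torch.nn as nn

# Env id depends on your Gymnasium version; requires the box2d extra.
env = gym.make("LunarLander-v3")
obs_dim = env.observation_space.shape[0]  # 8 state features
num_actions = env.action_space.n          # 4 discrete actions

# A small MLP Q-network is enough for Lunar Lander's low-dimensional state.
q_net = nn.Sequential(
    nn.Linear(obs_dim, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, num_actions),
)

train_dqn(env, q_net)  # training loop sketched in Section 4
```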
