Deep Q-Network (DQN) Training Algorithm

Deep Reinforcement Learning

Last updated: December 31, 2024

1. Introduction

Deep Q-Networks (DQNs) are a practical implementation of Approximate Q-Learning using deep neural networks. Originally introduced by DeepMind to play Atari games from raw pixels, DQNs incorporate key innovations to stabilize training and handle large-scale tasks.

2. Learning Process Overview

2a. Generating a Target

For each transition $(s,a,r,s')$:

$$\text{target}(s') =\begin{cases}r, & \text{if } s' \text{ is terminal},\\r + \gamma \max_{a'} \hat{Q}(s',a';\theta^-), & \text{otherwise}.\end{cases}$$
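As a minimal PyTorch sketch of this target computation (the function and tensor names here are illustrative, not from the original lesson), assuming a batch of `rewards`, `next_states`, and `dones` tensors and a target network `q_target_net`:

```python
import torch

# Minimal sketch (illustrative names and shapes):
# rewards: (N,), next_states: (N, obs_dim), dones: (N,) float in {0., 1.}
def compute_targets(rewards, next_states, dones, q_target_net, gamma=0.99):
    with torch.no_grad():  # targets are treated as constants, no gradient flows
        next_q = q_target_net(next_states).max(dim=1).values  # max over a' of Q-hat(s', a'; theta-)
        # r if s' is terminal (done == 1), else r + gamma * max_a' Q-hat(s', a')
        return rewards + gamma * (1.0 - dones) * next_q
```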

2b. Updating Parameters ($\theta$)

We compute a loss between the predicted Q-value $Q(s,a;\theta)$ and the target $\text{target}(s')$. We then apply gradient descent to update $\theta$.

$$L(\theta) = \mathbb{E}\Bigl[ \bigl(\text{target}(s') - Q(s,a;\theta)\bigr)^2 \Bigr].$$
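Continuing the sketch, one gradient step on $\theta$ might look like this (again with illustrative names; `compute_targets` is the helper defined above, and `actions` is a batch of integer action indices):

```python
import torch.nn.functional as F

# Minimal sketch of one update on theta (illustrative names):
# states: (N, obs_dim), actions: (N,) long tensor, targets: (N,) from compute_targets
def update_step(q_net, optimizer, states, actions, targets):
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)  # Q(s, a; theta)
    loss = F.mse_loss(q_sa, targets)  # mean of (target(s') - Q(s, a; theta))^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```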

3. Core DQN Components

  1. Replay Buffer ($D$)

    • Stores past experiences, from which we sample mini-batches randomly.
    • Breaks correlation between consecutive samples, enhancing stability (see the sketch after this list).
  2. Target Network ($\hat{Q}$)

    • Maintains a separate set of parameters $\theta^-$ that lag behind the main network ($\theta$) and are updated periodically.
    • Reduces instability caused by a constantly shifting Q-value target.
  3. Epsilon-Greedy Policy

    • Balances exploration ($\epsilon$ random actions) and exploitation ($\arg\max Q$).
    • Decay $\epsilon$ over time to pivot from exploration to exploitation.
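A minimal sketch of components 1 and 3, with illustrative names and a default capacity chosen purely for the example:

```python
import random
from collections import deque

import torch

class ReplayBuffer:
    """Fixed-size buffer D storing transitions; sampled as uniform random mini-batches."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # old experiences are evicted automatically

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)  # breaks temporal correlation
        s, a, r, s_next, done = map(list, zip(*batch))
        return s, a, r, s_next, done

    def __len__(self):
        return len(self.buffer)

def select_action(q_net, state, epsilon, num_actions):
    """Epsilon-greedy: random action w.p. epsilon, else argmax_a Q(s, a; theta)."""
    if random.random() < epsilon:
        return random.randrange(num_actions)
    with torch.no_grad():
        return int(q_net(state.unsqueeze(0)).argmax(dim=1).item())
```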

4. DQN Algorithm Steps

  1. Initialize Q-network parameters $\theta$, target network parameters $\theta^- = \theta$, and replay buffer $D$.
  2. For each episode:
    • Observe state $s_t$.
    • Select action $a_t$ via epsilon-greedy.
    • Execute action, observe reward $r_t$ and next state $s_{t+1}$.
    • Store transition $(s_t,a_t,r_t,s_{t+1})$ in $D$.
    • Sample a mini-batch of transitions from $D$.
    • Compute target values $y_j$ for each transition in the batch.
    • Minimize loss $\frac{1}{N}\sum_j \bigl(y_j - Q(s_j,a_j;\theta)\bigr)^2$ w.r.t. $\theta$.
    • Periodically update $\theta^- \gets \theta$.
  3. Repeat until convergence or for a fixed number of episodes.
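Tying these steps together, here is a compact sketch of the loop, reusing `ReplayBuffer`, `select_action`, `compute_targets`, and `update_step` from the earlier sketches. It assumes a Gymnasium-style environment API, and every hyperparameter default is illustrative rather than prescribed by the lesson:

```python
import copy

import torch

def train_dqn(env, q_net, num_episodes=500, batch_size=64,
              gamma=0.99, lr=1e-3, target_update_every=1_000,
              eps_start=1.0, eps_end=0.05, eps_decay=0.995):
    q_target = copy.deepcopy(q_net)  # step 1: theta- <- theta
    optimizer = torch.optim.Adam(q_net.parameters(), lr=lr)
    buffer = ReplayBuffer()
    epsilon, step_count = eps_start, 0
    num_actions = env.action_space.n

    for episode in range(num_episodes):
        obs, _ = env.reset()
        done = False
        while not done:
            state = torch.as_tensor(obs, dtype=torch.float32)
            action = select_action(q_net, state, epsilon, num_actions)
            obs_next, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            # bootstrap only through true terminations, not time-limit truncations
            buffer.push(obs, action, reward, obs_next, float(terminated))
            obs = obs_next
            step_count += 1

            if len(buffer) >= batch_size:
                s, a, r, s2, d = buffer.sample(batch_size)
                s = torch.as_tensor(s, dtype=torch.float32)
                a = torch.as_tensor(a, dtype=torch.long)
                r = torch.as_tensor(r, dtype=torch.float32)
                s2 = torch.as_tensor(s2, dtype=torch.float32)
                d = torch.as_tensor(d, dtype=torch.float32)
                y = compute_targets(r, s2, d, q_target, gamma)
                update_step(q_net, optimizer, s, a, y)

            if step_count % target_update_every == 0:
                q_target.load_state_dict(q_net.state_dict())  # theta- <- theta

        epsilon = max(eps_end, epsilon * eps_decay)  # decay exploration per episode
```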

5. Why DQN is Effective

  • Experience replay: sampling random mini-batches from $D$ decorrelates updates, making training behave more like supervised learning on independent data.
  • Target network: freezing $\theta^-$ between periodic syncs prevents the regression target from shifting at every step.
  • Function approximation: a deep network generalizes Q-values across large state spaces (e.g., raw Atari pixels) where a tabular approach is infeasible.

6. Summary

DQN is a landmark in deep reinforcement learning: scalable, powerful, and broadly applicable. With DQN, we can tackle tasks like Atari from raw pixels or even continuous control (with some modifications). In practice (e.g., with PyTorch), you'll set up:

  • a Q-network and a lagged target network,
  • a replay buffer for storing and sampling transitions,
  • an epsilon-greedy policy with a decay schedule,
  • an optimizer and a squared-error loss for the TD updates.

Armed with these tools, you can train RL agents in a variety of environments, including the popular Lunar Lander. This approach forms the foundation for even more advanced methods (e.g., Double DQN, Dueling DQN) that further improve stability and performance.
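As one possible illustration, a minimal Lunar Lander setup might look like the following. It assumes `gymnasium[box2d]` is installed; the env id is version-dependent (e.g., `LunarLander-v3` in recent Gymnasium releases, `LunarLander-v2` in older ones), and the network shape is an arbitrary choice:

```python
import gymnasium as gym
import torch.nn as nn

# Env id depends on your Gymnasium version; requires the box2d extra.
env = gym.make("LunarLander-v3")
obs_dim = env.observation_space.shape[0]  # 8 state features
num_actions = env.action_space.n          # 4 discrete actions

# A small MLP Q-network is enough for Lunar Lander's low-dimensional state.
q_net = nn.Sequential(
    nn.Linear(obs_dim, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, num_actions),
)

train_dqn(env, q_net)  # training loop sketched in Section 4
```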
