1. Introduction
Deep Q-Networks (DQNs) are a practical implementation of Approximate Q-Learning using deep neural networks. Originally introduced by DeepMind to play Atari games from raw pixels, DQNs incorporate key innovations to stabilize training and handle large-scale tasks.
2. Learning Process Overview
2a. Generating a Target
For each transition $(s,a,r,s')$:
$$\text{target}(s') =\begin{cases}r, & \text{if } s' \text{ is terminal},\\r + \gamma \max_{a'} \hat{Q}(s',a';\theta^-), & \text{otherwise}.\end{cases}$$
- $\hat{Q}$ is the target network, a periodically updated copy of the main Q-network.
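As a concrete illustration, here is a minimal sketch (in PyTorch, with illustrative names such as `target_net`, `rewards`, `next_states`, and `dones`) of how the target could be computed for a whole mini-batch of transitions at once:

```python
import torch

def compute_targets(rewards, next_states, dones, target_net, gamma=0.99):
    """TD targets: r if s' is terminal, r + gamma * max_a' Q_hat(s', a'; theta^-) otherwise."""
    with torch.no_grad():                                   # no gradient flows through theta^-
        next_q = target_net(next_states).max(dim=1).values  # max_a' Q_hat(s', a'; theta^-)
    return rewards + gamma * (1.0 - dones) * next_q         # dones = 1.0 zeroes out the bootstrap term
```

Evaluating the target network under `torch.no_grad()` reflects the fact that the target is treated as a fixed label: gradients never flow into $\theta^-$.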
2b. Updating Parameters ($\theta$)
We compute a loss between the predicted Q-value $Q(s,a;\theta)$ and the target $\text{target}(s')$. We then apply gradient descent to update $\theta$.
$$L(\theta) = \mathbb{E}\Bigl[ \bigl(\text{target}(s') - Q(s,a;\theta)\bigr)^2 \Bigr].$$
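In PyTorch, one update step could look like the sketch below, assuming a main network `q_net`, an optimizer over its parameters, the `compute_targets` helper sketched above, and batch tensors `states`, `actions`, `rewards`, `next_states`, `dones` (all names are illustrative):

```python
import torch.nn.functional as F

# Q(s, a; theta) for the actions actually taken in the batch
q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

targets = compute_targets(rewards, next_states, dones, target_net)

loss = F.mse_loss(q_pred, targets)   # mean of (target(s') - Q(s, a; theta))^2 over the batch

optimizer.zero_grad()
loss.backward()                      # gradients w.r.t. theta only; theta^- stays fixed
optimizer.step()
```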
3. Core DQN Components
- Replay Buffer ($D$)
  - Stores past experiences, from which we sample mini-batches at random (see the sketch after this list).
  - Breaks correlation between consecutive samples, enhancing stability.
- Target Network ($\hat{Q}$)
  - Maintains a separate set of parameters $\theta^-$ that lag behind the main network ($\theta$) and are updated periodically.
  - Reduces instability caused by a constantly shifting Q-value target.
- Epsilon-Greedy Policy
  - Balances exploration (random actions with probability $\epsilon$) and exploitation ($\arg\max_a Q$).
  - Decays $\epsilon$ over time to shift from exploration to exploitation.
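A minimal sketch of the replay buffer and the epsilon-greedy policy, assuming PyTorch and NumPy-array states (class and function names are illustrative; the target network is simply a second copy of the Q-network, synced in the training loop shown later):

```python
import random
from collections import deque

import numpy as np
import torch

class ReplayBuffer:
    """Fixed-size store of transitions; uniform random sampling breaks temporal correlation."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s_next, done = map(np.array, zip(*batch))
        return (torch.as_tensor(s, dtype=torch.float32),
                torch.as_tensor(a, dtype=torch.int64),
                torch.as_tensor(r, dtype=torch.float32),
                torch.as_tensor(s_next, dtype=torch.float32),
                torch.as_tensor(done, dtype=torch.float32))

    def __len__(self):
        return len(self.buffer)

def epsilon_greedy(q_net, state, epsilon, num_actions):
    """With probability epsilon pick a random action; otherwise argmax_a Q(s, a; theta)."""
    if random.random() < epsilon:
        return random.randrange(num_actions)
    with torch.no_grad():
        q_values = q_net(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
    return int(q_values.argmax(dim=1).item())
```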
4. DQN Algorithm Steps
- Initialize Q-network parameters $\theta$, target network $\theta^- = \theta$, and replay buffer $D$.
- For each episode:
- Observe state $s_t$.
- Select action $a_t$ via epsilon-greedy.
- Execute action, observe reward $r_t$ and next state $s_{t+1}$.
- Store transition $(s_t,a_t,r_t,s_{t+1})$ in $D$.
- Sample a mini-batch of transitions from $D$.
- Compute target values $y_j$ for each transition in the batch.
- Minimize the loss $\frac{1}{N}\sum_j \bigl(y_j - Q(s_j,a_j;\theta)\bigr)^2$ w.r.t. $\theta$.
- Periodically update $\theta^- \gets \theta$.
- Repeat until convergence or for a fixed number of episodes.
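Put together, a condensed training loop might look like the sketch below. It assumes a Gymnasium-style environment `env` with discrete actions, a Q-network `q_net` (a minimal architecture is sketched in the summary section), and the `ReplayBuffer`, `epsilon_greedy`, and `compute_targets` helpers sketched earlier; all hyperparameters are illustrative.

```python
import copy

import torch
import torch.nn.functional as F

buffer = ReplayBuffer()
target_net = copy.deepcopy(q_net)                          # theta^- = theta
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)
gamma, batch_size, sync_every, num_episodes = 0.99, 64, 1000, 500
epsilon, step = 1.0, 0

for episode in range(num_episodes):
    state, _ = env.reset()
    done = False
    while not done:
        action = epsilon_greedy(q_net, state, epsilon, env.action_space.n)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        buffer.push(state, action, reward, next_state, float(terminated))  # bootstrap through truncation
        state = next_state
        step += 1

        if len(buffer) >= batch_size:
            s, a, r, s2, d = buffer.sample(batch_size)
            y = compute_targets(r, s2, d, target_net, gamma)            # targets y_j
            q_pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)      # Q(s_j, a_j; theta)
            loss = F.mse_loss(q_pred, y)                                # (1/N) sum_j (y_j - Q)^2
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        if step % sync_every == 0:
            target_net.load_state_dict(q_net.state_dict())              # theta^- <- theta

    epsilon = max(0.05, epsilon * 0.995)                                # decay exploration over episodes
```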
5. Why DQN is Effective
- Experience Replay: Reuses past experiences, improving sample efficiency.
- Target Network: Stabilizes learning by fixing the “target” Q-values over multiple updates.
- Function Approximation: Deep neural networks learn generalizable Q-value estimates in high-dimensional spaces.
6. Summary
DQN is a landmark in deep reinforcement learning: scalable, powerful, and broadly applicable. With DQN, we can tackle tasks like Atari from raw pixels or even continuous control (with some modifications). In practice (e.g., with PyTorch), you'll set up:
- A network architecture to process states and output Q-values (a minimal example is sketched after this list).
- A replay buffer to store transitions.
- A training loop to iteratively sample transitions, compute targets, and update parameters.
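For low-dimensional state vectors, the network can be a small MLP; an Atari-style agent would use a convolutional network over pixels instead. A minimal sketch in PyTorch (illustrative layer sizes):

```python
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per discrete action."""
    def __init__(self, state_dim, num_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),   # one output per action: Q(s, a; theta)
        )

    def forward(self, x):
        return self.net(x)

# e.g. Lunar Lander: 8-dimensional state, 4 discrete actions
q_net = QNetwork(state_dim=8, num_actions=4)
```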
Armed with these tools, you can train RL agents in a variety of environments, including the popular Lunar Lander. This approach forms the foundation for even more advanced methods (e.g., Double DQN, Dueling DQN) that further improve stability and performance.