Properties and Key Parameters in Q-Learning

Deep Reinforcement Learning

Last updated: December 31, 2024

1. Introduction

Now that you understand Q-learning’s basics, we focus on the key parameters that most strongly affect learning performance: the learning rate $\alpha$, the discount factor $\gamma$, and the exploration rate $\epsilon$.

We’ll also discuss Q-learning’s off-policy nature and how it allows flexible exploration strategies while still learning about the optimal policy.

2. Key Parameters

2.1 Learning Rate ($\alpha$)

Controls how much new information overrides old estimates.
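
Concretely, $\alpha$ scales the temporal-difference error in the standard tabular update:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

With $\alpha = 0$ the estimate never changes; with $\alpha = 1$ the old estimate is discarded entirely.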

Practical Tip:

Start with a relatively large $\alpha$ and decay it over time so the value estimates stabilize.
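
One possible decay schedule, sketched in Python (the exponential form and all constants here are illustrative choices, not prescribed by this lesson):

```python
import math

def decayed_alpha(episode, alpha_start=0.5, alpha_min=0.01, decay_rate=0.001):
    """Exponential decay toward a floor; every constant here is illustrative."""
    return max(alpha_min, alpha_start * math.exp(-decay_rate * episode))

# The learning rate shrinks from 0.5 toward the 0.01 floor as training progresses.
print([round(decayed_alpha(e), 3) for e in (0, 1000, 5000)])  # [0.5, 0.184, 0.01]
```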

2.2 Discount Factor ($\gamma$)

Balances the importance of future rewards vs. immediate rewards.

Practical Tip:

Choose $\gamma$ based on the problem horizon. For short tasks, a smaller $\gamma$ may suffice; for tasks requiring foresight (like Lunar Lander), a larger $\gamma$ is typical.
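
To see why the horizon matters, recall that a reward received $k$ steps in the future is weighted by $\gamma^k$ in the return:

$$G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$$

With $\gamma = 0.9$, a reward 10 steps away contributes only $0.9^{10} \approx 0.35$ of its value; with $\gamma = 0.99$ it still contributes $0.99^{10} \approx 0.90$.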

2.3 Exploration vs. Exploitation ($\epsilon$)

The agent must balance trying new actions to discover their value (exploration) against choosing the actions it currently believes are best (exploitation).

Practical Tip:

Epsilon-Greedy: A common strategy in which the agent picks a random action with probability $\epsilon$ (exploration) and the best-known action otherwise (exploitation). Gradually reduce $\epsilon$ over training so the agent shifts from exploring toward exploiting.
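
As a rough sketch, epsilon-greedy action selection over a tabular Q-function might look like this in Python (the dict-based table with a default of 0.0 is an assumption for illustration):

```python
import random

def epsilon_greedy_action(Q, state, n_actions, epsilon):
    """Pick a random action with probability epsilon, otherwise the greedy one.

    Q is assumed to be a dict mapping (state, action) -> estimated value,
    with 0.0 as the default for unseen pairs.
    """
    if random.random() < epsilon:
        return random.randrange(n_actions)  # explore
    return max(range(n_actions), key=lambda a: Q.get((state, a), 0.0))  # exploit
```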

3. Off-Policy Learning

One major advantage of Q-learning is that it is off-policy. Its updates use the target:

$$r + \gamma \max_{a'} Q(s', a')$$

regardless of whether the agent actually takes the max action in $s'$. This allows the agent to follow an exploratory behavior policy (such as $\epsilon$-greedy) while still learning the value of the greedy, optimal policy, and to learn from experience generated by older or different policies.
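
A minimal tabular sketch makes the off-policy split visible: the executed action comes from the exploratory behavior policy, while the bootstrap target always takes the max over next actions. The Gym-style env.step interface and the dict-based Q-table below are assumptions for illustration.

```python
import random

def q_learning_step(Q, env, state, n_actions, alpha, gamma, epsilon):
    """One off-policy Q-learning update.

    Assumptions for illustration: env.step(action) returns
    (next_state, reward, done, info) as in classic Gym, and Q is a dict
    mapping (state, action) -> value with 0.0 as the default estimate.
    """
    # Behavior policy: epsilon-greedy, so the agent keeps exploring.
    if random.random() < epsilon:
        action = random.randrange(n_actions)
    else:
        action = max(range(n_actions), key=lambda a: Q.get((state, a), 0.0))

    next_state, reward, done, _ = env.step(action)

    # Target policy: greedy. The max over a' is used even though the next
    # action actually executed may be exploratory; this is what makes
    # Q-learning off-policy.
    best_next = 0.0 if done else max(Q.get((next_state, a), 0.0) for a in range(n_actions))
    old = Q.get((state, action), 0.0)
    Q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return next_state, done
```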

4. Strategies for Convergence

  1. Sufficient Exploration: Make sure you systematically explore all relevant states/actions.
  2. Learning Rate Decay: Gradually reduce $\alpha$ so learning becomes more stable over time (see the step-size conditions below).
  3. Infinite Visits (Theoretical Requirement): In theory, every state-action pair must be visited infinitely often to guarantee convergence. In practice, a slowly decaying $\epsilon$ approximates this.
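
For the learning-rate schedule in particular, the classical convergence analysis of tabular Q-learning requires, in addition to infinite visits, that the step sizes satisfy the stochastic-approximation (Robbins-Monro) conditions:

$$\sum_{t=0}^{\infty} \alpha_t = \infty, \qquad \sum_{t=0}^{\infty} \alpha_t^2 < \infty$$

A schedule such as $\alpha_t = 1/t$ satisfies both, while a constant $\alpha$ does not; this is why decaying the learning rate matters for convergence in theory and for stability in practice.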

5. Summary

The effectiveness of Q-learning rests on fine-tuning its hyperparameters ($\alpha, \gamma, \epsilon$) and leveraging its off-policy nature. With appropriate tuning and enough experience, tabular Q-learning will converge in small-to-moderate environments.

However, to handle large or continuous state-action spaces, we need approximation techniques. That’s our next stop: Approximate Q-Learning and eventually Deep Q-Learning.
