Introduction to Q-Learning

Deep Reinforcement Learning

Last updated: December 31, 2024

1. Introduction

In the previous lessons, we explored Exact Solution Methods (Value Iteration, Policy Iteration) and the fundamentals of MDPs. Now, we shift to model-free reinforcement learning, where the agent does not require a full model of the environment’s dynamics.

A foundational model-free RL algorithm is Q-learning. It enables an agent to learn the optimal value of taking any action in any state, ultimately guiding it to choose actions that maximize long-term rewards. Unlike dynamic programming approaches that rely on knowing $P(s' \mid s, a)$, Q-learning learns directly from experience.

2. How Q-Learning Works

  1. Agent-Environment Interaction

    • At each time step $t$, the agent observes the current state $s_t$.
    • The agent selects an action $a_t$.
    • The environment responds with a reward $r_t$ and a new state $s_{t+1}$.
  2. Q-Values

    • A Q-value, $Q(s, a)$, estimates the expected discounted return for taking action $a$ in state $s$ and then continuing optimally thereafter.
    • Over many interactions, the agent refines these Q-values to approach the optimal action-value function $Q^*(s, a)$.
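
To make this concrete, here is a minimal Python sketch of the interaction loop. The `ChainEnv` class and its `reset`/`step` interface are illustrative assumptions (not taken from any particular library); the point is the observe-act-receive cycle that generates the $(s, a, r, s')$ tuples Q-learning consumes.

```python
import random

# A hypothetical toy environment: a 5-state chain where moving "right"
# (action 1) eventually reaches a rewarding terminal state.
class ChainEnv:
    def __init__(self, n_states=5):
        self.n_states = n_states
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action 0 = left, action 1 = right
        self.state = max(0, self.state - 1) if action == 0 else self.state + 1
        done = self.state == self.n_states - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done

# The agent-environment loop: observe s_t, pick a_t, receive r_t and s_{t+1}.
env = ChainEnv()
s = env.reset()
done = False
while not done:
    a = random.choice([0, 1])          # placeholder policy (random actions)
    s_next, r, done = env.step(a)
    print(f"s={s}, a={a}, r={r}, s'={s_next}")
    s = s_next
```

Each pass through the `while` loop produces one experience tuple; the updates in Sections 4 and 5 describe what to do with it.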

3. Bellman Equation for Q-values

The Bellman Equation underpins Q-learning by expressing $Q^*(s, a)$ in terms of immediate and future rewards:

$$Q^*(s,a) = \sum_{s'} P(s' \mid s, a) \Big[ R(s,a,s') + \gamma \max_{a'}Q^*(s', a') \Big].$$

  • $Q^*(s,a)$: Optimal Q-value.
  • $\max_{a'}Q^*(s', a')$: Greedy choice from state $s'$.
  • $\gamma$: Discount factor.
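
As a quick worked example with made-up numbers: suppose taking $a$ in $s$ leads to $s'_1$ with probability 0.8 and reward 1, or to $s'_2$ with probability 0.2 and reward 0, and that $\gamma = 0.9$, $\max_{a'} Q^*(s'_1, a') = 2$, and $\max_{a'} Q^*(s'_2, a') = 0$. Then

$$Q^*(s,a) = 0.8\,(1 + 0.9 \cdot 2) + 0.2\,(0 + 0.9 \cdot 0) = 0.8 \cdot 2.8 = 2.24.$$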

4. Q-Value Iteration

Using the Bellman Equation, we iteratively update estimates of $Q^*(s,a)$ until convergence:

$$Q_{k+1}(s, a)\;\leftarrow\;\sum_{s'} P(s'\mid s, a)\Big[R(s,a,s') + \gamma \max_{a'}Q_k(s', a')\Big].$$

In practice, exact iteration is often replaced by sampling from the real environment (i.e., we observe $(s,a,r,s')$ directly) to update our Q-values.
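
Below is a minimal sketch of Q-value iteration on a hypothetical two-state, two-action MDP whose model is fully known; the transition probabilities and rewards are made up purely for illustration.

```python
import numpy as np

# Hypothetical toy MDP with a *known* model: 2 states, 2 actions.
# P[s, a, s'] = transition probability, R[s, a, s'] = reward.
n_states, n_actions = 2, 2
P = np.zeros((n_states, n_actions, n_states))
R = np.zeros((n_states, n_actions, n_states))

P[0, 0] = [0.9, 0.1]; R[0, 0] = [0.0, 1.0]   # state 0, action 0
P[0, 1] = [0.2, 0.8]; R[0, 1] = [0.0, 1.0]   # state 0, action 1
P[1, 0] = [1.0, 0.0]; R[1, 0] = [0.0, 0.0]   # state 1, action 0
P[1, 1] = [0.5, 0.5]; R[1, 1] = [0.0, 2.0]   # state 1, action 1

gamma = 0.9
Q = np.zeros((n_states, n_actions))

# Repeatedly apply the Bellman backup until the Q-table stops changing.
for _ in range(1000):
    # Q_new[s, a] = sum_{s'} P[s,a,s'] * (R[s,a,s'] + gamma * max_{a'} Q[s', a'])
    Q_new = (P * (R + gamma * Q.max(axis=1))).sum(axis=2)
    if np.max(np.abs(Q_new - Q)) < 1e-8:
        break
    Q = Q_new

print("Q* estimate:\n", Q)
print("Greedy policy:", Q.argmax(axis=1))
```

Because the model $P$ and $R$ is available here, each update sweeps over all $(s, a)$ pairs exactly; the sampled update in the next section removes that requirement.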

5. (Tabular) Q-Learning

When the state-action space is manageable, we can store Q-values in a table (a 2D array for discrete $s$ and $a$). The tabular Q-learning update for each experience tuple $(s,a,r,s')$ is:

$$Q_{k+1}(s, a)\;\leftarrow\;(1 - \alpha) \, Q_k(s, a) \;+\;\alpha\biggl[r \;+\; \gamma \max_{a'}Q_k(s', a')\biggr].$$

  • $\alpha$: Learning rate.
  • $\gamma$: Discount factor.
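
Below is a minimal sketch of tabular Q-learning with an $\varepsilon$-greedy behavior policy on the same hypothetical 5-state chain introduced earlier (rewritten as a standalone `step` function so this snippet runs on its own). The hyperparameter values for $\alpha$, $\gamma$, and $\varepsilon$ are illustrative assumptions, not tuned settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 5-state chain: action 1 moves right, action 0 moves left;
# reaching the last state ends the episode with reward 1.
N_STATES, N_ACTIONS = 5, 2

def step(s, a):
    s_next = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    done = s_next == N_STATES - 1
    return s_next, (1.0 if done else 0.0), done

alpha, gamma, epsilon = 0.1, 0.9, 0.1   # illustrative hyperparameters
Q = np.zeros((N_STATES, N_ACTIONS))     # the Q-table

for episode in range(500):
    s, done = 0, False
    while not done:
        # epsilon-greedy action selection (ties broken at random)
        if rng.random() < epsilon:
            a = int(rng.integers(N_ACTIONS))
        else:
            a = int(rng.choice(np.flatnonzero(Q[s] == Q[s].max())))
        s_next, r, done = step(s, a)
        # tabular Q-learning update
        target = r + gamma * (0.0 if done else Q[s_next].max())
        Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
        s = s_next

print(Q)
print("Greedy actions per state:", Q.argmax(axis=1))
```

Note that the update uses only sampled transitions $(s, a, r, s')$; no transition probabilities appear anywhere in the loop.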

6. Applications and Limitations

6a. Applications

  • Robotics: Simple discrete scenarios (e.g., grid-based navigation).
  • Game AI: Basic board games or simplified game environments.
  • Resource Management: Scheduling or routing with relatively small state spaces.

6b. Limitations

  • Scalability: Infeasible for very large or continuous state spaces (e.g., high-dimensional robotics), where the states cannot be enumerated in a table.
  • Computational Cost: The table needs one entry per state-action pair, so memory and sweep time grow with $|S| \times |A|$; because $|S|$ itself typically grows exponentially with the number of state variables, huge tables quickly become impractical.

7. Summary

Q-learning provides a simple yet powerful way to learn optimal policies directly from experience, with no need for a transition model. But as soon as the state-action space explodes, we look for ways to approximate Q-values.

In the next lessons, we’ll see how crucial parameters (learning rate, discount factor, exploration) shape Q-learning and how Q-learning transitions to approximate and deep methods for real-world-scale problems.
