1. Introduction
In the previous lessons, we explored Exact Solution Methods (Value Iteration, Policy Iteration) and the fundamentals of MDPs. Now, we shift to model-free reinforcement learning, where the agent does not require a full model of the environment’s dynamics.
A foundational model-free RL algorithm is Q-learning. It enables an agent to learn the optimal value of taking any action in any state, ultimately guiding it to choose actions that maximize long-term rewards. Unlike dynamic programming approaches that rely on knowing $P(s' \mid s, a)$, Q-learning learns directly from experience.
2. How Q-Learning Works
- Agent-Environment Interaction
  - At each time step $t$, the agent observes the current state $s_t$.
  - The agent selects an action $a_t$.
  - The environment responds with a reward $r_t$ and a new state $s_{t+1}$ (see the interaction-loop sketch after this list).
- Q-Values
  - A Q-value, $Q(s, a)$, estimates the expected discounted return for taking action $a$ in state $s$ and then continuing optimally thereafter.
  - Over many interactions, the agent refines these Q-values to approach the optimal action-value function $Q^*(s, a)$.
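To make the interaction loop concrete, here is a minimal sketch in Python. The `env` object, its `reset()`/`step()` interface, and the state/action counts are assumptions made purely for illustration; the agent here acts greedily on its Q-table and does not learn yet.

```python
import numpy as np

# Hypothetical sizes for a small discrete environment (illustration only).
n_states, n_actions = 16, 4

# Q-table: one row per state, one column per action, initialized to zero.
Q = np.zeros((n_states, n_actions))

def run_episode(env, Q):
    """One episode of agent-environment interaction (no learning yet)."""
    s = env.reset()                  # observe the initial state s_t
    done, total_reward = False, 0.0
    while not done:
        a = int(np.argmax(Q[s]))     # act greedily w.r.t. the current Q-values
        s, r, done = env.step(a)     # environment returns reward r_t and next state s_{t+1}
        total_reward += r
    return total_reward
```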
3. Bellman Equation for Q-values
The Bellman Equation underpins Q-learning by expressing $Q^*(s, a)$ in terms of immediate and future rewards:
$$Q^*(s,a) = \sum_{s'} P(s' \mid s, a) \Big[ R(s,a,s') + \gamma \max_{a'}Q^*(s', a') \Big].$$
- $Q^*(s,a)$: Optimal Q-value.
- $\max_{a'}Q^*(s', a')$: Value of the best (greedy) action available in the next state $s'$.
- $\gamma$: Discount factor.
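As a worked illustration with invented numbers: suppose taking $a$ in $s$ leads to $s_1'$ with probability 0.8 and reward 1, or to $s_2'$ with probability 0.2 and reward 0, with $\gamma = 0.9$, $\max_{a'} Q^*(s_1', a') = 2$, and $\max_{a'} Q^*(s_2', a') = 5$. Then
$$Q^*(s,a) = 0.8\,\bigl[1 + 0.9 \cdot 2\bigr] + 0.2\,\bigl[0 + 0.9 \cdot 5\bigr] = 2.24 + 0.9 = 3.14.$$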
4. Q-Value Iteration
Using the Bellman Equation, we iteratively update estimates of $Q^*(s,a)$ until convergence:
$$Q_{k+1}(s, a)\;\leftarrow\;\sum_{s'} P(s'\mid s, a)\Big[R(s,a,s') + \gamma \max_{a'}Q_k(s', a')\Big].$$
In practice, the transition probabilities $P(s' \mid s, a)$ are usually unknown, so exact iteration is replaced by sampling from the real environment: we observe transitions $(s,a,r,s')$ directly and update our Q-values from those samples.
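Here is a minimal sketch of the model-based iteration in Python, assuming the transition probabilities `P[s, a, s']` and rewards `R[s, a, s']` are known; the MDP below is randomly generated purely for illustration.

```python
import numpy as np

# Hypothetical MDP with known dynamics (illustration only).
n_states, n_actions = 3, 2
P = np.random.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = np.random.uniform(-1, 1, size=(n_states, n_actions, n_states))      # R[s, a, s']
gamma = 0.95

Q = np.zeros((n_states, n_actions))
for _ in range(1000):
    # Bellman backup: expectation over s' of reward plus discounted max over a'.
    Q_new = np.einsum("sap,sap->sa", P, R + gamma * Q.max(axis=1))
    if np.max(np.abs(Q_new - Q)) < 1e-8:   # stop once updates become negligible
        Q = Q_new
        break
    Q = Q_new
```

Because this backup is a contraction for $\gamma < 1$, the iterates $Q_k$ converge to $Q^*$ from any finite initialization.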
5. (Tabular) Q-Learning
When the state-action space is manageable, we can store Q-values in a table (a 2D array for discrete $s$ and $a$). The tabular Q-learning update for each experience tuple $(s,a,r,s')$ is:
$$Q_{k+1}(s, a)\;\leftarrow\;(1 - \alpha) \, Q_k(s, a) \;+\;\alpha\biggl[r \;+\; \gamma \max_{a'}Q_k(s', a')\biggr].$$
- $\alpha$: Learning rate.
- $\gamma$: Discount factor.
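Below is a minimal sketch of this update rule in Python. The `env` interface, the episode count, and the hyperparameter values ($\alpha$, $\gamma$, $\varepsilon$) are illustrative assumptions, and exploration uses a simple $\varepsilon$-greedy rule (discussed further in the next lessons).

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration (sketch)."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()                        # start a new episode
        done = False
        while not done:
            # Epsilon-greedy: explore with probability epsilon, otherwise exploit.
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)      # hypothetical env API: (s', r, done)
            # Tabular Q-learning update (the equation above); no bootstrap from terminal states.
            target = r if done else r + gamma * np.max(Q[s_next])
            Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
            s = s_next
    return Q
```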
6. Applications and Limitations
6a. Applications
- Robotics: Simple discrete scenarios (e.g., grid-based navigation).
- Game AI: Basic board games or simplified game environments.
- Resource Management: Scheduling or routing with relatively small state spaces.
6b. Limitations
- Scalability: Infeasible for very large or continuous state spaces (e.g., high-dimensional robotics).
- Computational Cost: Iterating over huge tables is expensive, and memory grows with the size of the state-action space, which can be exponential in the number of state variables.
7. Summary
Q-learning provides a simple yet powerful way to learn optimal policies directly from experience, with no need for a transition model. But as soon as the state-action space explodes, we need ways to approximate Q-values.
In the next lessons, we’ll see how key parameters (learning rate, discount factor, exploration) shape Q-learning, and how it extends to approximate and deep methods for real-world-scale problems.