Markov Decision Processes (MDPs)

Deep Reinforcement Learning

Last updated: December 31, 2024

1. Introduction

At the core of RL is the notion that an agent learns from direct interaction with its environment to maximize cumulative reward. Unlike supervised learning, which relies on labeled examples, RL agents learn by exploring different actions and exploiting their accumulated knowledge over time.

In this lesson, we introduce the Markov Decision Process (MDP), a mathematical framework that underpins virtually all RL algorithms. By the end of this lesson, you should be able to:

  • Define the components of an MDP.
  • Understand how the Markov property shapes RL.
  • Recognize how MDPs formalize sequential decision-making.

2. Markov Decision Processes

An MDP is a formal 5-tuple $(S, A, P, R, \gamma)$ (see the code sketch after this list), where:

  • $S$: Set of states.
  • $A$: Set of actions.
  • $P(s' \mid s, a)$: Transition probability of moving to state $s'$ after taking action $a$ in state $s$.
  • $R(s, a, s')$: Immediate reward function for the transition $(s \to s')$ via action $a$.
  • $\gamma \in [0,1]$: Discount factor determining the weight of future rewards.
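
To make these components concrete, here is a minimal sketch of a tiny MDP written as plain Python dictionaries. The two states, two actions, and all probabilities and rewards below are made up for illustration; the point is only to show how $S$, $A$, $P$, $R$, and $\gamma$ fit together.

```python
# A tiny, hypothetical MDP written as plain Python data structures.
states = ["s0", "s1"]
actions = ["stay", "go"]
gamma = 0.9  # discount factor

# P[(s, a)] maps each next state s' to the transition probability P(s' | s, a).
P = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s0": 0.1, "s1": 0.9},
    ("s1", "go"):   {"s0": 0.7, "s1": 0.3},
}

# R[(s, a, s')] is the immediate reward for the transition (s -> s') via action a.
R = {
    ("s0", "stay", "s0"): 0.0,  ("s0", "stay", "s1"): 1.0,
    ("s0", "go", "s0"):   0.0,  ("s0", "go", "s1"):   1.0,
    ("s1", "stay", "s0"): -1.0, ("s1", "stay", "s1"): 0.0,
    ("s1", "go", "s0"):   -1.0, ("s1", "go", "s1"):   0.0,
}
```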

Markov Property

The next state and reward depend only on the current state and action, not on the history of earlier states and actions:

$$P(s_{t+1} = s', r_{t+1} = r \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots ) = P(s_{t+1} = s', r_{t+1} = r \mid s_t, a_t).$$
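
In code, the Markov property is what allows an environment's step function to take only the current state and action as input. The sketch below samples a next state and reward from the hypothetical `P` and `R` dictionaries above; no history is required.

```python
import random

def step(s, a, P, R):
    """Sample (s', r) given only the current state s and action a.

    Because the environment is Markovian, the distribution over the next
    state and reward is fully determined by (s, a); no earlier states or
    actions are needed.
    """
    next_states = list(P[(s, a)].keys())
    probs = list(P[(s, a)].values())
    s_next = random.choices(next_states, weights=probs, k=1)[0]
    return s_next, R[(s, a, s_next)]
```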

3. The Goal

The objective is to find a policy $\pi$ that maximizes the expected discounted sum of rewards:

$$\max_{\pi} \; \mathbb{E}\left[ \sum_{t=0}^{H} \gamma^t \; R(S_t, A_t, S_{t+1}) \;\Big|\; \pi \right].$$

  • $H$: Horizon (finite or infinite).
  • $\gamma$: Discount factor.
  • $\pi$: Policy mapping states to actions (deterministic or stochastic).

An optimal policy $\pi^*$ prescribes the best action to take in each state to maximize cumulative reward.
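
To make the objective concrete, the short sketch below computes the discounted return $\sum_{t=0}^{H} \gamma^t r_t$ for one finite episode; the reward sequence is made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Compute sum_{t=0}^{H} gamma^t * r_t for a finite list of rewards."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Hypothetical rewards collected over one episode (H = 5).
episode_rewards = [0.0, 0.0, 1.0, 0.0, -1.0, 1.0]
print(discounted_return(episode_rewards, gamma=0.99))
```

With $\gamma < 1$, rewards received later in the episode are weighted less, which is exactly what the $\gamma^t$ factor in the objective encodes.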

4. Illustrative Example

Consider a grid world where the agent aims to reach a goal state (+1 reward) while avoiding penalty states (-1 reward). If the grid world is deterministic, a single fixed plan might suffice. But in a stochastic grid world, the agent needs a policy that prescribes an action for every state it might end up in, because each transition is uncertain.
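
As a sketch of why stochasticity calls for a state-dependent policy rather than a fixed action sequence, consider a hypothetical "slippery" grid world in which the agent sometimes moves in an unintended direction; the grid size, slip probability, and policy entries below are all illustrative.

```python
import random

def noisy_move(pos, action, grid_size=4, slip_prob=0.2):
    """With probability slip_prob the agent slips in a random direction,
    so an open-loop plan (a fixed list of actions) can drift off course."""
    moves = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
    if random.random() < slip_prob:
        action = random.choice(list(moves))
    dx, dy = moves[action]
    x = min(max(pos[0] + dx, 0), grid_size - 1)  # clamp to the grid
    y = min(max(pos[1] + dy, 0), grid_size - 1)
    return (x, y)

# A policy maps each state to an action, so it can react to wherever the
# slips actually take the agent (entries here are illustrative).
policy = {(0, 0): "right", (1, 0): "right", (2, 0): "up", (2, 1): "up"}
```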

This concept extends directly to the Lunar Lander environment: the state includes the lander's position, velocity, and so on; the actions fire different thrusters; the transition dynamics describe how the lander's motion responds; and the rewards might reflect outcomes such as a stable landing.

5. Why MDPs Matter

  • General Framework: MDPs unify a wide range of tasks under one mathematical model.
  • Foundation of RL: All core RL algorithms (Q-Learning, Policy Gradients, Actor-Critic) are ultimately solving an MDP.
  • Clarity and Modularity: Breaking problems into $S, A, P, R, \gamma$ helps in designing appropriate reward signals and understanding performance.

6. Conclusion

An MDP captures the essence of decision-making in an uncertain environment. Understanding states, actions, rewards, and transitions is crucial before jumping into algorithms.

In the next lesson, we’ll explore Exact Solution Methods (Value Iteration, Policy Iteration) to see how the MDP structure lets us derive optimal strategies in small, fully-known environments. Later, we’ll move on to the more practical, large-scale RL methods we’ll implement in PyTorch, eventually tackling challenges like landing a Lunar Lander by controlling its thrusters.
