MDPs: Markov Decision Processes

Reinforcement Learning

Last updated: December 15, 2024

1. Introduction

Reinforcement Learning (RL) is a branch of machine learning focused on training agents to make sequences of decisions by interacting with an environment in order to maximize a cumulative reward. Unlike supervised learning, where models learn from labeled examples, RL agents learn through trial and error, adapting their actions based on feedback from the environment. The agent's objective is to develop a policy (a strategy for choosing actions in different situations) that leads to the highest possible reward over time. RL has been applied to diverse areas, from video games and robotics to finance and healthcare, for tasks that involve dynamic decision-making and adaptation.

Figure: Reinforcement Learning fundamental diagram

2. Markov Decision Processes

A Markov Decision Process (MDP) is a formal framework for modeling sequential decision-making tasks in reinforcement learning. In an MDP, an agent interacts with an environment through a series of states, actions, and rewards, aiming to learn a policy that maximizes cumulative reward. An MDP is defined by the tuple (S, A, P, R, γ), where S is the set of states, A is the set of actions, P is the state-transition model, R is the reward function, and γ ∈ [0, 1] is the discount factor. The transition model and reward function are given by:

$$ P(s' | s, a) = \Pr(s_{t+1} = s' | s_t = s, a_t = a) $$

$$ R(s, a, s') = \mathbb{E}[r_{t+1} | s_t = s, a_t = a, s_{t+1} = s'] $$
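
To make the tuple concrete, the sketch below encodes a tiny two-state MDP as plain Python dictionaries. The states, actions, probabilities, and rewards are illustrative assumptions, not part of the lesson.

```python
# A minimal sketch of an MDP (S, A, P, R, gamma) as plain Python data.
# The two states, two actions, and all numbers are made up for illustration.

S = ["s0", "s1"]          # state set
A = ["stay", "go"]        # action set

# Transition model: P[s][a][s'] = P(s' | s, a); probabilities sum to 1.
P = {
    "s0": {"stay": {"s0": 0.9, "s1": 0.1}, "go": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s1": 1.0},            "go": {"s0": 0.5, "s1": 0.5}},
}

# Reward model: R[s][a][s'] = R(s, a, s'), the expected reward for that transition.
R = {
    "s0": {"stay": {"s0": 0.0, "s1": 1.0}, "go": {"s0": 0.0, "s1": 1.0}},
    "s1": {"stay": {"s1": 0.0},            "go": {"s0": -1.0, "s1": 0.0}},
}

gamma = 0.95              # discount factor, gamma in [0, 1]
```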

The Markov property asserts that the next state and reward depend only on the current state and action, and not on any past states or actions, which can be expressed as:

$$ \Pr(s_{t+1} = s', r_{t+1} = r | s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0) = \Pr(s_{t+1} = s', r_{t+1} = r | s_t, a_t) $$
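
Operationally, the Markov property means a simulator needs only the current state and action to sample what happens next. The sketch below, which reuses the hypothetical P and R dictionaries from the previous snippet, samples one transition without consulting any history.

```python
import random

def step(s, a):
    """Sample (s', r) given only the current state and action.

    The signature takes no history: because the process is Markov,
    earlier states and actions are irrelevant to this transition.
    """
    next_states = list(P[s][a].keys())
    probs = list(P[s][a].values())
    s_next = random.choices(next_states, weights=probs, k=1)[0]
    return s_next, R[s][a][s_next]
```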

The horizon H defines the number of time steps over which the agent seeks to optimize rewards; it may be finite or infinite. A policy π(a|s) gives the probability of choosing action a in state s, and the goal of RL is to learn an optimal policy π* that maximizes the expected cumulative reward over the agent's lifetime.
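
A stochastic policy can likewise be stored as a table of action probabilities per state. The snippet below is a minimal sketch building on the toy MDP above (the names are assumptions): it represents π(a|s) as a nested dictionary and samples an action from it.

```python
def sample_action(policy, s):
    """Draw a ~ pi(. | s), where policy[s] maps each action to pi(a | s)."""
    actions = list(policy[s].keys())
    probs = list(policy[s].values())
    return random.choices(actions, weights=probs, k=1)[0]

# Example: a uniformly random policy over the toy MDP defined earlier.
uniform_policy = {s: {a: 1.0 / len(A) for a in A} for s in S}
```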

3. The Goal

The goal is to maximize the expected discounted sum of rewards accumulated over time. We want to find a policy π that achieves this by maximizing the following expression:

 $$ \max_{\pi} \ \mathbb{E}\left[ \sum_{t=0}^{H} \gamma^t \ R(S_t, A_t, S_{t+1}) \ | \ \pi \right] $$

where γ ∈ [0, 1] is the discount factor, H is the horizon, R(S_t, A_t, S_{t+1}) is the reward received at step t, and the expectation is taken over the trajectories obtained by following policy π.

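One way to see what this objective measures is to estimate it by simulation. The sketch below is an illustrative Monte Carlo estimate that reuses the hypothetical `step` and `sample_action` helpers from the earlier snippets: it rolls out the policy for H + 1 steps, computes the discounted sum of rewards for each trajectory, and averages over many rollouts to approximate the expectation.

```python
def discounted_return(rewards, gamma):
    """Compute sum_{t=0}^{H} gamma^t * r_t for one finite trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def estimate_value(policy, s0, H, gamma, n_rollouts=1000):
    """Monte Carlo estimate of E[ sum_t gamma^t R(S_t, A_t, S_{t+1}) | pi ]."""
    total = 0.0
    for _ in range(n_rollouts):
        s, rewards = s0, []
        for _ in range(H + 1):        # t = 0, 1, ..., H
            a = sample_action(policy, s)
            s, r = step(s, a)
            rewards.append(r)
        total += discounted_return(rewards, gamma)
    return total / n_rollouts

# Example: estimated value of the uniform random policy from state "s0".
print(estimate_value(uniform_policy, "s0", H=20, gamma=gamma))
```
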
In an MDP, we aim to find an optimal policy π* that defines the best action for each state and time step. Formally, an optimal policy π* is a mapping:

$$ \pi^*: S \times \{0, 1, \dots, H\} \rightarrow A $$

This means that the optimal policy specifies an action for each state s ∈ S at every time step from 0 up to the horizon H, chosen so as to maximize the expected sum of rewards over time.
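
For small, fully known MDPs, such a time-dependent optimal policy can be computed exactly by backward induction (finite-horizon value iteration). The sketch below is one illustrative way to do this with the dictionary-based P and R format used above; it is not the lesson's own code.

```python
def finite_horizon_value_iteration(S, A, P, R, gamma, H):
    """Backward induction over t = H, H-1, ..., 0.

    Returns the value of each state at t = 0 and a greedy policy
    pi_star[(s, t)] -> best action, i.e. a map from S x {0..H} to A.
    """
    V = {s: 0.0 for s in S}      # value after the final step is zero
    pi_star = {}
    for t in range(H, -1, -1):
        V_new = {}
        for s in S:
            best_a, best_q = None, float("-inf")
            for a in A:
                q = sum(p * (R[s][a][s_next] + gamma * V[s_next])
                        for s_next, p in P[s][a].items())
                if q > best_q:
                    best_a, best_q = a, q
            pi_star[(s, t)] = best_a
            V_new[s] = best_q
        V = V_new
    return V, pi_star
```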

In the example grid above, each cell represents a state, and the arrows indicate the actions chosen by the policy for each state. The agent is rewarded with +1 for reaching the goal state (right side) and -1 for entering a penalty state (gray cell). The optimal policy guides the agent to maximize the sum of rewards by moving toward the goal while avoiding penalties.
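
To mirror that description in code, the sketch below builds a small, hypothetical 2 x 3 grid (its layout, deterministic moves, and reward placement are assumptions, not the lesson's exact grid) and solves it with the finite-horizon value iteration routine from the previous snippet.

```python
# Hypothetical 2 x 3 grid: states are (row, col); (0, 2) is the +1 goal and
# (1, 2) is the -1 penalty cell. Moves are deterministic, bumping into a wall
# leaves the agent in place, and terminal cells lead to an absorbing "done" state.
GRID_S = [(r, c) for r in range(2) for c in range(3)] + ["done"]
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
TERMINAL = {(0, 2): 1.0, (1, 2): -1.0}

def build_grid_mdp():
    P, R = {}, {}
    for s in GRID_S:
        P[s], R[s] = {}, {}
        for a, (dr, dc) in MOVES.items():
            if s == "done" or s in TERMINAL:
                s_next, r = "done", 0.0
            else:
                row, col = s[0] + dr, s[1] + dc
                s_next = (row, col) if 0 <= row < 2 and 0 <= col < 3 else s
                r = TERMINAL.get(s_next, 0.0)
            P[s][a] = {s_next: 1.0}
            R[s][a] = {s_next: r}
    return P, R

grid_P, grid_R = build_grid_mdp()
V0, grid_pi = finite_horizon_value_iteration(GRID_S, list(MOVES), grid_P, grid_R,
                                             gamma=0.95, H=10)
print(grid_pi[((0, 0), 0)])  # best first action from the top-left cell
```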

Contrast: If the environment were deterministic, the agent would only need an optimal plan, or a fixed sequence of actions from start to goal. However, in an uncertain environment, the optimal policy must adapt at each step based on the current state.

4. Source

Foundations of Deep RL Series by Pieter Abbeel
