Stochastic Policies & Policy Optimization

Deep Reinforcement Learning

Last updated: December 31, 2024

1. Introduction

In policy gradient methods, we directly optimize a parameterized policy $\pi_\theta(a \mid s)$ rather than learning a value function for subsequent decision-making. The policy outputs a probability distribution over actions, making it stochastic. This contrasts with deterministic policies and is valuable both for exploration and for a smoother optimization landscape.

The objective is to maximize the expected sum of rewards:

$$\max_{\theta} \; \mathbb{E}_{\pi_\theta}\Bigl[ \sum_{t=0}^{H} R(s_t, a_t) \Bigr],$$

where $H$ can be a finite or infinite horizon.
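To make the idea concrete, here is a minimal sketch of a stochastic policy: a linear-softmax parameterization of $\pi_\theta(a \mid s)$ for a discrete action space. The shapes (3 actions, 4-dimensional state) and the linear form are illustrative assumptions, not anything prescribed above.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax.
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def policy(theta, s):
    # A toy linear-softmax policy pi_theta(a | s).
    # theta has shape (n_actions, state_dim); sizes are illustrative.
    return softmax(theta @ s)

rng = np.random.default_rng(0)
theta = rng.normal(size=(3, 4))   # 3 actions, 4-dim state (toy sizes)
s = rng.normal(size=4)

probs = policy(theta, s)          # a distribution over actions
a = rng.choice(3, p=probs)        # stochastic action selection
```

Sampling `a` from `probs`, rather than taking an argmax, is what makes the policy stochastic.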

2. Why Use a Stochastic Policy?

  1. Smoothing the Optimization

    • Deterministic policies can lead to non-smooth optimization landscapes (especially in high-dimensional or continuous action spaces).
    • A stochastic policy parameterized by $\theta$ yields a smoother objective, often making training more stable.
  2. Built-In Exploration

    • Randomness in action selection helps the agent discover potentially better strategies.
    • This is crucial in non-stationary or partially explored environments to avoid premature convergence.
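The exploration point can be seen in a toy comparison: a deterministic argmax over a fixed action distribution only ever visits one action, while sampling from the same distribution visits all of them. The probabilities below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
probs = np.array([0.6, 0.3, 0.1])   # toy action probabilities

# Deterministic selection: always the same action, no exploration.
greedy = [int(np.argmax(probs)) for _ in range(1000)]

# Stochastic selection: low-probability actions still get tried.
sampled = [int(rng.choice(3, p=probs)) for _ in range(1000)]

print(set(greedy))   # {0}
print(set(sampled))  # typically {0, 1, 2}
```

With 1000 draws, the chance of never sampling even the rarest action is negligible, so the stochastic policy explores the full action set.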

3. Why Policy Optimization?

  1. Direct Action Prescriptions

    • A policy directly prescribes which action to take rather than relying on a separate argmax over Q-values.
  2. Computational Simplicity

    • In continuous action spaces, computing $\arg\max_{a} Q(s,a)$ can be expensive; a policy outputs an action in a single forward pass.
  3. Compatibility with Stochastic Environments

    • Stochastic policies naturally handle uncertain or changing dynamics.
  4. Ease of Deployment

    • In robotics or continuous control, a policy network can be fed states to produce continuous torques or velocities without an external optimization step.
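For continuous control, a common choice (assumed here, not specified above) is a Gaussian policy: the network maps the state to a mean action and we sample around it. The single linear layer and fixed standard deviation below are simplifications for illustration.

```python
import numpy as np

def gaussian_policy(W, b, s, rng, std=0.1):
    # A toy Gaussian policy for continuous control: a linear "network"
    # maps the state to a mean action; sampling adds exploration noise.
    mean = W @ s + b
    return rng.normal(mean, std)   # e.g. torque or velocity command

rng = np.random.default_rng(2)
W = rng.normal(size=(2, 4))        # 2 action dims, 4-dim state (toy sizes)
b = np.zeros(2)
s = rng.normal(size=4)

a = gaussian_policy(W, b, s, rng)  # one forward pass, no inner argmax
```

Note that producing `a` requires no optimization over the action space, which is the "ease of deployment" point above.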

4. Summary

Stochastic policy optimization is a powerful paradigm for controlling complex environments. Rather than computing a Q-table or Q-function, you directly learn the mapping from state to a distribution over actions. Next, we’ll show how to optimize these stochastic policies via policy gradients and how they form the basis of algorithms like REINFORCE and PPO.
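As a small preview of the policy-gradient machinery, the score-function (REINFORCE-style) gradient for the linear-softmax parameterization has a closed form: $\nabla_\theta \log \pi_\theta(a \mid s)$ scaled by a sampled return $G$. The linear-softmax form and all sizes here are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reinforce_grad(theta, s, a, G):
    # Score-function estimator G * grad_theta log pi_theta(a | s)
    # for a linear-softmax policy: the gradient of the log-probability
    # is (onehot(a) - probs) outer s.
    probs = softmax(theta @ s)
    onehot = np.zeros_like(probs)
    onehot[a] = 1.0
    return np.outer(onehot - probs, s) * G

rng = np.random.default_rng(3)
theta = rng.normal(size=(3, 4))    # 3 actions, 4-dim state (toy sizes)
s = rng.normal(size=4)
g = reinforce_grad(theta, s, a=1, G=2.0)
```

Ascending this gradient increases the log-probability of actions that led to high returns; the next lesson develops this estimator properly.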
