Stochastic Policies & Policy Optimization
Last updated: December 31, 2024
1. Introduction
In policy gradient methods, we directly optimize a parameterized policy $\pi_\theta(a \mid s)$ rather than learning a value function and deriving decisions from it. The policy outputs a probability distribution over actions, making it stochastic. Unlike a deterministic policy, this is valuable for exploration and yields a smoother optimization problem.
The objective is to maximize the expected return:
$$\max_{\theta} \; \mathbb{E}_{\pi_\theta}\Bigl[ \sum_{t=0}^{H} R(s_t, a_t) \Bigr],$$
where $H$ can be a finite or infinite horizon.
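This expectation can be approximated by sampling trajectories under the policy and averaging their returns. Below is a minimal sketch of that Monte Carlo estimate; the two-action environment, the `toy_env_step` dynamics, and the uniform policy are all illustrative assumptions, not part of the text above.

```python
import random

def sample_action(probs, rng):
    """Draw an action index from a categorical distribution."""
    u, cum = rng.random(), 0.0
    for a, p in enumerate(probs):
        cum += p
        if u < cum:
            return a
    return len(probs) - 1

def estimate_objective(policy, env_step, s0, horizon, n_episodes=2000, seed=0):
    """Monte Carlo estimate of E_{pi_theta}[sum_{t=0}^{H} R(s_t, a_t)]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_episodes):
        s, ep_return = s0, 0.0
        for _ in range(horizon + 1):  # t = 0 .. H
            a = sample_action(policy(s), rng)
            s, r = env_step(s, a, rng)
            ep_return += r
        total += ep_return
    return total / n_episodes

# Hypothetical two-action environment: action 1 earns reward 1,
# action 0 earns reward 0; the state never changes.
def toy_env_step(s, a, rng):
    return s, float(a)

uniform_policy = lambda s: [0.5, 0.5]  # pi_theta(a|s) = 0.5 for each action
```

With this toy setup and $H = 1$, the true objective is $2 \times 0.5 = 1.0$, and the estimator converges to it as the number of sampled episodes grows.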
2. Why Use a Stochastic Policy?
- **Smoothing the Optimization**
  - Deterministic policies can lead to non-smooth optimization landscapes (especially in high-dimensional or continuous action spaces).
  - A stochastic policy parameterized by $\theta$ yields a smoother objective, often making training more stable.
- **Built-In Exploration**
  - Randomness in action selection helps the agent discover potentially better strategies.
  - This is crucial in non-stationary or partially explored environments to avoid premature convergence.
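A common way to get both properties is a softmax (categorical) policy: every action keeps nonzero probability, so exploration comes for free. The sketch below assumes a single state with hypothetical logits, purely for illustration.

```python
import math
import random

def softmax(logits):
    """Convert unnormalized scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def sample_from(probs, rng):
    """Draw an action index from a categorical distribution."""
    u, cum = rng.random(), 0.0
    for a, p in enumerate(probs):
        cum += p
        if u < cum:
            return a
    return len(probs) - 1

# Even with a strong preference for action 0, actions 1 and 2 retain
# nonzero probability, so the agent still tries them occasionally.
rng = random.Random(42)
logits = [2.0, 0.0, 0.0]   # hypothetical logits for one state
counts = [0, 0, 0]
for _ in range(10_000):
    counts[sample_from(softmax(logits), rng)] += 1
```

Because the sampling never fully commits to the highest-scoring action, the policy explores while still favoring what currently looks best.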
3. Why Policy Optimization?
- **Direct Action Prescriptions**
  - A policy directly prescribes which action to take rather than relying on a separate argmax over Q-values.
- **Computational Simplicity**
  - In continuous action spaces, computing $\arg\max_{a} Q(s,a)$ requires solving an inner optimization problem at every step; a policy outputs an action in a single forward pass.
- **Compatibility with Stochastic Environments**
  - Stochastic policies naturally handle uncertain or changing dynamics.
- **Ease of Deployment**
  - In robotics or continuous control, a policy network can be fed states to produce continuous torques or velocities without an external optimization step.
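For continuous control, a standard choice is a Gaussian policy: the network outputs the mean of the action distribution and an action is sampled around it. The sketch below uses a linear mean head and a hypothetical two-dimensional state (say, a joint angle and velocity); the weights are made-up placeholders for learned parameters.

```python
import math
import random

def gaussian_policy_action(state, weights, log_std, rng):
    """Gaussian policy sketch for continuous control:
    pi_theta(a|s) = N(mu_theta(s), sigma^2) with a linear mean head.
    One forward pass yields an action; no inner argmax over Q(s, a)
    is needed, unlike value-based control."""
    mu = sum(w * x for w, x in zip(weights, state))  # mu_theta(s)
    sigma = math.exp(log_std)                        # sigma > 0 by construction
    return rng.gauss(mu, sigma)

rng = random.Random(0)
state = [0.5, -1.0]    # hypothetical joint angle and velocity
weights = [2.0, 0.1]   # hypothetical learned parameters
torque = gaussian_policy_action(state, weights, log_std=-1.0, rng=rng)
```

Parameterizing the standard deviation as $\exp(\log \sigma)$ keeps it positive without constraints, and shrinking it over training smoothly interpolates toward a near-deterministic controller.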
4. Summary
Stochastic policy optimization is a powerful paradigm for controlling complex environments. Rather than learning a Q-table or Q-function and acting greedily with respect to it, you directly learn the mapping from a state to a distribution over actions. Next, we'll show how to optimize these stochastic policies via policy gradients and how they form the basis of algorithms like REINFORCE and PPO.