Reinforcement Learning: Markov Decision Processes

Questions and Answers

What does the Transition Function (P) in a Markov Decision Process represent?

  • The immediate reward received after taking an action.
  • The finite set of possible actions available to the agent.
  • The estimate of the maximum expected return for any state.
  • The probability of transitioning from one state to another. (correct)

Which of the following best describes a deterministic policy in MDPs?

  • A specific action is chosen for every possible state. (correct)
  • The policy does not depend on the initial state.
  • Actions are selected based on a random distribution.
  • It adapts over time based on feedback from the environment.

What is the role of the Discount Factor (γ) in a Markov Decision Process?

  • It influences the immediate rewards only.
  • It dictates the agent's long-term strategy by affecting future rewards. (correct)
  • It determines the probability of state transitions.
  • It defines the types of policies applicable to the MDP.

What does the Markov Property imply in the context of MDPs?

  • Only the current state and action impact the next state. (correct)

Which of the following methods is NOT typically used to solve Markov Decision Processes?

  • Simulated Annealing (correct)

In the context of MDPs, what does the Action Value Function (Q) represent?

  • The maximum expected return given a specific action from a state. (correct)

Which field does NOT typically apply Markov Decision Processes?

  • Cooking (correct)

What does the reward function (R) indicate in a Markov Decision Process?

  • The immediate reward received after performing an action. (correct)

Study Notes

Reinforcement Learning: Markov Decision Processes

  • Definition: Markov Decision Process (MDP) is a mathematical framework for modeling decision-making in environments where outcomes are partly random and partly under the control of a decision-maker.

  • Components of MDP:

    1. States (S): A finite set of all possible states in the environment.
    2. Actions (A): A finite set of actions available to the agent.
    3. Transition Function (P): Defines the probability of transitioning from one state to another given a specific action, denoted as P(s' | s, a).
    4. Reward Function (R): Provides feedback to the agent, representing the immediate reward received after performing an action in a given state, denoted as R(s, a).
    5. Discount Factor (γ): A value between 0 and 1 that determines the present value of future rewards, influencing the agent’s long-term strategy.
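The five components above can be written down concretely. A minimal sketch in Python, using a hypothetical two-state, two-action MDP (the state and action names are purely illustrative):

```python
# A tiny hypothetical MDP illustrating the (S, A, P, R, gamma) components.

S = ["s0", "s1"]                 # States: finite set
A = ["stay", "go"]               # Actions: finite set

# Transition Function P(s' | s, a), stored as P[(s, a)] -> {s': probability}
P = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s0": 0.7, "s1": 0.3},
}

# Reward Function R(s, a): immediate reward for taking a in s
R = {
    ("s0", "stay"): 0.0, ("s0", "go"): 1.0,
    ("s1", "stay"): 2.0, ("s1", "go"): 0.0,
}

gamma = 0.9                      # Discount Factor, between 0 and 1

# Sanity check: each transition distribution must sum to 1.
assert all(abs(sum(d.values()) - 1.0) < 1e-9 for d in P.values())
```

Storing P and R as dictionaries keyed by (state, action) keeps the tabular structure explicit; larger problems would typically use arrays instead.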
  • Properties:

    • Markov Property: The future state depends only on the current state and action, not on the sequence of events that preceded it.
    • Stationarity: The transition and reward functions are typically assumed to be stationary, meaning they do not change over time.
  • Goal: The primary objective in an MDP is to find a policy (π) that maximizes the expected cumulative reward, often represented as:

    • Value Function (V): V(s) estimates the maximum expected return starting from state s.
    • Action Value Function (Q): Q(s, a) estimates the maximum expected return starting from state s, taking action a.
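The optimal value functions above satisfy the Bellman optimality equations, written here in terms of the components S, A, P, R, and γ defined earlier:

```latex
V^{*}(s) = \max_{a \in A} \Big[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V^{*}(s') \Big]

Q^{*}(s, a) = R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, \max_{a' \in A} Q^{*}(s', a')
```

Note that V*(s) = max over a of Q*(s, a): the state value is simply the action value of the best action available in that state.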
  • Types of Policies:

    1. Deterministic Policy: A specific action is chosen for each state.
    2. Stochastic Policy: Actions are chosen based on a probability distribution over actions for each state.
  • Solving MDPs:

    • Dynamic Programming: Techniques like Value Iteration and Policy Iteration are used to compute optimal policies and value functions.
    • Reinforcement Learning Algorithms: Methods such as Q-Learning and SARSA can be employed to learn optimal policies from interaction with the environment.
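Value Iteration, mentioned above, repeatedly applies the Bellman optimality backup until the values stop changing. A minimal sketch, assuming the same tabular dictionary representation of S, A, P, R, and gamma used for illustration earlier:

```python
# Minimal Value Iteration sketch for a tabular MDP.
# Assumes P[(s, a)] maps next states to probabilities and
# R[(s, a)] gives the immediate reward (illustrative representation).

def value_iteration(S, A, P, R, gamma, tol=1e-8):
    V = {s: 0.0 for s in S}                      # initialize V(s) = 0
    while True:
        delta = 0.0
        for s in S:
            # Bellman optimality backup: best action value from s
            v_new = max(
                R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
                for a in A
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:                          # stop once values converge
            break
    # Extract a greedy (deterministic) policy from the converged values
    pi = {
        s: max(A, key=lambda a: R[(s, a)]
               + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items()))
        for s in S
    }
    return V, pi
```

Policy Iteration alternates full policy evaluation with greedy improvement instead, and Q-Learning estimates Q(s, a) from sampled transitions without needing P at all; the backup inside the loop is the same Bellman idea in each case.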
  • Applications: MDPs are widely used in various fields, including robotics, finance, healthcare, and artificial intelligence, where decision-making under uncertainty is essential.
