Questions and Answers
What does the state transition probability matrix P represent in a Markov Reward Process?
- The set of possible rewards
- The probability of moving from one state to another (correct)
- The total number of states
- The actions available in the system
In non-deterministic planning, all future states after an action can be predicted with certainty.
False (B)
What is a key challenge when expecting the unexpected in non-deterministic problems?
Uncertain future states
In a Markov Reward Process, the function that provides feedback in terms of rewards is known as the ______.
reward function
Match the following components of a Markov Reward Process with their definitions:
What does the parameter $\gamma$ (gamma) represent in a Markov Decision Process?
The discount factor applied to future rewards
In a Markov Decision Process, the actions taken have no impact on the state transitions.
False
What are the components of a Markov Decision Process?
States (S), actions (A), transition probabilities (P), rewards (R), and a discount factor (gamma)
In a finite horizon MDP, the process terminates after ___ steps.
Match the following terms to their descriptions:
Study Notes
Non-Deterministic Problems
- Traditional planning assumes deterministic transitions, meaning actions have predictable outcomes.
- Non-deterministic problems introduce uncertainty, making outcomes unpredictable.
- Examples include the goat, wolf, and cabbage scenario, where factors like the wolf's hunger or the boat's stability can alter the outcome.
- Transitions become stochastic functions represented by P(s'|s, a), defining the probability of reaching state s' from state s after performing action a (see the sketch after this list).
- Rewards become stochastic, making it harder to predict the outcome of actions.
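A minimal Python sketch of such a stochastic transition function (the states, action names, probabilities, and the names `P` and `sample_next_state` are invented for illustration, not taken from the notes):

```python
import random

# Hypothetical transition table for one step of the goat/wolf/cabbage crossing:
# P[(state, action)] maps each possible successor state to its probability.
P = {
    ("goat_on_left", "cross_with_goat"): {
        "goat_on_right": 0.9,  # the crossing succeeds
        "goat_on_left": 0.1,   # the boat wobbles and we have to turn back
    },
}

def sample_next_state(state, action):
    """Sample s' according to P(s' | s, a)."""
    successors = P[(state, action)]
    outcomes = list(successors.keys())
    probabilities = list(successors.values())
    return random.choices(outcomes, weights=probabilities, k=1)[0]

print(sample_next_state("goat_on_left", "cross_with_goat"))
```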
Markov Reward Process
- Markov Reward Process (MRP) models non-deterministic scenarios with a fixed action for each state.
- It consists of states (S), transition probabilities (P), rewards (R), and a discount factor (gamma).
- P(s'|s) represents the probability of reaching state s' from state s.
- R(s) defines the expected reward obtained in state s.
- The discount factor gamma discounts future rewards, favoring immediate rewards over long-term benefits (a toy MRP is sketched below).
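A minimal sketch of an MRP (all states, probabilities, and rewards below are made up for illustration): a trajectory is sampled under the fixed transition model and the discounted reward is accumulated along the way.

```python
import random

# Toy Markov Reward Process: states S, transitions P(s'|s), rewards R(s), discount gamma.
states = ["start", "middle", "goal"]
P = {
    "start":  {"start": 0.2, "middle": 0.8},
    "middle": {"start": 0.3, "goal": 0.7},
    "goal":   {"goal": 1.0},           # absorbing terminal state
}
R = {"start": 0.0, "middle": 1.0, "goal": 10.0}
gamma = 0.9

def sampled_return(s="start", steps=50):
    """Accumulate gamma-discounted rewards along one sampled trajectory."""
    total, discount = 0.0, 1.0
    for _ in range(steps):
        total += discount * R[s]
        discount *= gamma
        successors = P[s]
        s = random.choices(list(successors), weights=list(successors.values()), k=1)[0]
    return total

print(round(sampled_return(), 2))
```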
Markov Decision Process
- Markov Decision Process (MDP) extends MRP by adding actions and their impact on state transitions.
- It includes the same elements as MRP plus a set of actions (A).
- P(s'|s, a) represents the probability of reaching state s' from state s after performing action a.
- MDP allows comparing different policies on the same environment.
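A minimal sketch of how actions change the picture (again with invented states, actions, and numbers): the transition table is now indexed by (s, a), and two different policies can be rolled out on the same model and compared.

```python
import random

# Toy MDP: P is indexed by (state, action); R gives the reward of each state.
P = {
    ("s0", "safe"):  {"s0": 0.9, "s1": 0.1},
    ("s0", "risky"): {"s0": 0.5, "s1": 0.5},
    ("s1", "safe"):  {"s1": 1.0},
    ("s1", "risky"): {"s1": 1.0},
}
R = {"s0": 0.0, "s1": 1.0}
gamma = 0.95

def rollout(policy, s="s0", steps=100):
    """Discounted return of one rollout under a (state -> action) policy."""
    total, discount = 0.0, 1.0
    for _ in range(steps):
        total += discount * R[s]
        discount *= gamma
        successors = P[(s, policy[s])]
        s = random.choices(list(successors), weights=list(successors.values()), k=1)[0]
    return total

cautious = {"s0": "safe",  "s1": "safe"}
daring   = {"s0": "risky", "s1": "safe"}
print(rollout(cautious), rollout(daring))  # compare two policies on the same MDP
```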
Finite and Infinite Horizon MDPs
- Finite horizon MDPs have a defined number of steps before termination.
- In the finite horizon case, optimal policies are generally non-stationary, meaning the best action can change depending on how many steps remain.
- Infinite horizon MDPs potentially continue forever or until a terminal state is reached.
- In the infinite horizon case, policies can be stationary, remaining consistent across time.
- The discount factor (gamma) is usually less than 1, giving more weight to immediate rewards and ensuring that the sum of discounted rewards converges.
- For gamma = 1, ergodic Markov processes can instead be optimized with respect to the average reward per step.
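For reference, the discounted return behind these statements can be written as follows (standard notation, assuming rewards are bounded by some $R_{\max}$):

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \qquad |G_t| \le \frac{R_{\max}}{1-\gamma} \text{ for } \gamma < 1$$

The geometric-series bound is what makes the infinite sum converge when $\gamma < 1$; for $\gamma = 1$ one needs either a finite horizon or the average-reward criterion mentioned above.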
Evaluating Policies
- The value function (or utility) represents the expected reward obtained from a state s when following a policy π.
- It is denoted v_π(s) and is calculated as the sum of discounted rewards over future states (formal definitions below).
- The action-value function q_π(s, a) defines the expected reward obtained from state s after performing action a and then following policy π.
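In standard notation, these two functions are the expected discounted return conditioned on the starting state (and action):

$$v_\pi(s) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid S_t = s\right], \qquad q_\pi(s, a) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid S_t = s, A_t = a\right]$$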
Bellman’s Equations
- The Bellman Equation states that the value of a state is equal to the immediate reward plus the discounted expected value of its successor states.
- The Bellman Optimality Equation defines the optimal value function v*(s) and the optimal policy π* that maximizes the expected reward for each state.
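Written with the notation used in these notes, where R(s) is the expected reward in state s and π(a|s) is the probability of choosing action a in s, the standard forms of the two equations are:

$$v_\pi(s) = R(s) + \gamma \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\, v_\pi(s')$$

$$v_*(s) = R(s) + \gamma \max_a \sum_{s'} P(s' \mid s, a)\, v_*(s'), \qquad \pi_*(s) = \arg\max_a \sum_{s'} P(s' \mid s, a)\, v_*(s')$$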
Policy Evaluation
- A simplified process compared to solving the Bellman optimality equation.
- It calculates the value function based on the Bellman Expectation Equation, without the maximum operator.
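A minimal sketch of iterative policy evaluation (toy MDP with invented numbers): the loop applies the Bellman Expectation update until the values stop changing.

```python
# Toy MDP: P[(s, a)][s'] = transition probability, R[s] = reward in state s.
P = {
    ("s0", "go"):   {"s1": 1.0},
    ("s0", "stay"): {"s0": 1.0},
    ("s1", "go"):   {"s1": 1.0},
    ("s1", "stay"): {"s1": 1.0},
}
R = {"s0": 0.0, "s1": 1.0}
policy = {"s0": "go", "s1": "stay"}   # fixed policy to evaluate
gamma, theta = 0.9, 1e-8              # discount factor and convergence threshold

V = {s: 0.0 for s in R}
while True:
    delta = 0.0
    for s in V:
        # Bellman Expectation update: immediate reward + discounted expected successor value.
        new_v = R[s] + gamma * sum(p * V[s2] for s2, p in P[(s, policy[s])].items())
        delta = max(delta, abs(new_v - V[s]))
        V[s] = new_v
    if delta < theta:
        break

print(V)   # approximate v_pi for each state
```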
Value Iteration
- An iterative algorithm aiming to find the optimal policy.
- It repeatedly updates the value function for each state using the Bellman Optimality Equation until convergence.
- The process continues by maximizing the expected value of successor states.
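A minimal sketch of value iteration on the same kind of toy MDP (invented numbers): each sweep applies the Bellman Optimality update, and a greedy policy is read off once the values have converged.

```python
# Toy MDP: P[(s, a)][s'] = transition probability, R[s] = reward in state s.
actions = ["safe", "risky"]
P = {
    ("s0", "safe"):  {"s0": 1.0},
    ("s0", "risky"): {"s1": 0.8, "s0": 0.2},
    ("s1", "safe"):  {"s1": 1.0},
    ("s1", "risky"): {"s1": 1.0},
}
R = {"s0": 0.0, "s1": 1.0}
gamma, theta = 0.9, 1e-8

V = {s: 0.0 for s in R}
while True:
    delta = 0.0
    for s in V:
        # Bellman Optimality update: maximize the expected value of successor states.
        best = max(sum(p * V[s2] for s2, p in P[(s, a)].items()) for a in actions)
        new_v = R[s] + gamma * best
        delta = max(delta, abs(new_v - V[s]))
        V[s] = new_v
    if delta < theta:
        break

# Extract a greedy policy from the converged value function.
pi = {s: max(actions, key=lambda a: sum(p * V[s2] for s2, p in P[(s, a)].items()))
      for s in V}
print(V, pi)
```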
Policy Iteration
- An iterative process involving policy evaluation and policy improvement.
- It iterates between evaluating the current policy (prediction) and improving the policy based on the calculated value function (control).
- The process continues until the policy converges to the optimal policy.
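A minimal sketch of policy iteration (same toy MDP, invented numbers): it alternates full policy evaluation (prediction) with greedy policy improvement (control) until the policy stops changing.

```python
# Toy MDP: P[(s, a)][s'] = transition probability, R[s] = reward in state s.
actions = ["safe", "risky"]
P = {
    ("s0", "safe"):  {"s0": 1.0},
    ("s0", "risky"): {"s1": 0.8, "s0": 0.2},
    ("s1", "safe"):  {"s1": 1.0},
    ("s1", "risky"): {"s1": 1.0},
}
R = {"s0": 0.0, "s1": 1.0}
gamma, theta = 0.9, 1e-8

def evaluate(policy):
    """Policy evaluation: Bellman Expectation sweeps until the values converge."""
    V = {s: 0.0 for s in R}
    while True:
        delta = 0.0
        for s in V:
            new_v = R[s] + gamma * sum(p * V[s2] for s2, p in P[(s, policy[s])].items())
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:
            return V

policy = {s: "safe" for s in R}        # arbitrary initial policy
while True:
    V = evaluate(policy)               # prediction step
    improved = {s: max(actions, key=lambda a: sum(p * V[s2] for s2, p in P[(s, a)].items()))
                for s in R}            # control step: act greedily w.r.t. V
    if improved == policy:             # stable policy => optimal
        break
    policy = improved

print(policy, V)
```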
Partially Observable MDPs (POMDPs)
- POMDPs handle scenarios where agents do not know the exact state, but only receive incomplete observations about it.
- They are useful for modeling environments with hidden information like Fog of War or card games.
- Policies are based on the belief state (b), which is a probability distribution over possible states given the observation history.
- Belief states evolve based on the observed information and the executed action.
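A standard way to write the belief update (assuming an observation model O(o | s', a) giving the probability of receiving observation o in state s' after action a; this notation is not spelled out in the notes above):

$$b'(s') = \frac{O(o \mid s', a) \sum_{s} P(s' \mid s, a)\, b(s)}{\sum_{s''} O(o \mid s'', a) \sum_{s} P(s'' \mid s, a)\, b(s)}$$

The numerator combines what the action predicts with what the observation reveals; the denominator normalizes the result into a probability distribution.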
Fixed Conditional Plans
- A set of conditional plans describes sequences of possible observations and actions.
- They offer a method to define policies based on observations rather than the whole state.
- Conditional plans are often used to compute the value function in POMDPs.
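In the standard formulation (often called the α-vector view; the symbol α_p is introduced here only for illustration), each conditional plan p has a value α_p(s) when executed from state s, and the value of a belief b is obtained by taking the best plan in expectation:

$$V(b) = \max_{p} \sum_{s} b(s)\, \alpha_p(s)$$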