Questions and Answers
What does the state transition probability matrix P represent in a Markov Reward Process?
- The set of possible rewards
- The probability of moving from one state to another (correct)
- The total number of states
- The actions available in the system
In non-deterministic planning, all future states after an action can be predicted with certainty.
False (B)
What is a key challenge when expecting the unexpected in non-deterministic problems?
Uncertain future states
In a Markov Reward Process, the function that provides feedback in terms of rewards is known as the ______.
reward function
Match the following components of a Markov Reward Process with their definitions:
What does the parameter $\gamma$ (gamma) represent in a Markov Decision Process?
The discount factor applied to future rewards
In a Markov Decision Process, the actions taken have no impact on the state transitions.
False
What are the components of a Markov Decision Process?
States (S), actions (A), transition probabilities (P), rewards (R), and a discount factor (gamma)
In a finite horizon MDP, the process terminates after ___ steps.
Match the following terms to their descriptions:
Study Notes
Non-Deterministic Problems
- Traditional planning assumes deterministic transitions, meaning actions have predictable outcomes.
- Non-deterministic problems introduce uncertainty, making outcomes unpredictable.
- Examples include the goat, wolf, and cabbage scenario, where factors like the wolf's hunger or the boat's stability can alter the outcome.
- Transitions become stochastic functions represented by P(s'|s, a), defining the probability of reaching state s' from state s after performing action a (see the sketch after this list).
- Rewards become stochastic, making it harder to predict the outcome of actions.
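A minimal Python sketch of such a stochastic transition function (the states, action names, probabilities, and the names `P` and `sample_next_state` are invented for illustration, not taken from the notes):

```python
import random

# Hypothetical transition table for one step of the goat/wolf/cabbage crossing:
# P[(state, action)] maps each possible successor state to its probability.
P = {
    ("goat_on_left", "cross_with_goat"): {
        "goat_on_right": 0.9,  # the crossing succeeds
        "goat_on_left": 0.1,   # the boat wobbles and we have to turn back
    },
}

def sample_next_state(state, action):
    """Sample s' according to P(s' | s, a)."""
    successors = P[(state, action)]
    outcomes = list(successors.keys())
    probabilities = list(successors.values())
    return random.choices(outcomes, weights=probabilities, k=1)[0]

print(sample_next_state("goat_on_left", "cross_with_goat"))
```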
Markov Reward Process
- Markov Reward Process (MRP) models non-deterministic scenarios with a fixed action for each state.
- It consists of states (S), transition probabilities (P), rewards (R), and a discount factor (gamma).
- P(s'|s) represents the probability of reaching state s' from state s.
- R(s) defines the expected reward obtained in state s.
- The discount factor gamma discounts future rewards, favoring immediate rewards over long-term benefits (a toy MRP is sketched below).
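A minimal sketch of an MRP (all states, probabilities, and rewards below are made up for illustration): a trajectory is sampled under the fixed transition model and the discounted reward is accumulated along the way.

```python
import random

# Toy Markov Reward Process: states S, transitions P(s'|s), rewards R(s), discount gamma.
states = ["start", "middle", "goal"]
P = {
    "start":  {"start": 0.2, "middle": 0.8},
    "middle": {"start": 0.3, "goal": 0.7},
    "goal":   {"goal": 1.0},           # absorbing terminal state
}
R = {"start": 0.0, "middle": 1.0, "goal": 10.0}
gamma = 0.9

def sampled_return(s="start", steps=50):
    """Accumulate gamma-discounted rewards along one sampled trajectory."""
    total, discount = 0.0, 1.0
    for _ in range(steps):
        total += discount * R[s]
        discount *= gamma
        successors = P[s]
        s = random.choices(list(successors), weights=list(successors.values()), k=1)[0]
    return total

print(round(sampled_return(), 2))
```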
Markov Decision Process
- Markov Decision Process (MDP) extends MRP by adding actions and their impact on state transitions.
- It includes the same elements as MRP plus a set of actions (A).
- P(s'|s, a) represents the probability of reaching state s' from state s after performing action a.
- MDP allows comparing different policies on the same environment.
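A minimal sketch of how actions change the picture (again with invented states, actions, and numbers): the transition table is now indexed by (s, a), and two different policies can be rolled out on the same model and compared.

```python
import random

# Toy MDP: P is indexed by (state, action); R gives the reward of each state.
P = {
    ("s0", "safe"):  {"s0": 0.9, "s1": 0.1},
    ("s0", "risky"): {"s0": 0.5, "s1": 0.5},
    ("s1", "safe"):  {"s1": 1.0},
    ("s1", "risky"): {"s1": 1.0},
}
R = {"s0": 0.0, "s1": 1.0}
gamma = 0.95

def rollout(policy, s="s0", steps=100):
    """Discounted return of one rollout under a (state -> action) policy."""
    total, discount = 0.0, 1.0
    for _ in range(steps):
        total += discount * R[s]
        discount *= gamma
        successors = P[(s, policy[s])]
        s = random.choices(list(successors), weights=list(successors.values()), k=1)[0]
    return total

cautious = {"s0": "safe",  "s1": "safe"}
daring   = {"s0": "risky", "s1": "safe"}
print(rollout(cautious), rollout(daring))  # compare two policies on the same MDP
```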
Finite and Infinite Horizon MDPs
- Finite horizon MDPs have a defined number of steps before termination.
- In the finite horizon case, optimal policies are generally non-stationary, meaning the best action can change depending on how many steps remain.
- Infinite horizon MDPs potentially continue forever or until a terminal state is reached.
- In the infinite horizon case, policies can be stationary, remaining consistent across time.
- The discount factor (gamma) is usually less than 1, giving more weight to immediate rewards and ensuring that the sum of discounted rewards converges.
- For gamma = 1, ergodic Markov processes can instead be optimized with respect to the average reward per step.
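For reference, the discounted return behind these statements can be written as follows (standard notation, assuming rewards are bounded by some $R_{\max}$):

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \qquad |G_t| \le \frac{R_{\max}}{1-\gamma} \text{ for } \gamma < 1$$

The geometric-series bound is what makes the infinite sum converge when $\gamma < 1$; for $\gamma = 1$ one needs either a finite horizon or the average-reward criterion mentioned above.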
Evaluating Policies
- The value function (or utility) represents the expected reward obtained from a state s when following a policy π.
- It is denoted v_π(s) and is calculated as the sum of discounted rewards over future states (formal definitions below).
- The action-value function q_π(s, a) defines the expected reward obtained from state s after performing action a and then following policy π.
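In standard notation, these two functions are the expected discounted return conditioned on the starting state (and action):

$$v_\pi(s) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid S_t = s\right], \qquad q_\pi(s, a) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid S_t = s, A_t = a\right]$$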
Bellman’s Equations
- The Bellman Equation states that the value of a state is equal to the immediate reward plus the discounted expected value of its successor states.
- The Bellman Optimality Equation defines the optimal value function v*(s) and the optimal policy π* that maximizes the expected reward for each state.
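Written with the notation used in these notes, where R(s) is the expected reward in state s and π(a|s) is the probability of choosing action a in s, the standard forms of the two equations are:

$$v_\pi(s) = R(s) + \gamma \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\, v_\pi(s')$$

$$v_*(s) = R(s) + \gamma \max_a \sum_{s'} P(s' \mid s, a)\, v_*(s'), \qquad \pi_*(s) = \arg\max_a \sum_{s'} P(s' \mid s, a)\, v_*(s')$$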
Policy Evaluation
- A simplified process compared to solving the Bellman optimality equation.
- It calculates the value function based on the Bellman Expectation Equation, without the maximum operator.
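A minimal sketch of iterative policy evaluation (toy MDP with invented numbers): the loop applies the Bellman Expectation update until the values stop changing.

```python
# Toy MDP: P[(s, a)][s'] = transition probability, R[s] = reward in state s.
P = {
    ("s0", "go"):   {"s1": 1.0},
    ("s0", "stay"): {"s0": 1.0},
    ("s1", "go"):   {"s1": 1.0},
    ("s1", "stay"): {"s1": 1.0},
}
R = {"s0": 0.0, "s1": 1.0}
policy = {"s0": "go", "s1": "stay"}   # fixed policy to evaluate
gamma, theta = 0.9, 1e-8              # discount factor and convergence threshold

V = {s: 0.0 for s in R}
while True:
    delta = 0.0
    for s in V:
        # Bellman Expectation update: immediate reward + discounted expected successor value.
        new_v = R[s] + gamma * sum(p * V[s2] for s2, p in P[(s, policy[s])].items())
        delta = max(delta, abs(new_v - V[s]))
        V[s] = new_v
    if delta < theta:
        break

print(V)   # approximate v_pi for each state
```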
Value Iteration
- An iterative algorithm aiming to find the optimal policy.
- It repeatedly updates the value function for each state using the Bellman Optimality Equation until convergence.
- The process continues by maximizing the expected value of successor states.
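A minimal sketch of value iteration on the same kind of toy MDP (invented numbers): each sweep applies the Bellman Optimality update, and a greedy policy is read off once the values have converged.

```python
# Toy MDP: P[(s, a)][s'] = transition probability, R[s] = reward in state s.
actions = ["safe", "risky"]
P = {
    ("s0", "safe"):  {"s0": 1.0},
    ("s0", "risky"): {"s1": 0.8, "s0": 0.2},
    ("s1", "safe"):  {"s1": 1.0},
    ("s1", "risky"): {"s1": 1.0},
}
R = {"s0": 0.0, "s1": 1.0}
gamma, theta = 0.9, 1e-8

V = {s: 0.0 for s in R}
while True:
    delta = 0.0
    for s in V:
        # Bellman Optimality update: maximize the expected value of successor states.
        best = max(sum(p * V[s2] for s2, p in P[(s, a)].items()) for a in actions)
        new_v = R[s] + gamma * best
        delta = max(delta, abs(new_v - V[s]))
        V[s] = new_v
    if delta < theta:
        break

# Extract a greedy policy from the converged value function.
pi = {s: max(actions, key=lambda a: sum(p * V[s2] for s2, p in P[(s, a)].items()))
      for s in V}
print(V, pi)
```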
Policy Iteration
- An iterative process involving policy evaluation and policy improvement.
- It iterates between evaluating the current policy (prediction) and improving the policy based on the calculated value function (control).
- The process continues until the policy converges to the optimal policy.
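A minimal sketch of policy iteration (same toy MDP, invented numbers): it alternates full policy evaluation (prediction) with greedy policy improvement (control) until the policy stops changing.

```python
# Toy MDP: P[(s, a)][s'] = transition probability, R[s] = reward in state s.
actions = ["safe", "risky"]
P = {
    ("s0", "safe"):  {"s0": 1.0},
    ("s0", "risky"): {"s1": 0.8, "s0": 0.2},
    ("s1", "safe"):  {"s1": 1.0},
    ("s1", "risky"): {"s1": 1.0},
}
R = {"s0": 0.0, "s1": 1.0}
gamma, theta = 0.9, 1e-8

def evaluate(policy):
    """Policy evaluation: Bellman Expectation sweeps until the values converge."""
    V = {s: 0.0 for s in R}
    while True:
        delta = 0.0
        for s in V:
            new_v = R[s] + gamma * sum(p * V[s2] for s2, p in P[(s, policy[s])].items())
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:
            return V

policy = {s: "safe" for s in R}        # arbitrary initial policy
while True:
    V = evaluate(policy)               # prediction step
    improved = {s: max(actions, key=lambda a: sum(p * V[s2] for s2, p in P[(s, a)].items()))
                for s in R}            # control step: act greedily w.r.t. V
    if improved == policy:             # stable policy => optimal
        break
    policy = improved

print(policy, V)
```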
Partially Observable MDPs (POMDPs)
- POMDPs handle scenarios where agents do not know the exact state, but only receive incomplete observations about it.
- They are useful for modeling environments with hidden information like Fog of War or card games.
- Policies are based on the belief state (b), which is a probability distribution over possible states given the observation history.
- Belief states evolve based on the observed information and the executed action.
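A standard way to write the belief update (assuming an observation model O(o | s', a) giving the probability of receiving observation o in state s' after action a; this notation is not spelled out in the notes above):

$$b'(s') = \frac{O(o \mid s', a) \sum_{s} P(s' \mid s, a)\, b(s)}{\sum_{s''} O(o \mid s'', a) \sum_{s} P(s'' \mid s, a)\, b(s)}$$

The numerator combines what the action predicts with what the observation reveals; the denominator normalizes the result into a probability distribution.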
Fixed Conditional Plans
- A set of conditional plans describes sequences of possible observations and actions.
- They offer a method to define policies based on observations rather than the whole state.
- Conditional plans are often used to compute the value function in POMDPs.
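In the standard formulation (often called the α-vector view; the symbol α_p is introduced here only for illustration), each conditional plan p has a value α_p(s) when executed from state s, and the value of a belief b is obtained by taking the best plan in expectation:

$$V(b) = \max_{p} \sum_{s} b(s)\, \alpha_p(s)$$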