Lecture Notes on Artificial Intelligence for Games (Summer 2024) PDF

Document Details


Uploaded by ThrilledDenver

LMU München

2024

Matthias Schubert

Tags

artificial intelligence, game development, AI for games, planning algorithms

Summary

These lecture notes cover artificial intelligence concepts for game development, focusing on non-deterministic planning and Markov decision processes. The notes include examples like the goat, wolf, and cabbage problem, and algorithms like value iteration and policy iteration. The material is suitable for an undergraduate-level course.

Full Transcript


Lecture Notes on Artificial Intelligence for Games, Summer Semester 2024
Non-Deterministic Planning
Script © 2020 Matthias Schubert

Outline
- Non-Deterministic Problems
- Markov Reward Processes and Markov Decision Processes
- Policy Evaluation
- Optimizing Policies
  – Value Iteration
  – Policy Iteration
  – Asynchronous Dynamic Programming
- Partially Observable MDPs
- Factored MDPs
- Approximation via Replanning

Recap: the Goat, Wolf, and Cabbage Example
[Figure: the state graph over tuples (farmer, wolf, goat, cabbage) ∈ {0,1}⁴, marking for each the river bank it is on, from (0,0,0,0) to (1,1,1,1).]
You have a goat, a wolf, and a cabbage that you want to transport across a river in a small boat. The boat can only carry you and one item. If the wolf is alone with the goat, the wolf eats the goat. If the goat is alone with the cabbage, the goat eats the cabbage. Planning finds the best solution, but what if...

Expecting the Unexpected
- the goat does not like the cabbage
- the wolf does not seem to be hungry and leaves the goat alone
- the boat can sink on each trip, and you might drown
⇒ transitions lead to uncertain future states
⇒ t(s,a) becomes a stochastic function T: P(s'|s,a) for all s, s' ∈ S, a ∈ A
Implications:
- we do not know the future state after performing action a
- the reward R(s) / cost C(s) becomes stochastic
- when searching for a terminal state, there might be no guaranteed path leading to the target (the boat can always sink)
⇒ We need an action for every situation we might encounter, not just for the states on a single path to the goal.

Markov Reward Process
A Markov Reward Process is a tuple ⟨S, P, R, γ⟩:
- S is a finite set of states
- P is a state transition probability matrix: P_{s,s'} = Pr[S_{t+1} = s' | S_t = s]
- R is a reward function: R_s = E[R_t | S_t = s]
- γ is a discount factor, γ ∈ [0,1]
⇒ describes the likelihood of episodes including rewards
⇒ assumes a fixed action for each state s

Markov Decision Process
A Markov Decision Process is a tuple ⟨S, A, P, R, γ⟩:
- S is a finite set of states
- A is a finite set of actions
- P is a state transition probability matrix: P^a_{s,s'} = Pr[S_{t+1} = s' | S_t = s, A_t = a]
- R is a reward function: R_s = E[R_t | S_t = s]
- γ is a discount factor, γ ∈ [0,1]
⇒ explicitly considers actions and their impact
⇒ allows computing the impact of applying different policies in the same environment

Simple Example: Frozen Lake Grid World
[Figure: a 3×4 grid world with states (row, column), an impassable cell at (1,1), and terminal rewards +1 at (2,3) and −1 at (1,3); the transition function P(S'|S,a) moves in the intended direction with probability 80% and to each perpendicular direction with 10%; a second panel shows a policy as one action per state.]
The example defines the states S and rewards R(S), the actions, the transition function P(S'|S,a), and a policy.
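To make the ⟨S, A, P, R, γ⟩ formalization concrete, here is a minimal Python sketch of the grid world above. The 80%/10%/10% transition split, the +1/−1 rewards, and the blocked cell follow the slide; the state encoding, the concrete discount factor value, and all names are illustrative assumptions.

# Minimal sketch of the Frozen Lake grid world as an MDP (S, A, P, R, gamma).
# The 80%/10%/10% transition split and the +1/-1 terminal rewards follow the
# slide; names and the state encoding are illustrative choices.

ROWS, COLS = 3, 4
WALL = (1, 1)                      # impassable cell (missing in the slide's grid)
STATES = [(r, c) for r in range(ROWS) for c in range(COLS) if (r, c) != WALL]
TERMINAL = {(2, 3): +1.0, (1, 3): -1.0}   # goal and pit
ACTIONS = {"up": (1, 0), "down": (-1, 0), "left": (0, -1), "right": (0, 1)}
GAMMA = 0.9                        # discount factor (an assumed value, not on the slide)

def reward(s):
    return TERMINAL.get(s, 0.0)    # R(s): +1 / -1 in terminal cells, 0 elsewhere

def move(s, delta):
    """Deterministic move; bumping into the wall or the border keeps the state."""
    r, c = s[0] + delta[0], s[1] + delta[1]
    nxt = (r, c)
    if nxt == WALL or not (0 <= r < ROWS and 0 <= c < COLS):
        return s
    return nxt

def transition(s, a):
    """P(s'|s,a): 80% intended direction, 10% each perpendicular direction."""
    if s in TERMINAL:              # terminal states are absorbing
        return {s: 1.0}
    dr, dc = ACTIONS[a]
    intended = (dr, dc)
    side1, side2 = (dc, dr), (-dc, -dr)   # the two perpendicular directions
    probs = {}
    for delta, p in [(intended, 0.8), (side1, 0.1), (side2, 0.1)]:
        nxt = move(s, delta)
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs

print(transition((0, 0), "up"))    # {(1, 0): 0.8, (0, 1): 0.1, (0, 0): 0.1}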
Finite and Infinite Horizon MDPs
When does the process stop?
Finite horizon: the process terminates after n steps.
- γ = 1 is allowed ⇒ all future rewards/costs are weighted equally
- but: policies become non-stationary, π(s) → π_t(s)
Infinite horizon: the process potentially goes on forever, or until a terminal state is reached.
- termination might depend on the policy (proper policy: eventually reaches the goal at some future time)
- γ < 1 is usually required:
  – future rewards/costs are weighted less
  – the reward of an infinite random walk converges to a fixed value
  – human behaviour favours immediate rewards
  – immediate rewards earn more interest in financial applications
- ergodic Markov processes optimize the average reward for γ = 1

Evaluating Policies
We can compute the (discounted) reward of an episode, but how can we compute the reward of following a policy?
⇒ for a policy π, a variety of different episodes and rewards is possible
Estimate the expected reward when following π. We define the value function (or utility) of state s under policy π as:
v^π(s) = E[ Σ_{t=0}^{∞} γ^t R(s_t) | s_0 = s, π ]
Correspondingly, we define the action-value function:
q^π(s,a) = E[ Σ_{t=0}^{∞} γ^t R(s_t) | s_0 = s, A_0 = a, π ]
⇒ How can we compute v^π(s) and q^π(s,a)?

Bellman's Equations
What is the reward of following π? (prediction)
Value function v^π(s): the expected reward when following π from state s, w.r.t. some 0 ≤ γ ≤ 1:
v^π(s) = E[ Σ_{t=0}^{∞} γ^t R(s_t) | s_0 = s, π ]
       = γ^0 R(s) + E[ Σ_{t=1}^{∞} γ^t R(s_t) | s_0 = s, π ]
       = R(s) + γ Σ_{s'∈S} P(s'|s,π(s)) v^π(s')   (Bellman equation)
What is the optimal policy π*? (control)
Bellman optimality equation:
v^{π*}(s) = R(s) + γ max_{a∈A} Σ_{s'∈S} P(s'|s,a) v^{π*}(s')

Policy Evaluation
Policy evaluation is much simpler than solving the Bellman optimality equation (cf. the Markov Reward Process):
v^π(s) = R(s) + γ Σ_{s'∈S} P(s'|s,π(s)) v^π(s')
Note that the non-linear "max" operator is not present. We can solve this with standard algorithms for systems of linear equations (in operations research, most such applications are solved as linear programs). For large state spaces, solving the linear equation system takes O(n³). There, applying a simplified Bellman update k times can be more efficient:
v_{k+1}(s) ← R(s) + γ Σ_{s'} P(s'|s,π(s)) v_k(s')

Finding Optimal Policies: Policy Iteration
We are looking for the optimal policy. The exact state values are not relevant if one action is clearly the optimal one.
Idea: alternate between
- Policy evaluation: given a policy, calculate the corresponding state values.
- Policy improvement: calculate the policy given the values (greedy policy selection):
  π(s) = argmax_{a∈A} Σ_{s'∈S} P(s'|s,a) v(s')

Policy Iteration
repeat
  v ← PolicyEvaluation(π, v, mdp)
  unchanged? ← true
  for each state s in S do
    if max_{a∈A} Σ_{s'} P(s'|s,a) v[s'] > Σ_{s'} P(s'|s,π(s)) v[s'] then
      π(s) ← argmax_{a∈A} Σ_{s'} P(s'|s,a) v[s']
      unchanged? ← false
until unchanged?
return π

Value Iteration
If we use Bellman updates anyway, we can join both steps and update with the Bellman optimality equation directly:
v*(s) = R(s) + γ max_{a∈A} Σ_{s'∈S} P(s'|s,a) v*(s')
This is a non-linear system of equations. Use dynamic programming: compute the utility value of each state from the current utility estimate and repeat until the values converge to v*. Convergence can be proved via contraction (cf. S. Russell, P. Norvig: Artificial Intelligence: A Modern Approach, Pearson (2016), p. 654 for the proof).

Value Iteration
repeat
  v ← v'
  δ ← 0
  for each state s in S do
    v'[s] ← R(s) + γ max_{a∈A} Σ_{s'} P(s'|s,a) v[s']
    if |v'[s] − v[s]| > δ then δ ← |v'[s] − v[s]|
until δ < ε(1−γ)/γ
return v
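The value iteration pseudocode above translates almost line by line into code. The following is a minimal Python sketch over an MDP given as plain dictionaries; the stopping threshold ε(1−γ)/γ follows the pseudocode, while the function names and the tiny stand-in MDP are illustrative assumptions.

# Generic value iteration, following the pseudocode above: repeated Bellman
# optimality updates v'(s) = R(s) + gamma * max_a sum_s' P(s'|s,a) v(s')
# until the largest change drops below eps*(1-gamma)/gamma.

def value_iteration(states, actions, P, R, gamma, eps=1e-6):
    """P[s][a] is a dict {s': prob}, R[s] a scalar reward."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        v_new = {}
        for s in states:
            best = max(sum(p * v[s2] for s2, p in P[s][a].items()) for a in actions)
            v_new[s] = R[s] + gamma * best
            delta = max(delta, abs(v_new[s] - v[s]))
        v = v_new
        if delta < eps * (1 - gamma) / gamma:
            return v

def greedy_policy(states, actions, P, v):
    """Greedy policy extraction: pi(s) = argmax_a sum_s' P(s'|s,a) v(s')."""
    return {s: max(actions, key=lambda a: sum(p * v[s2] for s2, p in P[s][a].items()))
            for s in states}

# Stand-in MDP: two states, 'go' switches the state with prob 0.9, 'stay' keeps it with 0.9.
states, actions = ["s1", "s2"], ["stay", "go"]
R = {"s1": 0.0, "s2": 1.0}
P = {"s1": {"stay": {"s1": 0.9, "s2": 0.1}, "go": {"s1": 0.1, "s2": 0.9}},
     "s2": {"stay": {"s2": 0.9, "s1": 0.1}, "go": {"s2": 0.1, "s1": 0.9}}}
v = value_iteration(states, actions, P, R, gamma=0.9)
print(v, greedy_policy(states, actions, P, v))

Policy iteration (previous slides) would instead evaluate a fixed policy, either by solving the linear system or by repeated Bellman updates, and then apply the same greedy improvement step until the policy no longer changes.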
Summary
- Synchronous dynamic programming algorithms are based on the state-value function V(s).
- Policy iteration (prediction & control): alternates prediction (Bellman expectation equation) and greedy policy improvement.
- Value iteration (control): updates utility and policy based on the Bellman optimality equation.
- Complexity: O(m·n²) per iteration for m actions and n states (with Bellman updates).

Further Variants of MDPs
- Countably infinite state and/or action spaces ⇒ can be handled similarly to ordinary MDPs
- Continuous state and/or action spaces ⇒ closed form for the linear-quadratic model (LQR)
- Continuous time:
  – requires partial differential equations
  – Hamilton-Jacobi-Bellman (HJB) equation
  – limiting case of the Bellman equation (Δt → 0)

Partial Observability
Usually agents do not see the game state itself but only observations of it (e.g. fog of war, Poker, Blackjack, ...).
- we still assume total knowledge of the MDP ⟨S, A, T, R, γ⟩
- at each step the agent receives an observation o_t instead of s_t
- the observation o_t depends only on the state s_t
- the observation space O might be completely different from S
- observations are usually not Markov:
  – the sequence of all observations o_1, o_2, ..., o_t is potentially useful
  – o_1, o_2, ..., o_t only allows estimating the true state
⇒ policies must be based on o_1, o_2, ..., o_t rather than on s

Partially Observable Markov Decision Process
A Partially Observable Markov Decision Process (POMDP) is an MDP with hidden states (cf. Hidden Markov Models). A POMDP is a tuple ⟨S, A, O, P, R, Z, γ⟩:
- S is a finite set of states
- A is a finite set of actions
- O is a finite set of observations
- P is a state transition probability matrix: P^a_{s,s'} = Pr[S_{t+1} = s' | S_t = s, A_t = a]
- R is a reward function: R_s = E[R_{t+1} | S_t = s]
- Z is an observation function: Z^a_{s',o} = P[O_{t+1} = o | S_{t+1} = s', A_t = a]
- γ is a discount factor, γ ∈ [0,1]
⇒ we only observe o, not s
⇒ Z describes the likelihood of observation o when taking action a and reaching state s'

Belief States
Assume the history h_t = O_1, R_1, A_1, O_2, R_2, A_2, ..., A_{t−1}, O_t, R_t.
A belief state b(h) is a probability distribution over states, conditioned on the history h:
b(h) = ( P[S_t = s_1 | H_t = h_t], ..., P[S_t = s_n | H_t = h_t] )
- belief states summarize all information relevant for the policy
- the next belief state can be computed from the previous belief state b, the action a, and the observation o (up to normalization by P(o|a,b)):
  b'(s') = Z^a_{s',o} Σ_{s∈S} P^a_{s,s'} b(s)
- a POMDP can be defined as an MDP over belief states:
  – belief states are continuous distributions over S
  – policies for a POMDP are functions over belief states

MDP Formulation of a POMDP
The POMDP ⟨S, A, O, P, R, Z, γ⟩ is equivalent to the following MDP:
- states: the set B of distributions over S
- transition function: P(b'|a,b) = Σ_{o∈O} P(b'|o,a,b) P(o|a,b), where P(o|a,b) = Σ_{s'∈S} Z^a_{s',o} Σ_{s∈S} P^a_{s,s'} b(s)
- rewards: ρ(b) = Σ_{s∈S} R_s b(s)
- the actions A and the discount factor γ are the same as in the ordinary MDP
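The belief update and the observation probability P(o|a,b) from the two slides above fit in a few lines. Below is a minimal Python sketch over a POMDP given as dictionaries; the normalization by P(o|a,b) is spelled out explicitly, and the small two-state example at the end is a purely hypothetical stand-in.

# Belief update for a POMDP: b'(s') is proportional to Z[a][s'][o] * sum_s P[a][s][s'] * b(s),
# normalized by P(o|a,b) = sum_s' Z[a][s'][o] * sum_s P[a][s][s'] * b(s).
# P, Z and the belief b are plain dictionaries; all names are illustrative.

def belief_update(b, a, o, states, P, Z):
    """b: {s: prob}, P[a][s]: {s': prob}, Z[a][s']: {o: prob}."""
    unnormalized = {}
    for s_next in states:
        pred = sum(P[a][s].get(s_next, 0.0) * b[s] for s in states)  # prediction step
        unnormalized[s_next] = Z[a][s_next].get(o, 0.0) * pred        # correction step
    p_o = sum(unnormalized.values())                                  # P(o | a, b)
    if p_o == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return {s: w / p_o for s, w in unnormalized.items()}

# Hypothetical two-state example: a noisy sensor that reports the true state with 0.8.
states = ["A", "B"]
P = {"act": {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.3, "B": 0.7}}}
Z = {"act": {"A": {"obs_A": 0.8, "obs_B": 0.2}, "B": {"obs_A": 0.2, "obs_B": 0.8}}}
b = {"A": 0.5, "B": 0.5}
print(belief_update(b, "act", "obs_A", states, P, Z))   # belief shifts toward A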
Optimal Policies for POMDPs
- computing the value function would allow value iteration
- finding an optimal policy π(b) for a POMDP requires solving a continuous-state MDP (potentially infinitely many states)
- problem: in the fully observable setting, a policy is a table with an entry for every state
⇒ we need another way to express the policy
⇒ an action depends on the belief state b
⇒ the belief state b depends on the history of observations
⇒ we can formulate a policy over the discrete observations
⇒ fixed conditional plans span all action-observation sequences of length d

Fixed Conditional Plans
- a fixed conditional plan for an action a_0 describes all sequences of observations and future actions, where the actions are conditional on the observations
- it is a tree describing a complete policy for the next d steps
- we can use conditional plans to compute the value function of d-horizon problems
- for the set P of all conditional plans of depth d, |P| = |A|^((|O|^d − 1)/(|O| − 1)) (the number of potential policies is exponential in the number of actions and observations)
- the value function of following plan p ∈ P in state s is denoted α_p(s), and v_p(b) = Σ_{s∈S} b(s) α_p(s) = b · α_p, where b and α_p are vectors of dimensionality |S|
[Figure: a conditional plan as a tree: a root action a_1, one branch per observation o_1, o_2 leading to the next action, and so on down to depth d.]

Example of a Simple POMDP
Consider a two-state MDP: S = {s_1, s_2}, R(s_1) = 0, R(s_2) = 1.
Actions STAY and GO succeed with probability 0.9; γ = 1.
Observations O = S with Z_{s_i,s_i} = 0.6 (the correct state is observed with probability 0.6).
[Figure: transition diagrams of the two actions: GO switches the state with probability 0.9 and keeps it with 0.1; STAY keeps the state with probability 0.9 and switches it with 0.1.]

Example: Evaluating Finite-Horizon Policies
For one step:
α_[STAY](s_1) = R(s_1) + γ (0.9 R(s_1) + 0.1 R(s_2)) = 0.1
α_[STAY](s_2) = R(s_2) + γ (0.1 R(s_1) + 0.9 R(s_2)) = 1.9
[Figure: plot of v(b) over the belief space for the one-step plans.]
…
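As a small illustration of α-vectors, the following Python sketch evaluates the one-step plans of the two-state example above: α_p(s) = R(s) + γ Σ_{s'} P(s'|s, a_p) R(s') and v_p(b) = b · α_p. The rewards, success probabilities, and γ are taken from the slide; the code layout and names are illustrative, and the GO values are simply computed from the same transition model.

# One-step conditional plans for the two-state example: S = {s1, s2},
# R(s1) = 0, R(s2) = 1, STAY/GO succeed with probability 0.9, gamma = 1.
# alpha_p(s) = R(s) + gamma * sum_s' P(s'|s, a_p) * R(s'), v_p(b) = b . alpha_p.

R = {"s1": 0.0, "s2": 1.0}
GAMMA = 1.0
P = {"STAY": {"s1": {"s1": 0.9, "s2": 0.1}, "s2": {"s1": 0.1, "s2": 0.9}},
     "GO":   {"s1": {"s1": 0.1, "s2": 0.9}, "s2": {"s1": 0.9, "s2": 0.1}}}

def alpha(action):
    """Alpha vector of the one-step plan that just executes `action`."""
    return {s: R[s] + GAMMA * sum(p * R[s2] for s2, p in P[action][s].items())
            for s in R}

def value(belief, alpha_vec):
    """v_p(b) = sum_s b(s) * alpha_p(s)."""
    return sum(belief[s] * alpha_vec[s] for s in belief)

a_stay = alpha("STAY")          # {'s1': 0.1, 's2': 1.9}, matching the slide
a_go = alpha("GO")              # follows from the same transition model
b = {"s1": 0.5, "s2": 0.5}      # an example belief
print(a_stay, a_go, max(value(b, a_stay), value(b, a_go)))

The best one-step plan for a given belief is then the one with the largest b · α_p.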
