CPSC 422: Intelligent Systems - Reinforcement Learning PDF
Document Details

University of British Columbia
Jordon Johnson
Summary
This document outlines the core concepts of Reinforcement Learning within the context of CPSC 422. It covers crucial topics such as Q-learning, SARSA, exploration versus exploitation strategies, and the application of temporal difference methods. This introduction to Reinforcement Learning includes practical examples.
Full Transcript
Slide 1: Title
CPSC 422: Intelligent Systems, Unit 04: Reinforcement Learning
Jordon Johnson, adapted from slides by Giuseppe Carenini

Slide 2: Learning Goals
- Compare and contrast scenarios in which reinforcement learning (RL) and value iteration would be appropriate
- Describe and criticize search-based approaches to RL
- Motivate Q-learning
- Derive and justify estimates by temporal differences
- Explain, trace and implement Q-learning
- Describe and compare techniques for combining exploration with exploitation
- Describe on-policy learning (SARSA) and compare it to off-policy learning (Q-learning)

Slide 3: Big Picture – CPSC 422
[Course-map figure organized by Representation, Reasoning Technique and Environment (deterministic, stochastic, hybrid): first-order logics, ontologies, SAT and full resolution; Bayesian (belief) networks, Markov chains and HMMs, Markov networks and conditional random fields, with exact and approximate inference (forward, Viterbi, Gibbs sampling); probabilistic context-free grammars, probabilistic relational models and Markov logics (StarAI); and Markov Decision Processes (MDPs) and Partially Observable MDPs (POMDPs), with planning, value iteration and reinforcement learning.]

Slide 4: Aside: Map of RL algorithms
[Figure: a map of RL algorithms; boxes with thick lines denote different categories, the others denote specific algorithms.]

Slide 5: Unit Overview
- Q-learning: motivation and derivation; example
- Exploration vs. exploitation
- On-policy learning (SARSA)

Slide 6: Unit Overview
(Same outline as the previous slide.)

Slide 7: MDPs and Reinforcement Learning (RL)
- Markov Decision Process: a set of states S, a set of actions A, transition probabilities to next states P(s'|s,a), and a reward function R(s), R(s,a) or R(s,a,s')
- RL is based on MDPs, but the transition model is not known and the reward model is not known
- For MDPs we have seen how to compute an optimal policy; RL learns an optimal policy

Slide 8: Search-based Approaches
- Policy search (stochastic local search): start with an arbitrary policy, evaluate the policy, generate some neighbours, repeat
- Problems:
  - The policy space can be huge: with n states and m actions there are m^n policies
  - Policies are evaluated as a whole, so local behaviours (good or bad) cannot be taken into account directly

Slide 9: Q-Learning
- Unlike search-based approaches, Q-learning learns after every action
- Q-learning learns components of a policy, rather than the policy itself
- Recall: Q(s,a) is the expected value of doing action a in state s and then following the optimal policy:
  Q(s,a) = R(s) + γ Σ_{s'} P(s'|s,a) V*(s')

Slide 10: Q-values
A table of Q-values:

  Q      s0         s1         ...  sk
  a0     Q[s0,a0]   Q[s1,a0]   ...  Q[sk,a0]
  a1     Q[s0,a1]   Q[s1,a1]   ...  Q[sk,a1]
  ...    ...        ...        ...  ...
  an     Q[s0,an]   Q[s1,an]   ...  Q[sk,an]

Clicker question: if the agent had the complete Q-function (i.e. all of the correct Q-values), would it know how to act in every state?
  A. Yes  B. No  C. I've got a bad feeling about this...

Slide 11: Q-values
- Once the agent has a complete Q-function, it knows how to act in every state
- How does it know which action to take? How does it learn the Q-values?

Slide 12: Q-Values
- The quantities Q(s,a) are known as Q-values: Q(s,a) = R(s) + γ Σ_{s'} P(s'|s,a) V*(s')
- Q-values are related to the expected utility of state s as follows: V*(s) = max_a Q(s,a)
- Combining the two, we obtain a relationship between the Q-value in state s and those of the states reachable from s by performing action a:
  Q(s,a) = R(s) + γ Σ_{s'} P(s'|s,a) max_{a'} Q(s',a')
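If the transition and reward models were known, this relation could be iterated directly, much like value iteration. A minimal sketch (not from the slides; the function name, array layout and the discount γ = 0.9 are illustrative assumptions):

```python
import numpy as np

def q_from_known_model(P, R, gamma=0.9, iters=1000):
    """Iterate Q(s,a) = R(s) + gamma * sum_s' P(s'|s,a) * max_a' Q(s',a').

    P: array of shape (S, A, S) with transition probabilities P(s'|s,a).
    R: array of shape (S,) with state rewards R(s).
    This requires knowing P and R, which an RL agent does not have.
    """
    num_s, num_a, _ = P.shape
    Q = np.zeros((num_s, num_a))
    for _ in range(iters):
        V = Q.max(axis=1)                  # V*(s) = max_a Q(s,a)
        Q = R[:, None] + gamma * P.dot(V)  # one Bellman-style sweep over all (s,a)
    return Q
```

The next slides ask whether this relation can still be exploited when P and R are not available.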
Slide 13: Learning Q-Values
Clicker question: can we exploit this relation between Q-values in "adjacent" states,
  Q(s,a) = R(s) + γ Σ_{s'} P(s'|s,a) max_{a'} Q(s',a') ?
  A. Yes  B. No  C. The cake is a lie
Hint: what is known (and not known) in this scenario?

Slide 14: Learning Q-Values
- We obtained this relation between Q-values in "adjacent" states: Q(s,a) = R(s) + γ Σ_{s'} P(s'|s,a) max_{a'} Q(s',a')
- However, we don't know the transition probabilities P(s'|s,a) or the reward function R(s)
- We'll use a different approach, based on the notion of temporal difference (TD)

Slide 15: Average Through Time
- Background: suppose we have a sequence of values (sample data) v1, v2, ..., vk, and we want a running approximation of their expected value
- e.g. given a sequence of midterm grades, estimate the expected value of the next grade (not a very good example, since the values are measurements of the same unknown quantity)
- A reasonable estimate is the average of the first k values: A_k = (v1 + v2 + ... + vk) / k

Slide 16: Average Through Time
- Multiplying both sides by k: k A_k = v1 + v2 + ... + v_{k-1} + vk
- The same statement for k-1: (k-1) A_{k-1} = v1 + v2 + ... + v_{k-1}
- Substituting this back into the equation for k gives: k A_k = (k-1) A_{k-1} + vk

Slide 17: Average Through Time
- Dividing by k again: A_k = (1 - 1/k) A_{k-1} + vk / k
- Setting α_k = 1/k: A_k = (1 - α_k) A_{k-1} + α_k vk
- A bit more algebra gives: A_k = A_{k-1} + α_k (vk - A_{k-1})

Slide 18: Estimate by Temporal Differences
- A_k = A_{k-1} + α_k (vk - A_{k-1})
- (vk - A_{k-1}) is called a temporal difference error (TD-error); it specifies how different the new value vk is from the prediction given by the previous running average A_{k-1}
- The new estimate (the new average) is obtained by updating the previous average by α_k times the TD-error

Slide 19: Q-Learning: General Idea
- Learn from the history of interactions with the environment: we will have a sequence of states, actions and rewards
- The agent's history is seen as a sequence of experiences <s, a, r, s'>: the agent is in state s, performs action a, and receives reward r as it ends up in state s'
- Each experience is used to estimate a new value for Q(s,a), expressed as Q(s,a) = r + γ max_{a'} Q[s',a']

Slide 20: Q-Learning: General Idea
- However, Q(s,a) = r + γ max_{a'} Q[s',a'] is an approximation; the real relation between Q(s,a) and Q(s',a'), as we found earlier, is Q(s,a) = R(s) + γ Σ_{s'} P(s'|s,a) max_{a'} Q(s',a')

Slide 21: Q-Learning: Main Steps
- Goal: store Q[S,A] for every state S and action A in the world
- Start with arbitrary estimates Q(0)[S,A]
- Update the Q-values by using experiences: each experience <s, a, r, s'> provides one new data point on the actual value of Q[s,a], namely r + γ max_{a'} Q[s',a'], where s' is the state the agent arrives at in the current experience and Q[s',a'] is its current estimated value

Slide 22: Q-Learning: Update Step
- Now we can use the temporal difference formula from earlier, A_k = A_{k-1} + α_k (vk - A_{k-1}), applied to our estimate and new data point for Q[s,a]:
  Q[s,a] ← Q[s,a] + α_k ((r + γ max_{a'} Q[s',a']) - Q[s,a])
- On the right-hand side, Q[s,a] is the previous estimated value and (r + γ max_{a'} Q[s',a']) is the new data point; the left-hand side is the updated estimate

Slide 23: Q-Learning: Algorithm
[Figure: pseudocode of the Q-learning algorithm; not captured in this transcript.]

Slide 24: Updating Q-values and Maintaining α_k
- Two tables indexed by state and action are kept: Q[S,A] holds the current estimates, and K[S,A] holds the number of updates made to each (s,a) pair, which is used to set α_k = 1/k
- Each experience updates one entry of both tables via
  Q[s,a] ← Q[s,a] + α_k ((r + γ max_{a'} Q[s',a']) - Q[s,a])
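A minimal Python sketch of this update step (it is not the slides' pseudocode, which appears only as a figure; the environment interface and the choose_action helper in the comments are hypothetical):

```python
from collections import defaultdict

def q_learning_update(Q, K, s, a, r, s_next, actions, gamma=0.9):
    """Apply one Q-learning update from a single experience (s, a, r, s').

    Q: maps (state, action) to the current estimate (the Q[S,A] table).
    K: maps (state, action) to the number of updates so far (the K[S,A] table),
       used here to set the variable learning rate alpha_k = 1/k.
    """
    K[(s, a)] += 1
    alpha = 1.0 / K[(s, a)]
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    td_error = (r + gamma * best_next) - Q[(s, a)]  # new data point minus old estimate
    Q[(s, a)] += alpha * td_error

# Typical use inside an agent loop (hypothetical environment interface):
#   Q, K = defaultdict(float), defaultdict(int)
#   s = env.reset()
#   while learning:
#       a = choose_action(Q, s)          # e.g. epsilon-greedy; see the exploration section
#       s_next, r = env.step(a)
#       q_learning_update(Q, K, s, a, r, s_next, actions)
#       s = s_next
```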
Slide 25: Unit Overview
- Q-learning: motivation and derivation; example
- Exploration vs. exploitation
- On-policy learning (SARSA)

Slide 26: Example
[Grid-world figure: the six states, with rewards marked on the walls: -1 on most walls, -100 on a wall next to s2, and +10 next to s4. The same figure is repeated on the following slides.]
- Six possible states {s0, ..., s5}
- Four actions:
  - UpCareful: moves one tile up (unless there is a wall); generates a penalty of -1
  - Left: moves one tile left unless there is a wall, in which case it stays in the same tile if in s0 or s2, and is sent to s0 if in s4
  - Right: moves one tile right (unless there is a wall)
  - Up: goes up with probability 0.8 (unless there is a wall); behaves like Left with probability 0.1; behaves like Right with probability 0.1
- Reward model: -1 for taking the UpCareful action; reward when hitting a wall, as marked on the picture

Slide 27: Example
- The agent knows about the 6 states and 4 actions
- The agent can perform an action and fully observe its state and the reward it gets
- The agent does not know how the states are configured, nor what the actions do: no transition model and no reward model

Slide 28: Example (using variable α_k)
- Suppose that in this simple world the agent has the following sequence of experiences:
  s0, right, 0, s1, upCare, -1, s3, upCare, -1, s5, left, 0, s4, left, 10, s0
- Suppose it repeats this sequence k times (not good behaviour for a Q-learning agent, but good for demonstrating the process)
- The table shows the first 3 "loops" of Q-learning when Q[s,a] is initialized to 0 for all s and a, with α_k = 1/k and γ = 0.9:

  loop | Q[s0,right] | Q[s1,upCareful] | Q[s3,upCareful] | Q[s5,left] | Q[s4,left]
     1 |           0 |              -1 |              -1 |          0 |         10
     2 |           0 |              -1 |              -1 |        4.5 |         10
     3 |           0 |              -1 |            0.35 |        6.0 |         10

Slide 29: Example (using variable α_k), 1st loop (k = 1 in each case)
Update rule: Q[s,a] ← Q[s,a] + α_k ((r + γ max_{a'} Q[s',a']) - Q[s,a]), with γ = 0.9; the Q-table is all zeros at the start.
- Q[s0,right]  ← 0 + 1*((0 + 0.9*0) - 0) = 0
- Q[s1,upCare] ← 0 + 1*((-1 + 0.9*0) - 0) = -1
- Q[s3,upCare] ← 0 + 1*((-1 + 0.9*0) - 0) = -1
- Q[s5,left]   ← 0 + 1*((0 + 0.9*0) - 0) = 0
- Q[s4,left]   ← 0 + 1*((10 + 0.9*0) - 0) = 10
Only immediate rewards are included in the updates in this first pass.

Slide 30: Example (using variable α_k), 2nd loop (k = 2 in each case)
Q-table at the start of the loop: Q[s1,upCare] = -1, Q[s3,upCare] = -1, Q[s4,left] = 10; all other entries 0.
- Q[s0,right]  ← 0 + 1/2*((0 + 0.9*0) - 0) = 0
- Q[s1,upCare] ← -1 + 1/2*((-1 + 0.9*0) - (-1)) = -1
- Q[s3,upCare] ← -1 + 1/2*((-1 + 0.9*0) - (-1)) = -1
- Q[s5,left]   ← 0 + 1/2*((0 + 0.9*10) - 0) = 4.5   (one step back from the positive reward in s4)
- Q[s4,left]   ← 10 + 1/2*((10 + 0.9*0) - 10) = 10
Slide 31: Example (using variable α_k), 3rd loop (k = 3 in each case)
Q-table at the start of the loop: Q[s1,upCare] = -1, Q[s3,upCare] = -1, Q[s4,left] = 10, Q[s5,left] = 4.5; all other entries 0.
- Q[s0,right]  ← 0 + 1/3*((0 + 0.9*0) - 0) = 0
- Q[s1,upCare] ← -1 + 1/3*((-1 + 0.9*0) - (-1)) = -1
- Q[s3,upCare] ← -1 + 1/3*((-1 + 0.9*4.5) - (-1)) = 0.35   (the effect of the positive reward in s4 is felt two steps earlier at this iteration)
- Q[s5,left]   ← 4.5 + 1/3*((0 + 0.9*10) - 4.5) = 6
- Q[s4,left]   ← 10 + 1/3*((10 + 0.9*0) - 10) = 10

Slide 32: Example (using variable α_k)
- As the number of iterations increases, the effect of the positive reward obtained by moving left in s4 trickles further back in the sequence of steps
- Q[s4,left] starts changing only after the effect of the reward has reached s0 (i.e. after iteration 10 in the table shown on the slide)

Slide 33: Example (using fixed α = 1), k = 2 (the 1st loop is unchanged)
Q-table at the start of the loop: Q[s1,upCare] = -1, Q[s3,upCare] = -1, Q[s4,left] = 10; all other entries 0.
- Q[s0,right]  ← 0 + 1*((0 + 0.9*0) - 0) = 0
- Q[s1,upCare] ← -1 + 1*((-1 + 0.9*0) - (-1)) = -1
- Q[s3,upCare] ← -1 + 1*((-1 + 0.9*0) - (-1)) = -1
- Q[s5,left]   ← 0 + 1*((0 + 0.9*10) - 0) = 9   (the new "data point" is given much more weight than with α_k = 1/k)
- Q[s4,left]   ← 10 + 1*((10 + 0.9*0) - 10) = 10

Slide 34: Example (using fixed α = 1), k = 3
Q-table at the start of the loop: Q[s1,upCare] = -1, Q[s3,upCare] = -1, Q[s4,left] = 10, Q[s5,left] = 9; all other entries 0.
- Q[s0,right]  ← 0 + 1*((0 + 0.9*0) - 0) = 0
- Q[s1,upCare] ← -1 + 1*((-1 + 0.9*0) - (-1)) = -1
- Q[s3,upCare] ← -1 + 1*((-1 + 0.9*9) - (-1)) = 7.1   (again, the new "data point" is given much more weight)
- Q[s5,left]   ← 9 + 1*((0 + 0.9*10) - 9) = 9
- Q[s4,left]   ← 10 + 1*((10 + 0.9*0) - 10) = 10   (no change from k = 2, since the entire reward from the next step was already included)

Slide 35: Comparing fixed and variable α
- Fixed α generates faster updates: all states see some effect of the positive reward by the 5th iteration, and each update is much larger
- With fixed α the values get very close to their final numbers by iteration 40, while with variable α they are still not there by iteration 107
- However, Q-learning with a fixed α is not guaranteed to converge
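For readers who want to check these traces, here is a short script (illustrative, not from the slides) that replays the fixed experience sequence and reproduces the numbers above for both the variable α_k = 1/k and the fixed α = 1 runs:

```python
from collections import defaultdict

# The fixed experience sequence from the example, as (s, a, r, s') tuples.
EPISODE = [("s0", "right",  0, "s1"),
           ("s1", "upCare", -1, "s3"),
           ("s3", "upCare", -1, "s5"),
           ("s5", "left",    0, "s4"),
           ("s4", "left",   10, "s0")]
ACTIONS = ["upCare", "left", "right", "up"]
GAMMA = 0.9

def replay(loops, fixed_alpha=None):
    """Run Q-learning over the repeated sequence; alpha_k = 1/k unless fixed_alpha is given."""
    Q = defaultdict(float)   # Q[(s, a)], initialized to 0
    K = defaultdict(int)     # per-(s, a) update counts
    for _ in range(loops):
        for s, a, r, s_next in EPISODE:
            K[(s, a)] += 1
            alpha = fixed_alpha if fixed_alpha is not None else 1.0 / K[(s, a)]
            best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
            Q[(s, a)] += alpha * ((r + GAMMA * best_next) - Q[(s, a)])
    return Q

# Variable alpha after 3 loops matches the table above:
#   Q[(s3, upCare)] ~ 0.35, Q[(s5, left)] = 6.0, Q[(s4, left)] = 10.0
print(replay(3)[("s3", "upCare")])
# Fixed alpha = 1 after 3 loops matches the fixed-alpha slides:
#   Q[(s3, upCare)] ~ 7.1, Q[(s5, left)] = 9.0
print(replay(3, fixed_alpha=1.0)[("s3", "upCare")])
```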
Slide 36: Unit Overview
- Q-learning: motivation and derivation; example
- Exploration vs. exploitation
- On-policy learning (SARSA)

Slide 37: Another look at the approximation
- True relation between Q(s,a) and Q(s',a'): Q(s,a) = R(s) + γ Σ_{s'} P(s'|s,a) max_{a'} Q(s',a')
- Q-learning approximation based on experiences: Q[s,a] ← Q[s,a] + α((r + γ max_{a'} Q[s',a']) - Q[s,a])
- Clicker question: for the approximation to work...
  A. There must be a positive reward in most states
  B. Q-learning must try each action an unbounded number of times
  C. The transition model must not be sparse
  D. The Force must be with you... always
  E. You must know where your towel is

Slide 38: Aside: Matrix "sparseness"
- Sparseness of an n x n matrix: the number of zero elements divided by the total number of elements
- For conditional probabilities the maximum sparseness is ...
- Density = (1 - sparseness); the minimum density for conditional probabilities is ...
- Note: in such cases we have deterministic actions!

Slide 39: Why the approximation works
- True relation: Q(s,a) = R(s) + γ Σ_{s'} P(s'|s,a) max_{a'} Q(s',a')
- Q-learning approximation based on experiences: Q[s,a] ← Q[s,a] + α((r + γ max_{a'} Q[s',a']) - Q[s,a])
- The approximation allows us to get around the missing transition and reward models
- Are we in danger of using data from unlikely transitions to make incorrect adjustments? No, as long as Q-learning tries each action an unbounded number of times: then the frequency of the updates reflects the transition model

Slide 40: What does Q-Learning learn?
Clicker question: does the Q-learning algorithm output an optimal policy?
  A. Yes
  B. No, because the output is some other policy
  C. No, because the output is a Q-function (a set of Q[s,a] values)
  D. It depends on how many experiences were used
  E. Tricksy hobbitses and their tricksy riddles, we hates them!

Slide 41: Q-values
[The Q[s,a] table for states s0..sn and actions a0..am, as shown earlier.]

Slide 42: Exploration vs. Exploitation
- Q-learning does not explicitly tell the agent what to do; it computes a Q-function Q[s,a] that lets the agent see, for every state, which action has the highest expected reward
- Given a Q-function, the agent can:
  - Exploit the knowledge accumulated so far, and choose the action that maximizes Q[s,a] in a given state (greedy behaviour)
  - Explore new actions, hoping to improve its estimate of the optimal Q-function, i.e. ignore the current Q[s,a]

Slide 43: Exploration vs. Exploitation
Clicker question: which statements are true?
  1. Not exploring enough may lead to being stuck in a suboptimal course of action
  2. Exploring too much is a waste of the knowledge accumulated via experience
  A. Only (1) is true  B. Only (2) is true  C. Both are true  D. Neither is true  E. The only truth is 42

Slide 44: Exploration vs. Exploitation
- Both are true: not exploring enough may lead to being stuck in a suboptimal course of action, and exploring too much is a waste of the knowledge accumulated via experience
- We need to find the right compromise

Slide 45: Exploration Strategies
- It is not trivial to come up with an optimal exploration policy (a widely studied problem in statistical decision theory)
- Intuitively, any such strategy should be greedy in the limit of infinite exploration (GLIE): try each action an unbounded number of times, and choose the predicted best action in the limit
- We will look at two exploration strategies: ε-greedy and soft-max

Slide 46: ε-greedy
- Choose a random action with probability ε, and choose the best action with probability 1 - ε
- The first GLIE condition (try every action an unbounded number of times) is satisfied by the possibility of random selections
- What about the second condition (choose the predicted best action in the limit)? Reduce ε over time!
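A minimal sketch of ε-greedy selection over the tabular Q used in the earlier sketches; the decay schedule in the comment is one common choice, assumed here rather than prescribed by the slides:

```python
import random

def epsilon_greedy(Q, s, actions, epsilon):
    """With probability epsilon pick a random action (explore);
    otherwise pick the action with the highest current estimate Q[s, a] (exploit)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

# One simple way to satisfy the second GLIE condition is to decay epsilon, e.g.
#   epsilon_t = 1.0 / (t + 1)
# so every action is still tried unboundedly often, but the predicted best
# action is chosen in the limit.
```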
Slide 47: Soft-Max
- Takes into account improvements in the estimates of Q[s,a]
- Choose action a in state s with probability e^{Q[s,a]} / Σ_a e^{Q[s,a]}, based on the current estimate of Q[s,a]
- Example: assume only 3 actions, with Q[s,a1] = 2, Q[s,a2] = 5, Q[s,a3] = 1; what is the probability of selecting each action ai?

Slide 48: (τ controlled) Soft-Max
- Choose action a in state s with probability e^{Q[s,a]/τ} / Σ_a e^{Q[s,a]/τ}
- τ (tau) influences how randomly actions are chosen:
  - If τ is high, the exponentials approach 1, the fraction approaches 1/(number of actions), and each action has roughly the same probability of being chosen (exploration or exploitation?)
  - If τ is low, the exponential with the highest Q[s,a] dominates, and the current best action is nearly always chosen (exploration or exploitation?)

Slide 49: (τ controlled) Soft-Max: Example
- Same 3 actions, with Q[s,a1] = 2, Q[s,a2] = 5, Q[s,a3] = 1
- [Figure: the probability of selecting each action ai under e^{Q[s,a]/τ} / Σ_a e^{Q[s,a]/τ} for τ = 1, τ = 100 and τ = 0.5.]

Slide 50: (τ controlled) Soft-Max
Clicker question: if τ is high (much higher than the Q[s,a] values), the agent...
  A. will mainly exploit
  B. will mainly explore
  C. will be equally likely to explore and exploit
  D. will do nothing
  E. will give up and go back to playing Minecraft
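A minimal sketch of τ-controlled soft-max selection (illustrative names; the probability calculation follows the formula above):

```python
import math
import random

def softmax_action(Q, s, actions, tau=1.0):
    """Choose action a with probability exp(Q[s,a]/tau) / sum_a' exp(Q[s,a']/tau)."""
    prefs = [math.exp(Q[(s, a)] / tau) for a in actions]
    total = sum(prefs)
    return random.choices(actions, weights=[p / total for p in prefs], k=1)[0]

# For the example above (Q[s,a1]=2, Q[s,a2]=5, Q[s,a3]=1):
#   tau = 100 gives nearly uniform probabilities (mostly exploration);
#   tau = 0.5 picks a2 almost always (mostly exploitation).
```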
Slide 51: Unit Overview
- Q-learning: motivation and derivation; example
- Exploration vs. exploitation
- On-policy learning (SARSA)

Slide 52: Learning Before vs. During Deployment
- Our learning agent can act in the environment to learn how it works (before deployment), or learn as it goes (after deployment)
- If there is time to learn before deployment, the agent should do its best to learn as much as possible about the environment, even engaging in locally suboptimal behaviours, because this will guarantee reaching an optimal policy in the long run
- If it is learning after deployment, suboptimal behaviours could be costly

Slide 53: Example
Back to the six-state world: six possible states {s0, ..., s5} and four actions:
- UpCareful: moves one tile up (unless there is a wall); always generates a penalty of -1
- Left: moves one tile left unless there is a wall, in which case it stays in the same tile if in s0 or s2, and is sent to s0 if in s4
- Right: moves one tile right (unless there is a wall)
- Up: goes up with probability 0.8 (unless there is a wall); behaves like Left with probability 0.1; behaves like Right with probability 0.1
Reward model: -1 for taking the UpCareful action; negative reward when hitting a wall, as marked on the picture; +10 for Left in s4

Slide 54: Example
- The optimal policy is to go up in s0
- But if the agent includes some exploration in its policy (e.g. selects 20% of its actions randomly), exploring in s2 could be dangerous, because it may cause the agent to hit the -100 wall
- That is no big deal if the agent is not deployed yet, but it is not ideal otherwise
- Q-learning would not detect this problem: it does off-policy learning, i.e. it ignores the policy currently followed by the agent and focuses on the optimal policy
- On-policy learning addresses this problem

Slide 55: On-Policy Learning: SARSA
- On-policy learning learns the value of the policy being followed, e.g. act greedily 80% of the time and randomly 20% of the time
- Motivation: it is better to be aware of the consequences of exploration as it happens, and to avoid outcomes that are too costly while acting, than to focus only on the true optimal policy
- SARSA uses <s, a, r, s', a'> experiences instead of just <s, a, r, s'> experiences
- Instead of looking for the best action at every step, it evaluates the actions suggested by the current policy and uses this information in its updates

Slide 56: On-Policy Learning: SARSA
- In Q-learning we assumed that the agent in s' will follow the optimal policy: Q[s,a] ← Q[s,a] + α((r + γ max_{a'} Q[s',a']) - Q[s,a])
- Given an experience <s, a, r, s', a'>, SARSA updates Q[s,a] using the information that the current policy has selected a'; so how do we update Q[s,a]?

Slide 57: On-Policy Learning: SARSA
- The SARSA update: Q[s,a] ← Q[s,a] + α((r + γ Q[s',a']) - Q[s,a])

Slide 58: Example (using SARSA with variable α_k), k = 2
Experience sequence, now including the next action: s0, right, 0, s1, upCare, -1, s3, upCare, -1, s5, left, 0, s4, left, 10, s0, right
Q-table at the start of the loop: Q[s1,upCare] = -1, Q[s3,upCare] = -1, Q[s4,left] = 10; all other entries 0.
Update rule: Q[s,a] ← Q[s,a] + α_k ((r + γ Q[s',a']) - Q[s,a]), with γ = 0.9.
- Q[s0,right]  ← 0 + 1/2*((0 + 0.9*Q[s1,upCare]) - 0) = 0 + 1/2*((0 + 0.9*(-1)) - 0) = -0.45
- Q[s1,upCare] ← -1 + 1/2*((-1 + 0.9*Q[s3,upCare]) - (-1)) = -1 + 1/2*((-1 + 0.9*(-1)) + 1) = -1.45
- Q[s3,upCare] ← -1 + 1/2*((-1 + 0.9*Q[s5,left]) - (-1)) = -1 + 1/2*((-1 + 0.9*0) + 1) = -1
- Q[s5,left]   ← 0 + 1/2*((0 + 0.9*Q[s4,left]) - 0) = 0 + 1/2*((0 + 0.9*10) - 0) = 4.5
- Q[s4,left]   ← 10 + 1/2*((10 + 0.9*0) - 10) = 10
SARSA backs up the estimated value of the action actually chosen next by the policy, rather than the best action.

Slide 59: Comparing Q-Learning and SARSA
- For this 6-state world, the policy learned by Q-learning (80% greedy) is to go up in s0, to reach s4 quickly and get the big +10 reward
- Q-values along that policy after 40,000,000 iterations:

  Iterations | Q[s0,Up] | Q[s1,Up] | Q[s2,UpC] | Q[s3,Up] | Q[s4,Left] | Q[s5,Left]
  40,000,000 |     19.1 |     17.5 |      22.7 |     20.4 |       26.8 |       23.7

Slide 60: Comparing Q-Learning and SARSA
- The policy learned by SARSA (80% greedy) is to go right in s0 when being greedy
- This is safer, because it avoids getting the -100 reward in s2 if the next action happens to be random; however, it is non-optimal, so the Q-values are lower:

  Iterations | Q[s0,Right] | Q[s1,Up] | Q[s2,UpC] | Q[s3,Up] | Q[s4,Left] | Q[s5,Left]
  40,000,000 |         6.8 |      8.1 |      12.3 |     10.4 |       15.6 |       13.2

Slide 61: Comparing Q-Learning and SARSA
[Annotations on the SARSA update:]
- The current policy could be, for instance, any ε-greedy strategy: choose a random action with frequency ε, and the "best" action otherwise
- The update uses Q[s',a'] for the action a' chosen by that policy, instead of taking the max over a'

Slide 62: Another Example
- Grid world with deterministic actions up, down, left and right
- Start from S and arrive at G (a terminal state with reward > 0)
- The reward is -1 for all transitions, except those into the region marked "Cliff": falling into the cliff sends the agent back to the start with r = -100
- [Figure: the cliff-walking grid, with S at one end of the cliff, G at the other, and the cliff between them.]

Slide 63: Another Example
Clicker question: with an ε-greedy strategy, when exploiting...
  A. SARSA will result in policy p1 while Q-learning will result in p2
  B. Q-learning will result in policy p1 while SARSA will result in p2
  C. They will both result in p1
  D. They will both result in p2
  E. They will take the red pill and exit the Matrix
[Figure: the same cliff-walking grid, showing two candidate paths p1 and p2 from S to G.]
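In code, the only difference between the two methods is the backup target in the update, as in this side-by-side sketch (illustrative signatures, using the same tabular Q as the earlier sketches):

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma=0.9):
    """On-policy: back up from the action a' the current policy actually chose in s'."""
    Q[(s, a)] += alpha * ((r + gamma * Q[(s_next, a_next)]) - Q[(s, a)])

def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma=0.9):
    """Off-policy: back up from the best action in s', whatever the policy will do next."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * ((r + gamma * best_next) - Q[(s, a)])
```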
Slide 64: Q-Learning vs. SARSA
- Q-learning learns the optimal policy, but because it does so without taking exploration into account, it does not do so well while the agent is exploring: it occasionally falls into the cliff, so its reward per episode is not that great
- SARSA has better on-line performance (reward per episode), because it learns to stay away from the cliff while exploring
- Note, however, that if ε → 0, SARSA and Q-learning would asymptotically converge to the optimal policy

Slide 65: Recommendations
- If the agent is not deployed, it should always explore (ε = 1) and use Q-learning; deploy once the Q-values have converged
- If the agent is deployed, it should use an explore/exploit strategy (e.g. ε = 0.5) and use SARSA, becoming more greedy over time

Slide 66: Aside: RL Scalability: Policy Gradient Methods
- Problems for most value-function methods (like Q-learning and SARSA): continuous states and actions in high-dimensional spaces, uncertainty about the state information, and convergence
- A popular solution: policy gradient methods. These are still limited, however: they are on-policy by definition (like SARSA), they find a local maximum (whereas value-function methods find the global one), and they have an open parameter, the learning rate, that must be set empirically
- e.g. Peters, J. and Bagnell, J. A. (2010). Policy gradient methods. Scholarpedia, 5(11), p.3698.