AI 305 Reinforcement Learning
Summary
These lecture notes provide an introduction to reinforcement learning, a machine learning technique used to train agents to make decisions in an environment to maximize a cumulative reward. The notes cover topics such as single state K-armed bandits, temporal difference learning, and the Q-learning algorithm.
Introduction to Machine Learning, AI 305: Reinforcement Learning

Introduction

We want to build a machine that learns to play chess. In this case we cannot use a supervised learner, for two reasons. First, it is very costly to have a teacher take us through many games and indicate the best move for each position. Second, in many cases there is no such thing as the best move; the goodness of a move depends on the moves that follow. A single move does not count; a sequence of moves is good if, after playing them, we win the game. The only feedback is at the end of the game, when we win or lose.

Another example is a robot placed in a maze. The robot can move in one of the four compass directions and should make a sequence of movements to reach the exit. As long as the robot is in the maze there is no feedback; the robot tries many moves until it reaches the exit, and only then does it get a reward. In this case there is no opponent, but we can have a preference for shorter trajectories, implying that we play against time.

Game playing: a sequence of moves to win a game. Robot in a maze: a sequence of actions to find a goal. The agent has a state in an environment, takes an action, sometimes receives a reward, and the state changes.

Reinforcement learning is an area of machine learning concerned with how to teach a computer agent to take the best possible actions over a long course of interaction with the environment. Unlike standard supervised learning, the learning agent in a reinforcement learning setting does not receive any strong (explicit) supervision from the environment regarding what the best action is at each step. Instead, the agent only occasionally receives some numerical rewards (positive or negative).

Reinforcement learning can be viewed as an approach that falls between supervised and unsupervised learning. It is not strictly supervised, as it does not rely only on a set of labelled training data, but it is not unsupervised learning either, because we have a reward which we want our agent to maximize. The agent needs to find the "right" actions to take in different situations to achieve its overall goal.

Single State: K-armed Bandit

Among K levers, choose the one that pays best. Q(a) is the value of action a. If the reward is deterministic, r_a, we set Q(a) = r_a and choose a* such that Q(a*) = max_a Q(a). If the rewards are stochastic, we keep an expected reward: the average of all rewards received when action a was chosen before time t. An online update can be defined as

Q_{t+1}(a) ← Q_t(a) + η [r_{t+1}(a) − Q_t(a)]

The full reinforcement learning problem generalizes this simple case in a number of ways. First, we have several states. This corresponds to having several slot machines with different reward probabilities, p(r | s_i, a_j), and we need to learn Q(s_i, a_j), the value of taking action a_j when in state s_i. Second, the actions affect not only the reward but also the next state, and we move from one state to another. Third, the rewards are delayed and we need to be able to estimate immediate values from delayed rewards.
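As a concrete illustration, here is a minimal sketch of the stochastic K-armed bandit with the online update above and an ε-greedy choice of lever (ε-greedy exploration is discussed later in these notes). The number of arms, the reward distribution, and the η and ε values are illustrative assumptions, not values from the notes.

```python
import numpy as np

# Single-state K-armed bandit sketch with the online update
# Q_{t+1}(a) <- Q_t(a) + eta * (r_{t+1}(a) - Q_t(a)).

rng = np.random.default_rng(0)
K = 5
true_means = rng.normal(0.0, 1.0, size=K)   # unknown to the agent
Q = np.zeros(K)                              # estimated action values
eta = 0.1                                    # learning rate
epsilon = 0.1                                # exploration probability

for t in range(1000):
    # epsilon-greedy: explore with probability epsilon, otherwise exploit
    if rng.random() < epsilon:
        a = int(rng.integers(K))
    else:
        a = int(np.argmax(Q))
    r = rng.normal(true_means[a], 1.0)       # stochastic reward for the chosen lever
    Q[a] += eta * (r - Q[a])                 # running-average style update

print("estimated:", np.round(Q, 2))
print("true     :", np.round(true_means, 2))
```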
Elements of RL (Markov Decision Processes)

s_t is the state of the agent at time t, and a_t is the action taken at time t. In state s_t, action a_t is taken, the clock ticks, reward r_{t+1} is received, and the state changes to s_{t+1}. The next-state probability is P(s_{t+1} | s_t, a_t) and the reward probability is p(r_{t+1} | s_t, a_t). There are initial state(s) and goal state(s), and an episode (trial) is a sequence of actions from an initial state to a goal. The problem is modeled as a Markov decision process (MDP).

Policy and Cumulative Reward

The policy, π : S → A, defines the agent's behavior and is a mapping from the states of the environment to actions: a_t = π(s_t). The value of a policy π, V^π(s_t), is the expected cumulative reward that will be received while the agent follows the policy, starting from state s_t.

Finite vs. Infinite Horizon

In the finite-horizon model, the agent tries to maximize the expected reward for the next T steps. In the infinite-horizon model, there is no sequence limit, but future rewards are discounted by a factor γ (0 ≤ γ < 1) to keep the return finite.

If γ = 0, then only the immediate reward counts. As γ approaches 1, rewards further in the future count more; we say that the agent becomes more farsighted. γ is less than 1 because there generally is a time limit to the sequence of actions needed to solve the task; the agent may be a robot that runs on a battery. We prefer rewards sooner rather than later because we are not certain how long we will survive.

Bellman's Equation

For each policy π there is a value function V^π(s_t), and we want to find the optimal policy π* such that V*(s_t) = max_π V^π(s_t) for all s_t. The value of a state is equal to the value of the best possible action:

V*(s_t) = max_{a_t} ( E[r_{t+1} | s_t, a_t] + γ Σ_{s_{t+1}} P(s_{t+1} | s_t, a_t) V*(s_{t+1}) )

To each possible next state s_{t+1}, we move with probability P(s_{t+1} | s_t, a_t), and continuing from there using the optimal policy, the expected cumulative reward is V*(s_{t+1}). We sum over all such possible next states and discount the result because it is one time step later. Adding our immediate expected reward, we get the total expected cumulative reward for action a_t. We then choose the best of all possible actions.

In some applications, for example in control, instead of working with the values of states, V(s_t), we prefer to work with the values of state-action pairs, Q(s_t, a_t). V(s_t) denotes how good it is for the agent to be in state s_t, whereas Q(s_t, a_t) denotes how good it is to perform action a_t when in state s_t. Similarly, we can write

Q*(s_t, a_t) = E[r_{t+1} | s_t, a_t] + γ Σ_{s_{t+1}} P(s_{t+1} | s_t, a_t) max_{a_{t+1}} Q*(s_{t+1}, a_{t+1})

Once we have the Q*(s_t, a_t) values, we can define our policy π as taking the action that has the highest value among all Q*(s_t, a_t):

π*(s_t) = arg max_{a_t} Q*(s_t, a_t)

This means that if we have the Q* values, then by using a greedy search at each local step we get the optimal sequence of steps that maximizes the cumulative reward.

Model-Based Learning

Here the environment, P(s_{t+1} | s_t, a_t) and p(r_{t+1} | s_t, a_t), is known, so there is no need for exploration, and the problem can be solved using dynamic programming. The optimal value function is unique and is the solution to Bellman's equations:

V*(s_t) = max_{a_t} ( E[r_{t+1} | s_t, a_t] + γ Σ_{s_{t+1}} P(s_{t+1} | s_t, a_t) V*(s_{t+1}) )

Once we have the optimal value function, the optimal policy is to choose the action that maximizes the expected immediate reward plus the discounted value of the next state:

π*(s_t) = arg max_{a_t} ( E[r_{t+1} | s_t, a_t] + γ Σ_{s_{t+1}} P(s_{t+1} | s_t, a_t) V*(s_{t+1}) )

Value Iteration

To find the optimal policy, we can use the optimal value function, and there is an iterative algorithm called value iteration that converges to the correct values. We say that the values have converged if the maximum value difference between two iterations is less than a certain threshold.
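The sketch below shows value iteration on a small, fully known MDP. The toy MDP (three states, two actions, random transitions and rewards) and the threshold are illustrative assumptions; only the Bellman backup and the convergence test follow the notes.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, gamma, threshold = 3, 2, 0.9, 1e-6

# P[s, a, s'] = P(s' | s, a), R[s, a] = E[r | s, a]  (made-up toy model)
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))

V = np.zeros(n_states)
while True:
    # Bellman optimality backup: V(s) <- max_a ( E[r|s,a] + gamma * sum_s' P(s'|s,a) V(s') )
    Q = R + gamma * P @ V            # shape (n_states, n_actions)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < threshold:   # values converged
        V = V_new
        break
    V = V_new

policy = Q.argmax(axis=1)            # greedy policy with respect to the converged values
print("V* ≈", np.round(V, 3), " policy:", policy)
```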
Policy Iteration

In value iteration, because we care only about the actions with the maximum value, it is possible that the policy converges to the optimal one even before the values converge to their optimal values. In policy iteration, we store and update the policy rather than doing this indirectly over the values.

The idea is to start with a policy and improve it repeatedly until there is no change. The value function of the current policy can be calculated by solving a set of linear equations. We then check whether we can improve the policy by taking these values into account. This step is guaranteed to improve the policy, and when no improvement is possible, the policy is guaranteed to be optimal. The time complexity of each policy iteration is higher than that of a value iteration, but policy iteration needs fewer iterations than value iteration.

Temporal Difference Learning

The model is defined by the reward and next-state probability distributions; when we know these, we can solve for the optimal policy using dynamic programming. However, these methods are costly, and we seldom have such perfect knowledge of the environment. The more interesting and realistic application of reinforcement learning is when we do not have the model.

When the environment, P(s_{t+1} | s_t, a_t) and p(r_{t+1} | s_t, a_t), is not known, we have model-free learning. There is a need to explore the environment to query the model, that is, to sample from P(s_{t+1} | s_t, a_t) and p(r_{t+1} | s_t, a_t). We explore, get to see the value of the next state and the reward, and use this information to update the value of the current state: the reward received in the next time step is used to update the value of the current state (or state-action pair). The temporal difference is the difference between the value of the current state (or state-action pair) and the reward received plus the discounted value of the next state.

Exploration Strategies: ε-greedy Search

With probability ε we choose one action uniformly at random among all possible actions, namely we explore, and with probability 1 − ε we choose the best action, namely we exploit. We do not want to continue exploring indefinitely but start exploiting once we have done enough exploration; for this, we start with a high ε value and gradually decrease it, moving smoothly from exploration to exploitation. We need to make sure that our policy is soft, that is, the probability of choosing any action a ∈ A in state s ∈ S is greater than 0. We can choose actions probabilistically, using the softmax function to convert values to probabilities:

P(a | s) = exp Q(s, a) / Σ_{b=1}^{A} exp Q(s, b)

Exploration Strategies: Annealing

To gradually move from exploration to exploitation, we can use a "temperature" variable T and define the probability of choosing action a as

P(a | s) = exp[Q(s, a) / T] / Σ_{b=1}^{A} exp[Q(s, b) / T]

When T is large, all probabilities are approximately equal and we have exploration. When T is small, better actions are favored. So the strategy is to start with a large T and decrease it gradually. This procedure is called annealing, and in this case it moves from exploration to exploitation smoothly in time.
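Here is a minimal sketch of softmax action selection with a temperature, P(a | s) = exp[Q(s, a)/T] / Σ_b exp[Q(s, b)/T], showing how the policy moves from exploration toward exploitation as T is lowered. The Q values and the temperatures are made-up illustrations.

```python
import numpy as np

rng = np.random.default_rng(3)

def softmax_policy(q_values, T):
    """Return action probabilities for one state's Q values at temperature T."""
    z = np.asarray(q_values, dtype=float) / T
    z -= z.max()                      # subtract the max for numerical stability only
    probs = np.exp(z)
    return probs / probs.sum()

q_s = [1.0, 2.0, 0.5]                 # illustrative Q(s, a) values for one state
for T in (10.0, 1.0, 0.1):
    probs = softmax_policy(q_s, T)
    a = rng.choice(len(q_s), p=probs)  # sample an action from the soft policy
    print(f"T = {T:5.1f}: P(a|s) = {np.round(probs, 3)}, sampled action = {a}")
```

With T = 10 the probabilities are nearly uniform (exploration); with T = 0.1 almost all probability mass is on the best action (exploitation).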
Deterministic Rewards and Actions

In model-free learning, we first discuss the simpler deterministic case, where for any state-action pair there is a single possible reward and next state. In this case, Bellman's equation reduces to

Q(s_t, a_t) = r_{t+1} + γ max_{a_{t+1}} Q(s_{t+1}, a_{t+1})

and it is used as an update (backup) rule for the estimated value:

Q̂(s_t, a_t) ← r_{t+1} + γ max_{a_{t+1}} Q̂(s_{t+1}, a_{t+1})

Q̂(s_{t+1}, a_{t+1}) is a later value and has a higher chance of being correct. We discount it by γ, add the immediate reward (if any), and take this as the new estimate for the previous value. This is called a backup because it can be viewed as taking the estimated value of an action in the next time step and "backing it up" to revise the estimate for the value of the current action. Starting at zero, the Q values increase and never decrease.

As an example, consider a deterministic grid world where G is the goal state with reward 100, all other immediate rewards are 0, and γ = 0.9. Consider the Q value of the transition marked by an asterisk, and consider only the two paths A and B. If path A is seen first, Q(*) = 0.9 × max(0, 81) ≈ 73; when B is then seen, Q(*) = 0.9 × max(100, 81) = 90. Or, if path B is seen first, Q(*) = 0.9 × max(100, 0) = 90; when A is then seen, Q(*) = 0.9 × max(100, 81) = 90. The Q values increase until they reach their optimal values as we find paths with higher cumulative reward, for example shorter paths, but they never decrease.

We do not know the reward or next-state functions here. They are part of the environment, and it is as if we query them when we explore. We are not modeling them; we just accept them as given and learn the optimal policy directly through the estimated value function.

Nondeterministic Rewards and Actions

When next states and rewards are nondeterministic (there is an opponent or randomness in the environment), we keep averages (expected values) instead of making direct assignments. For example, we may have an imperfect robot which sometimes fails to go in the intended direction and deviates, or advances a shorter or longer distance than expected. We cannot do a direct assignment in this case because, for the same state and action, we may receive different rewards or move to different next states. What we do is keep a running average. This is known as the Q-learning algorithm:

Q̂(s_t, a_t) ← Q̂(s_t, a_t) + η ( r_{t+1} + γ max_{a_{t+1}} Q̂(s_{t+1}, a_{t+1}) − Q̂(s_t, a_t) )

Q-Learning

We think of the r_{t+1} + γ max_{a_{t+1}} Q̂(s_{t+1}, a_{t+1}) values as a sample of instances for each (s_t, a_t) pair, and we would like Q̂(s_t, a_t) to converge to their mean. η is gradually decreased in time for convergence, and it has been shown that this algorithm converges to the optimal Q* values. The update can be seen as reducing the difference between the current Q value and the backed-up estimate from one time step later; such algorithms are called temporal difference (TD) algorithms. The analogous TD update for state values is

V(s_t) ← V(s_t) + η ( r_{t+1} + γ V(s_{t+1}) − V(s_t) )

Q-learning is an off-policy method, as the value of the best next action is used without using the policy. In an on-policy method, the policy is also used to determine the next action. The on-policy version of Q-learning is the Sarsa algorithm.

Sarsa

Instead of looking at all possible next actions and choosing the best, the on-policy Sarsa algorithm uses the policy derived from the Q values to choose one next action a_{t+1} and uses its Q value to calculate the temporal difference. On-policy methods estimate the value of a policy while using it to take actions. In off-policy methods, these are separated: the policy used to generate behavior, called the behavior policy, may be different from the policy that is evaluated and improved, called the estimation policy.
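The following sketch shows tabular Q-learning with ε-greedy exploration. The toy episodic environment (a five-state corridor whose rightmost state is a goal with reward 100) and the η, ε, and γ values are illustrative assumptions; only the Q-learning backup itself follows the notes.

```python
import numpy as np

rng = np.random.default_rng(2)
n_states, n_actions = 5, 2          # actions: 0 = left, 1 = right
goal = n_states - 1                 # reaching the rightmost state gives reward 100
gamma, eta, epsilon = 0.9, 0.5, 0.1
Q = np.zeros((n_states, n_actions))

def step(s, a):
    """Deterministic corridor dynamics: move left or right, reward only at the goal."""
    s_next = max(0, s - 1) if a == 0 else min(goal, s + 1)
    r = 100.0 if s_next == goal else 0.0
    return s_next, r, s_next == goal

for episode in range(200):
    s, done = 0, False
    while not done:
        # epsilon-greedy action selection
        a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s_next, r, done = step(s, a)
        # Q-learning (off-policy) backup:
        # Q(s,a) <- Q(s,a) + eta * (r + gamma * max_a' Q(s',a') - Q(s,a))
        target = r + gamma * np.max(Q[s_next]) * (not done)
        Q[s, a] += eta * (target - Q[s, a])
        s = s_next

print(np.round(Q, 1))   # Q values along the corridor approach 100 * gamma^k
```

Changing the target to the Q value of the action actually chosen in s_next (by the same ε-greedy policy) instead of the max turns this off-policy update into the on-policy Sarsa update.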
Generalization

Until now, we assumed that the Q(s, a) values (or the V(s) values, if we are estimating values of states) are stored in a lookup table, and the algorithms considered so far are called tabular algorithms. There are a number of problems with this approach: when the number of states and the number of actions is large, the table may become very large; states and actions may be continuous, for example turning the steering wheel by a certain angle, and to use a table they must be discretized, which may cause error; and when the search space is large, too many episodes may be needed to fill in all the entries of the table with acceptable accuracy.

Instead, we can consider this a regression problem. We define a regressor Q(s, a | θ), taking s and a as inputs and parameterized by a vector of parameters θ, to learn the Q(s, a) or V(s) values. The squared temporal difference serves as the error to be minimized:

E_t(θ) = [r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)]²

Partially Observable States

The agent does not know its state but receives an observation, drawn from p(o_{t+1} | s_t, a_t), which can be used to infer a belief about the states. This setting is modeled as a partially observable MDP (POMDP).

The Tiger Problem

Let us say we are standing in front of two doors, one to our left and one to our right, leading to two rooms. Behind one of the two doors, we do not know which, there is a tiger, and behind the other there is a treasure. If we open the door of the room where the tiger is, we get a large negative reward, and if we open the door of the treasure room, we get some positive reward. The hidden state is the location of the tiger. Let p denote the probability that the tiger is in the room to the left; the tiger is then in the room to the right with probability 1 − p.

The two actions correspond to opening the left door or the right door. Given the reward of each action for each possible tiger location, we can calculate the expected reward for the two actions. There are no future rewards, because the episode ends once we open one of the doors.
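As a worked illustration of this expected-reward calculation, the sketch below weighs the reward of each door by the belief p. The numerical rewards (−100 for opening the tiger's door, +80 for opening the treasure's door) are assumptions made for the example, not values from the notes.

```python
# Expected reward of the two actions in the tiger problem as a function of the
# belief p that the tiger is behind the LEFT door.
# The reward values (-100 and +80) are assumed for illustration only.

def expected_rewards(p, r_tiger=-100.0, r_treasure=80.0):
    # Opening the left door: tiger with probability p, treasure with 1 - p.
    r_open_left = p * r_tiger + (1 - p) * r_treasure
    # Opening the right door: treasure with probability p, tiger with 1 - p.
    r_open_right = p * r_treasure + (1 - p) * r_tiger
    return r_open_left, r_open_right

for p in (0.0, 0.3, 0.5, 0.7, 1.0):
    left, right = expected_rewards(p)
    best = "open left" if left > right else "open right"
    print(f"p = {p:.1f}: E[r | open left] = {left:6.1f}, "
          f"E[r | open right] = {right:6.1f} -> {best}")
```

When p is close to 0 or 1 the better door has a high expected reward, while near p = 0.5 both actions have the same expected reward, which is negative under these assumed reward values.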