Reinforcement Learning for Control and Design Optimization
Bojana Rosic
Summary
This document presents a lecture series on reinforcement learning (RL) for control and design optimization. It introduces basic RL concepts and types of RL algorithms, explains how RL differs from traditional control methods, and discusses the use of RL in design optimization, illustrated with examples such as Tic-Tac-Toe.
Full Transcript
Reinforcement learning for control and design optimization. Bojana Rosic, AMDA Group, ET.

Learning objectives. After this lecture you will: understand the term reinforcement learning (RL); understand and recall the basic terms and elements of RL; be able to list the types of RL algorithms; understand the difference between RL and control; and understand how RL can be used for design optimization.

General course info. Frequency: 2 lectures + 2 tutorials per week. Lecture format: slides, no lecture notes (see the literature). Tutorials: solve the assignment and pose relevant questions to the teachers. Programming language: Python + OpenAI. Assignments: 4 homeworks; to pass, 60% of the total number of points over all assignments is required. Self-assessment. Exam: a written exam that includes both theory and coding. AI use: in this course, the use of any AI bots or automated programs for coding or for writing reports is not allowed.

What is reinforcement learning? The term reinforcement was introduced by Pavlov in 1903. Reinforcement describes the strengthening of the association between an unconditioned and a conditioned stimulus that results when the two are presented together. [photo: Ivan Petrovich Pavlov, wiki] Unconditioned stimulus: dogs do not need to learn to salivate when they see food. Conditioned stimulus: by repeated presentation of the conditioned stimulus paired with the unconditioned one, the dog learns to salivate when it only hears the bell.

Reinforcement learning. Edward Thorndike (1898) introduced the concept. [photo: Edward Thorndike; Imada & Imada, 1983]

Law of effect. Selectional (also called search): try alternatives and select among them by comparing their consequences. Associative (also called memory): remember which actions worked best and associate them with the situations in which they were best. Combining search and memory in this way is essential to reinforcement learning.

But what does this have to do with engineering? How is this achieved? [figure: time of each trial and cumulative reward over trials for a learning agent]

Your turn. Walk from Start to the Exit button along the shortest path, without hitting a mine and while collecting the maximum number of points. Moving in any direction is equally probable.

RL elements. Agent: you. Current state: one of the 5x5 environment states. Action: obtained through a policy, which maps the state to an action. Cumulative reward.

RL goal. Make an autonomous agent that maximizes the cumulative reward! [figure: after maximizing reward]

Tic-Tac-Toe. (Play retro Tic-Tac-Toe online for free at playtictactoe.org.) Let us consider draws and losses to be equally bad for us. How might we construct a player that will find the imperfections in its opponent's play and learn to maximize its chances of winning? We do not assume a particular way of playing by the opponent.

Tic-Tac-Toe: classical optimization. A sequential decision method such as dynamic programming can compute an optimal solution for any opponent, but it requires as input a complete specification of that opponent, including the probabilities with which the opponent makes each move in each board state. Such information can be estimated from experience, in this case by playing many games against the opponent. Hence, one would first have to learn a model of the opponent's behavior, up to some level of confidence, and then apply optimization.
Let us assume that this information is not available a priori for this problem, as it is not for the vast majority of problems of practical interest.

Minimax. A two-player zero-sum game means that there are two players whose actions alternate, the game is fully observable and deterministic, each player knows all possible moves the opponent can make (unlike Scrabble, for example), and the objective values of the game states for the two players are opposite numbers. The goal of minimax is to choose the optimal move while assuming that the opponent also plays optimally. Scores: win +1, loss -1, draw 0. Intermediate scores: if it is X to move, we pick the child with the highest score and assign that score to the node (X picks a move that maximizes its score); if it is O, we do the opposite and pick the child with the lowest score. The intuition is that O will play optimally and pick a move that results in the least benefit to X (i.e., it minimizes the score). In theory we can therefore precompute the best move at each game state and store it in a lookup table, which gives a fast (and predictable) bot that is unbeatable. However, this assumes that the opponent is a perfectly rational agent who always performs optimal actions.

Reinforcement learning. Reinforcement learning is a type of learning in which an agent learns to perform a task through repeated trial-and-error (search-and-memory) interactions with a dynamic environment. An RL agent is a complete, interactive, goal-seeking agent: all reinforcement learning agents have explicit goals, can sense aspects of their environments, and can choose actions to influence their environments. At every time step the agent perceives the state of the environment; based on this perception the agent takes an action; this action causes the agent to receive a numerical reward. The task is to find a way of choosing actions, called a policy, which maximizes the agent's long-term expected return.

RL elements. Environment, (current) state, action, reward (here 1, which is also the end of the game).

RL policy. A policy defines the learning agent's way of behaving at a given time; it is a mapping from perceived states of the environment to actions to be taken when in those states (policy: state to action).

Value of state. The value of a state is the total amount of cumulative reward, i.e., the total reward an agent can expect to accumulate over the future, starting from that state; it is a long-term quantity. A state might always yield a low immediate reward but still have a high value because it is regularly followed by other states that yield high rewards, or the reverse could be true.
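To make the minimax recursion described earlier in this section concrete, here is a minimal sketch for tic-tac-toe. The board encoding (a list of 9 cells holding 'X', 'O', or None) and the helper names are assumptions made for the example, not details taken from the slides.

def winner(board):
    # Return 'X' or 'O' if that player has three in a row, otherwise None.
    lines = [(0,1,2), (3,4,5), (6,7,8), (0,3,6), (1,4,7), (2,5,8), (0,4,8), (2,4,6)]
    for a, b, c in lines:
        if board[a] is not None and board[a] == board[b] == board[c]:
            return board[a]
    return None

def minimax(board, player):
    # Score of the position from X's point of view: win +1, loss -1, draw 0.
    w = winner(board)
    if w is not None:
        return 1 if w == 'X' else -1
    moves = [i for i, cell in enumerate(board) if cell is None]
    if not moves:
        return 0                                   # board full: draw
    scores = []
    for i in moves:
        board[i] = player
        scores.append(minimax(board, 'O' if player == 'X' else 'X'))
        board[i] = None                            # undo the move
    # X picks the child with the highest score, O the one with the lowest.
    return max(scores) if player == 'X' else min(scores)

def best_move(board, player='X'):
    # Evaluate every legal move with minimax and play optimally.
    def score(i):
        board[i] = player
        s = minimax(board, 'O' if player == 'X' else 'X')
        board[i] = None
        return s
    moves = [i for i, cell in enumerate(board) if cell is None]
    return max(moves, key=score) if player == 'X' else min(moves, key=score)

print(best_move([None] * 9))   # prints 0: with two optimal players every opening draws, so the first tying move is returned

Precomputing such scores for every reachable position is exactly the lookup-table bot described above; the sketch simply recomputes them on demand.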
Value and action. Action choices are made based on value judgments. We seek actions that bring about states of highest value, not highest reward, because these actions obtain the greatest amount of reward for us over the long run.

Exploitation vs exploration. To obtain a lot of reward, a reinforcement learning agent must prefer actions that it has tried in the past and found to be effective in producing reward (exploitation). But to discover such actions, it has to try actions that it has not selected before (exploration). The agent has to exploit what it already knows in order to obtain reward, but it also has to explore in order to make better action selections in the future. Going all out in either one of them is harmful: pure exploitation may lead to a suboptimal agent, and pure exploration would just give us an agent that keeps taking random actions.

Why exploitation and exploration? Slot machines: each machine has its own probability of winning. If you constantly played one machine (exploitation) you would never learn anything about the odds of the other machines. If you always picked a machine at random (exploration) you would learn a lot about the odds of each machine, but you probably wouldn't make as much money as you could have by always playing the "best" machine.

A clinical example is the debate over whether to prescribe available therapies, such as quinine-based antimalarial drugs (e.g., chloroquine or hydroxychloroquine), or to test these drugs in randomized clinical trials (RCTs). Exploitation: acting on current knowledge, habits, or beliefs despite uncertainty. This is the "just do it" option: give various therapies (e.g., chloroquine) to affected patients based on current knowledge or a hunch. Exploration: actions taken to generate new knowledge and reduce uncertainty, e.g., testing therapies in an RCT.

A research example: if you constantly read the existing theories (exploitation) you would never have time to do your own research (problem solving); if you always spend time on your own research (exploration) you would learn a lot about the problem, but you probably would not reach as good a solution as you would with better knowledge of the theory.

For tic-tac-toe: first, an agent is trained playing random moves; then the agent is trained playing the best moves it has learnt; finally, the agent is trained choosing a mixture of random and best moves (a short code sketch of this trade-off follows).
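To make the exploitation/exploration trade-off concrete, here is a minimal sketch of an epsilon-greedy strategy for the slot-machine example above; the win probabilities, the value of epsilon, and the number of plays are illustrative assumptions, not numbers from the slides.

import random

true_probs = [0.3, 0.5, 0.7]      # hypothetical win probabilities, unknown to the agent
counts = [0, 0, 0]                # how often each machine has been played
values = [0.0, 0.0, 0.0]          # estimated value (average payout) of each machine
epsilon = 0.1                     # fraction of the time we explore

for t in range(10_000):
    if random.random() < epsilon:
        a = random.randrange(len(true_probs))       # explore: pick a machine at random
    else:
        a = values.index(max(values))               # exploit: pick the current best machine
    reward = 1.0 if random.random() < true_probs[a] else 0.0
    counts[a] += 1
    values[a] += (reward - values[a]) / counts[a]   # incremental average of the payouts

print(values, counts)   # estimates approach the true odds; the best machine is played most often

With epsilon = 0 (pure exploitation) the agent can lock onto a poor machine; with epsilon = 1 (pure exploration) it learns the odds but earns much less than it could.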
Model of environment. A model of the environment mimics the behavior of the environment or, more generally, allows inferences to be made about how the environment will behave: it predicts the expected next state and the next reward based on the previous state and the action. RL methods can thus be classified as model-based (the ground-truth model is known, or is learned if not available; prone to bias) and model-free. [figure: RL classification, with the focus of this course indicated]

Back to the example. Going back to the tic-tac-toe problem: environment, (current) state, action, reward (here 1, which is also the end of the game). Now play a lot of games. The next step, from the current state to the next state, can be chosen in two ways. Exploitation: just pick the next state with the highest probability of winning, i.e., the largest V(s), also known as the greedy move. Exploration: or pick the next state randomly.

Tic-tac-toe: state value update. While we are playing, we change the values of the states in which we find ourselves during the game. This can be done in many ways, but one example is the temporal-difference rule

V(s) <- V(s) + α [ V(s') - V(s) ],

in which V(s) is the value of the state, s is the current state, s' is the state after the move, and α is the step-size parameter. The values of past states change every time we make a move, hence they need to be re-evaluated constantly. When the step-size parameter is reduced properly over time, this method converges, for any fixed opponent, to the true probabilities of winning from each state given optimal play by our player. If the step-size parameter is not reduced all the way to zero over time, then this player also plays well against opponents that slowly change their way of playing (a short code sketch of this update follows at the end of this passage).

This simple example illustrates some of the key features of reinforcement learning methods. First, there is the emphasis on learning while interacting with an environment, in this case with an opponent player. Second, there is a clear goal, and correct behavior requires planning or foresight that takes into account the delayed effects of one's choices. The effects of planning and lookahead are achieved without using a model of the opponent and without conducting an explicit search over possible sequences of future states and actions.

When to use RL? Reinforcement learning is not restricted to two-person games: it also applies when there is no external adversary, that is, in a "game against nature". It applies to problems in which behavior breaks down into separate episodes with a reward only at the end of each episode, but it is just as applicable when behavior continues indefinitely and rewards of various magnitudes can be received at any time. It applies to problems with a relatively small, finite state set, but it can also be used when the state set is very large, or even infinite.

But what does this have to do with engineering? An example is low-level control of a quadrotor with reinforcement learning.

RL vs control. Reinforcement learning maximizes the state value, whereas optimal control minimizes the state cost. Agent = decision maker or controller; action = decision or control; environment = dynamic system (plant); state = state of the system; state value function = (the opposite of) the cost function; reward of a state = (the opposite of) the cost of that state. In RL the environment is generally thought of as a sort of black box.

As an example, we want to fly a quadrotor floating at 1 meter off the ground to 2 meters off the ground. We increase the propeller speed to create an upward force; the propeller speed is the input u (the action).
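Before continuing with the quadrotor example, here is a simplified, self-contained sketch of the temporal-difference value update described above, for an X player learning against a random opponent. The board encoding, the epsilon-greedy exploration, and the choice to apply the updates at the end of each game are illustrative assumptions rather than details from the slides.

import random

LINES = [(0,1,2), (3,4,5), (6,7,8), (0,3,6), (1,4,7), (2,5,8), (0,4,8), (2,4,6)]

def winner(board):
    # Return 'X' or 'O' if that player has three in a row, otherwise None.
    for a, b, c in LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

V = {}                        # state value: estimated probability that X wins
alpha, epsilon = 0.1, 0.1     # step size and exploration rate (illustrative values)

def value(board):
    s = ''.join(board)
    if s not in V:
        w = winner(board)
        if w == 'X':
            V[s] = 1.0                          # win
        elif w == 'O' or ' ' not in board:
            V[s] = 0.0                          # loss or draw, treated as equally bad
        else:
            V[s] = 0.5                          # unknown non-terminal state
    return V[s]

def x_move(board):
    moves = [i for i, c in enumerate(board) if c == ' ']
    if random.random() < epsilon:
        return random.choice(moves)             # exploration: random move
    # exploitation (greedy move): go to the after-state with the largest V(s)
    return max(moves, key=lambda i: value(board[:i] + ['X'] + board[i+1:]))

for game in range(20000):
    board, states = [' '] * 9, []
    while True:
        board[x_move(board)] = 'X'
        states.append(''.join(board))
        if winner(board) or ' ' not in board:
            break
        empty = [i for i, c in enumerate(board) if c == ' ']
        board[random.choice(empty)] = 'O'       # opponent plays uniformly at random
        states.append(''.join(board))
        if winner(board):
            break
    # temporal-difference updates V(s) <- V(s) + alpha [V(s') - V(s)] along the game
    for s, s_next in zip(states[:-1], states[1:]):
        V[s] = value(list(s)) + alpha * (value(list(s_next)) - value(list(s)))

best_first = max(range(9), key=lambda i: value([' '] * i + ['X'] + [' '] * (8 - i)))
print(best_first, value([' '] * best_first + ['X'] + [' '] * (8 - best_first)))
# prints the learned best opening move for X and its estimated winning probability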
The state is the position plus the velocity, collected in a vector x. The dynamics then follow Newton's laws: a deterministic function f tells us what the next state will be, given the current state, the current input, and an error (noise) signal.

Linear control. The quadrotor dynamics is linear, of the kind x_{k+1} = A x_k + B u_k, in which the state x contains position and velocity. Optimal control has as its goal to find a set of inputs that minimizes some objective. The Linear Quadratic Regulator (LQR) is the generic form: minimize a quadratic cost of the type Σ_k (x_kᵀ Q x_k + u_kᵀ R u_k) subject to the linear dynamics. Due to noise, its stochastic version reads x_{k+1} = A x_k + B u_k + w_k, in which w_k is the noise term, and the expected cost is minimized.

RL for control. The same linear quadrotor dynamics can be used, but RL has as its goal to find a set of actions that maximizes the cumulative reward given the current reward. Optimal control and reinforcement learning therefore optimize different functions (not the same function!), and reinforcement learning additionally assumes that we do not know the model of the system.

Why RL and not control? Control can fit a dynamics model after running experiments to explore what the system is capable of under different input settings; there are usually only a few parameters needed for actual control, and these can be estimated individually from simple calibration experiments. RL can do that too, but it can also handle more complicated systems for which it might be hard to even write out a parametric model in closed form; in this case, an alternative approach is to disregard models altogether. RL could also be helpful if the model somehow changes over time due to some sort of unmodeled drift. But also note that this purely reactive RL approach to control disregards crucial information about time-dependencies in the inference task, and it also demands that you disregard 400 years of enlightenment physics in your control design plan.

RL for optimal design. Optimal truss design is a multiobjective optimization problem: find a Pareto set by simultaneously solving volume minimization and maximum stress endurance. In the reinforcement-learning reformulation, node and member state data describe the truss. When the truss violates stress or displacement constraints, a reward of -1 is provided and member removal is terminated; otherwise a greedy or random member is removed and the reward is evaluated.

What to expect in this course? CV, ML, a project, RL: gain expertise in RL in order to be able to transfer it to an Industry 4.0 case scenario.

Conclusions. In this lecture we have studied the definition of reinforcement learning, the elements of RL and the state value function, the types of RL, the relationship between RL and optimal control, and the relationship between RL and design. In the next lecture we will start with the basic idea behind RL, and hence with Markov decision problems. Any questions?

Reinforcement learning for control and design optimization, Lecture 2. Bojana Rosic, AMDA Group, ET. In this lecture we will talk about: defining the value of a state as a conditional expectation; probability theory: conditional probability; probability theory: conditional expectation; the Markov chain process; the Markov reward process; and the Markov decision process.

Recap: RL. The agent observes a state s, takes an action a, receives a reward r from the environment, and ends up in a new state s'. Goal: we seek actions that bring about states of highest value, not highest reward, because these actions obtain the greatest amount of reward for us over the long run.

Motivating example. A pole is hinged on a cart, and the cart must be moved in order to keep the pole in the vertical position.
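The course lists Python + OpenAI as its tooling; assuming this refers to the Gym toolkit (here its maintained successor, gymnasium), a minimal sketch of interacting with a cart-pole environment under a random policy looks as follows. Note that the standard CartPole-v1 environment uses a discrete push-left/push-right action, a simplification of the continuous force discussed below.

import gymnasium as gym

env = gym.make("CartPole-v1")     # state: cart position/velocity, pole angle/angular velocity
obs, info = env.reset(seed=0)

total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()                 # random policy: push left or right
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward                             # +1 for every step the pole stays up
    done = terminated or truncated

print("episode return under a random policy:", total_reward)
env.close()

A random policy typically keeps the pole up for only a few dozen steps, which motivates the question asked next: what is the value of a state when the actions are chosen at random?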
[figure: the cart-pole environment and a few of its states]

RL elements. Move the cart and achieve the vertical pole position; to do so we can apply a force, so the action is the force. The reward is the cosine of the pole angle, and we judge a state by its value, which is defined through the cumulative reward.

Random policy. We can try random forces. What does "random" mean? Can we find the optimal policy by playing randomly? What is the value of a state in such a case? To answer this we need a short introduction to probability theory.

Probability space. A probability space is a formal model of a random phenomenon, also known as an "experiment" or our "world"/"universe". It is defined as a triple of a sample space, a sigma algebra of events, and a probability measure. The random variable in this example is discrete.

Sigma algebra. If A is an event, then its complement (the rest of the space) is an event as well. If A and B are two events, then their union and their intersection are also events.

Probability measure. The normalized area of the full circle is 1; if we interpret the probability measure of an event as the portion of its area compared to the full area, then the probabilities of disjoint events add up. [figure: disjoint events partitioning the interval from 0 to 1 at 1/3 and 2/3]

Example of a discrete random variable in RL: the action can be modeled as a discrete random variable. Random policy: we can try random discrete forces; what is the value of a state in such a case?

Expectation. For a discrete random variable X taking values x_i with probabilities p_i, the expectation is E[X] = Σ_i x_i p_i. But this is not a conditional expectation!

Conditional expectation. The conditional expectation E[X | A] weights the values of X with their conditional probabilities given the event A; this is the same situation as in our case.

Probability. Consider the numbers 2 through 9. Event A: all odd numbers; event B: all square numbers; A ∩ B: all numbers that are both square and odd. Imagine that we know that A has occurred, and the question is whether we can compute the probability of B, i.e., the probability of square numbers. The unconditional P(B) describes the probability of B regardless of whether A, or A and B, occurred.

Conditional probability. In general we search for the probability of B given A. If A occurred, then B will occur only if A ∩ B occurs. Hence P(B | A) = P(A ∩ B) / P(A). Note that in this example P(B | A) = P(B), due to independence; otherwise the two would differ. Example: draw a ball from a bag. Event A: draw a red ball and return it to the bag; event B: draw a red ball again; here the second draw does not depend on the first. If instead we keep the first ball (drawing without replacement), the second draw does depend on the first. Writing the conditional probability of B given A and of A given B, the joint probability must be the same in both cases: P(A ∩ B) = P(B | A) P(A) = P(A | B) P(B). The last two relations give us the famous formula P(A | B) = P(B | A) P(A) / P(B), also known as Bayes's rule.

Conditional expectation. Hence, for the discrete case we can use the formula E[X | A] = Σ_i x_i P(X = x_i | A), which can then also be used for estimating the value of a state in RL if our action is a discrete random variable.

However, a discrete force is not so realistic. How do we know that we need 50 N? Maybe 45.6 N is better? As we do not know, we need to assume the force to be continuous and probabilistic (uncertain; we need to vary it).

Continuous probability measure. The force can be any number between -50 N and 50 N. How do we assign probabilities to the particular numbers representing the force? The problem is how to measure a length if we cannot measure a point. We know that the total probability equals 1 if F is a random variable on a probability space; from the probability per unit length we obtain the probability density function.

Uniform distribution. In the general case, the uniform density on an interval [a, b] is 1/(b - a) on the interval and 0 outside it. Is there any other choice for the probability density function?
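Before looking at other density choices, here is a small numeric check of the discrete conditional-probability and conditional-expectation formulas above, using the numbers 2 through 9 from the slides; the uniform weighting over these eight numbers is an assumption made for the example.

from fractions import Fraction

numbers = [2, 3, 4, 5, 6, 7, 8, 9]                      # the set shown on the slides
A = {n for n in numbers if n % 2 == 1}                  # event A: odd numbers -> {3, 5, 7, 9}
B = {n for n in numbers if round(n ** 0.5) ** 2 == n}   # event B: square numbers -> {4, 9}

P = lambda event: Fraction(len(event), len(numbers))    # uniform probability measure

P_B_given_A = P(A & B) / P(A)                           # P(B|A) = P(A and B) / P(A)
print(P(B), P_B_given_A)          # 1/4 and 1/4: in this example A and B are independent

# Bayes's rule: P(A|B) = P(B|A) P(A) / P(B)
print(P_B_given_A * P(A) / P(B) == P(A & B) / P(B))     # True

# conditional expectation of the drawn number given event A (odd numbers)
E_given_A = sum(Fraction(n) * (P({n} & A) / P(A)) for n in numbers)
print(E_given_A)                  # (3 + 5 + 7 + 9) / 4 = 6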
Probability density function: Gaussian distribution. With a Gaussian density for the force F, every interval of force values has non-zero probability, hence the force is not bounded! [figure: Gaussian probability density of the force F]

Some choices of probability density: the uniform density is bounded but non-informative; the lognormal density is positive; the Gaussian bell curve is not bounded but informative.

Random policy. We can try random forces. How do we compute the conditional expectation in the case of a continuous random variable?

Conditional expectation. In the discrete case we had a sum of the values weighted by conditional probabilities; in contrast, in the continuous case the expectation is computed as an integral of the values weighted by the conditional probability density.

Joint probability density. The joint density of X and Y integrates to 1 (its volume is 1) and describes the dependence between X and Y. The marginal densities of X and of Y (each with area 1) describe the individual behavior of X and Y. The conditional probability density is obtained by cutting the joint density at a fixed value of one variable; just by cutting we do not get an area equal to 1, and we need one for a conditional density, so the slice is normalized by the corresponding marginal. This normalization gives the conditional density and, as in the discrete case, Bayes's rule.

Back to RL. The conditional expectation computed with the conditional density can now be used for estimating the value of a state, so we are now ready to start estimating it by the various methods proposed in RL.

Cumulative reward. Instead of the classical sum of rewards R, we more often focus on the discounted cumulative reward (also known as the discounted return), G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ... The idea is to focus the agent more on the current reward and less on the future ones; for this we define the discount factor γ. Note that this factor will influence our optimization problem, and hence the choice of actions: if it is equal to zero, we have instant gratification; if it is close to one, we pay attention mostly to future rewards.

Continuous versus finite RL. An episodic RL problem breaks into separate episodes, while a continuing RL problem goes on indefinitely.

RL assumptions. In order to study the maximization of rewards, we need to assume something about the environment for reinforcement learning: the environment is fully observable (when an agent can determine the state of the system at all times, it is called fully observable; for example, in a chess game the state of the system, that is, the position of all the pieces on the chess board, is available the whole time, so the player can make an optimal decision), and the current state completely characterizes the dynamic process.

Markov chain process. The probability of the current state given the previous one captures the whole history; in other words, you do not need to know the complete history in order to know the current state. Hence, a Markov process is memoryless: "the future depends on the past through the present" (Andrey Markov). The transition from one state to another in time is defined by a transition matrix in which each row sums up to 1. [figure: samples of a chain and its transition matrix]

Counter-examples: the environment may be only partially observable, or non-Markovian, in the case of sensor errors, delays, friction, etc.

Markov chain properties. [figure: a three-state chain with transition probabilities 0.3, 0.4, 0.6, 1, 0.4, 0.3] Absorbing state: state 3, whose transition probability to any other state is 0. Positive probability: a probability bigger than zero, e.g., for the transition from 1 to 2. Reachability: we can reach state j from state i, e.g., 2 from 1, but not from 3. Two states communicate if we can reach state j from state i and vice versa, e.g., states 1 and 2. State i is transient if we can reach state j from state i but not vice versa, e.g., 3 from 2 but not 2 from 3. If a state is not transient, then it is a recurrent state.
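Before continuing with the remaining chain properties, here is a minimal sketch of sampling a Markov chain from a row-stochastic transition matrix; the matrix entries are made-up values, not the ones from the slide's figure.

import numpy as np

# Hypothetical 3-state transition matrix; each row sums to 1.
P = np.array([
    [0.3, 0.6, 0.1],
    [0.4, 0.3, 0.3],
    [0.0, 0.0, 1.0],    # the last state is absorbing: it never leaves itself
])

rng = np.random.default_rng(0)

def sample_chain(P, start, steps):
    # The next state depends only on the current one (memorylessness).
    states = [start]
    for _ in range(steps):
        states.append(rng.choice(len(P), p=P[states[-1]]))
    return states

print(sample_chain(P, start=0, steps=10))   # a sampled trajectory that eventually gets absorbed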
Markov chain properties. State i is periodic if all paths leading from i back to state i have a length that is a multiple of some k > 1; absorbing states are aperiodic. A Markov chain is irreducible if it is possible to get to any state from any state in a finite time. If all states in the chain are recurrent, aperiodic, and communicate with each other (irreducible), then the chain is said to be ergodic. Ergodic Markov chains have a unique steady state, which is also called the stationary distribution. [figures: three-state example chains that are or are not irreducible and ergodic, and a sequence of state-transition diagrams]

Markov reward process. A Markov reward process is a Markov chain with a reward function; it is defined by a tuple consisting of the state set, the transition matrix, the reward function, and the discount factor. The return G_t is the total discounted reward from time step t.

State value function. A value function specifies what is good in the long run. The value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state; it is a long-term value. A state might always yield a low immediate reward but still have a high value because it is regularly followed by other states that yield high rewards, or the reverse could be true. The first thing we need to ask ourselves is: what is the quality of the current state? To judge the state we need to know how big a reward we can expect, and we define the value of the state as the conditional expectation of the return given that state, V(s) = E[G_t | S_t = s]. Written out, the value of a state satisfies V(s) = R(s) + γ Σ_{s'} P(s' | s) V(s'), which is known as the Bellman equation (Richard Ernest Bellman). Here we may conclude that the value of the current state depends on the immediate reward, the discounted value of the successor state, and the transition probability.

Conclusion. In this lecture we have studied: the definition of a policy, the definition of the value of a state, Markovian dynamics, a Markovian dynamics system with reward, and the Markovian decision process. Any questions?

Reinforcement learning: Dynamic Programming, Part 1 (Mark Vlutters).

Questions to you! True or false? In a Markov decision process, when we need to decide what to do "now", we don't care about the past. True or false? In a Markov decision process, when we need to decide what to do "now", we don't care about the future. True or false? In a Markov decision process, we can stay in the same state.
True or false? When a robot walks through a field of grass, the robot is the "agent" and the field is the "environment". True or false? To realize desired behavior in an agent, we can formulate an objective function describing a "cost", and then have the agent minimize its costs.

Overview. The agent observes the state of the environment; the policy takes the observed state as input and outputs the "best" action, judged by an objective function. The objective function is the core of the resulting behavior: it defines rewards (or costs) for a given action in some environment, and it must capture the intended objective (goal). For example, "get as close as possible to ..." becomes "reward small distance from ...", and "get as fast as possible" becomes "reward short times to ...". The agent may have limited knowledge, so a "better" policy may exist once we learn more about the environment. Typically, in RL, we want to approximate the optimal policy, i.e., the best possible decisions given all possible knowledge; the agent may not have all possible knowledge, hence it must learn from interacting with the environment, observing and evaluating the outcome.

Dynamic programming. We want to find the optimal policy, given that everything is known about the environment. We model the environment and the allowed actions as an MDP with a discrete (finite) number of states and a discrete (finite) number of actions.

Dynamic Programming, Part 2 (Mark Vlutters).

Graph analogy. To get from one environment state to the next, an agent can take an action, at the gain of some reward or the expense of some cost, which accumulates over (time) steps. Each node carries a state reward or cost and each edge an action cost or reward. [figure: a small example graph]

Example 1: what is the best path? Maximize the reward from A to F with 6 possible states, only state rewards, and deterministic transitions over steps k = 0 to 3. [figure: graph with node rewards A = 0, B = 4, C = 1, D = 5, E = 4, F = 1] The two candidate paths give ABDF = 10 and ACEF = 6.

Example 2: what is the best path from A to J? With more states (B, C, D, E, F, G, H, I between A and J over steps k = 0 to 4), enumerating all paths is getting out of hand; we need a smarter way. Is greedy (always pick the highest next reward) good enough? Greedy yields ABEIJ = 14, but the optimal path is ADGIJ = 16, so greedy is NOT guaranteed to work.

Bellman's optimality principle. "An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision." (Bellman, 1957). Implications: the past is irrelevant for future optimal actions, and the "remainder" of the optimal path is also optimal (if ABCD is optimal for A to D, then BCD is also optimal for B to D). This allows backward reasoning!
Maximal reward from A to J by dynamic programming. The idea: start at the end; for each current state, be greedy with respect to the future, i.e., with respect to the maximal future value we would accumulate, and record the corresponding action chain. The slide first walks through the backward recursion in a minimizing (cost) variant with its own step costs, listing for every state x the value V(x) and the optimal path from x to J:

J: min{0} = 0, path J
H: min{3} = 3, path HJ
I: min{4} = 4, path IJ
E: min{1+V(H), 4+V(I)} = 4, path EHJ
F: min{6+V(H), 3+V(I)} = 7, path FIJ
G: min{3+V(H), 3+V(I)} = 6, path GHJ
B: min{7+V(E), 4+V(F), 6+V(G)} = 11, paths BEHJ, BFIJ
C: min{3+V(E), 2+V(F), 4+V(G)} = 7, path CEHJ
D: min{4+V(E), 1+V(F), 5+V(G)} = 8, paths DEHJ, DFIJ
A: min{2+V(B), 4+V(C), 3+V(D)} = 11, paths ACEHJ, ADEHJ, ADFIJ

For the reward-maximizing graph of Example 2, the same backward recursion reads:

J: max{0} = 0, path J (nothing to do in J; no future reward exists)
H: max{1} = 1, path HJ (only one choice in H)
I: max{1} = 1, path IJ (only one choice in I)
E: max{2+V(H), 4+V(I)} = 5, path EIJ
F: max{2+V(H), 4+V(I)} = 5, path FIJ
G: max{2+V(H), 4+V(I)} = 5, path GIJ
B: max{3+V(E), 2+V(F)} = 8, path BEIJ
C: max{3+V(E), 2+V(F), 9+V(G)} = 14, path CGIJ
D: max{2+V(F), 9+V(G)} = 14, path DGIJ
A: max{6+V(B), 1+V(C), 2+V(D)} = 16, path ADGIJ

Bellman: make use of the remainder of the optimal policy, which is itself optimal. The choice at every state is greedy, but greedy with respect to the total future reward.
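A minimal code sketch of this backward recursion, with the successors and step rewards read off the max{...} expressions in the table above; the dictionary encoding and the stage grouping are assumptions made for the example.

# Successors of each state and the reward collected when moving to them,
# read off the max{...} expressions in the table above.
succ = {
    "A": [("B", 6), ("C", 1), ("D", 2)],
    "B": [("E", 3), ("F", 2)],
    "C": [("E", 3), ("F", 2), ("G", 9)],
    "D": [("F", 2), ("G", 9)],
    "E": [("H", 2), ("I", 4)],
    "F": [("H", 2), ("I", 4)],
    "G": [("H", 2), ("I", 4)],
    "H": [("J", 1)],
    "I": [("J", 1)],
}

V, best_path = {"J": 0}, {"J": "J"}        # nothing to do in J: no future reward exists

# Start at the end and move backward stage by stage, greedy w.r.t. the future value.
for stage in (["H", "I"], ["E", "F", "G"], ["B", "C", "D"], ["A"]):
    for x in stage:
        val, nxt = max((r + V[n], n) for n, r in succ[x])
        V[x], best_path[x] = val, x + best_path[nxt]

print(V["A"], best_path["A"])              # 16 ADGIJ, as in the table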
In all subsequent (backward) steps we never have to consider H again, since E, F, and G all prefer I. The full backward pass again gives V(A) = max{6+V(B), 1+V(C), 2+V(D)} = 16 with optimal path ADGIJ.

Notation.
Optimal path: the best path to the end.
Optimal value function V(x): the reward (or cost) that would be accumulated when performing actions according to the optimal policy, until the end.
Reward (cost) per step R(x, u): it has a state part (x) and an action part (u), which we have ignored so far; this is the "immediate reward/cost".
Q-value Q(x, u): the reward (or cost) that would be accumulated for taking some action u and then following the optimal policy to the end. In x = A there are 3 Q-values, one for each possible action. Question: how do Q, R, and V relate? They are linked by

Q(x_k, u_k) = R(x_k, u_k) + V^{π*}(x_{k+1}),
V^{π*}(x_k) = max_u { Q(x_k, u_k) }.

Optimal policy π*(x): the function we are trying to find, i.e., the best action(s) to take when in state x. Question: what function is the optimal policy here? It is

π*(x_k) = argmax_u { Q(x_k, u_k) },

where k is the current (time) step. Questions?

Dynamic Programming, Part 3 (Mark Vlutters).

Questions to you. True or false? For a given state in a (constant) environment, there exists only one optimal action. True or false? When maximizing rewards, then given a state, the best action is the one with the highest associated Q-value.

Recap. The graph example had an explicit start (A) and end (J); the actions were forced to move right, could not stay in place, had to complete in 4 time steps, and were deterministic; and only state rewards were used.

Generalizing. Now there is no explicit start and end; actions are possible over whatever time horizon we can evaluate and are (for now still) deterministic; and the rewards (or costs) depend on both the state and the action.

Example 3: what are the best actions, for all states, all at once?
Simple MDP. Consider three states (x = 0, x = 1, x = 2) arranged in a ring, each with a state reward (or cost), and three actions (u0, ucw, uccw), possible in all states with no state-dependent restrictions, each with an action reward (or cost); here rewards are assumed. [figure: the three-state MDP with its state and action rewards] The immediate reward R(x, u) combines the state part and the action part; for example, R(x = 1, u = ucw) = 1 + 0 = 1.

Bellman equations. The MDP Bellman equations at some discrete (time) step k are:
Q-value: Q(x_k, u_k) = R(x_k, u_k) + V^{π*}(x_{k+1})
Optimal value: V^{π*}(x_k) = max_u { Q(x_k, u_k) }
Optimal policy: π*(x_k) = argmax_u { Q(x_k, u_k) }

Goal. Given all the information in the MDP, compute the optimal path(s) for all states simultaneously, over k time steps, and find the optimal policy: given any state, what should the agent do?

How? If the system dynamics with all (discrete) states and state transitions and all state and action rewards (or costs) are known, then for all states: start at the last time step with V(x_{k+1}) = 0 (no future reward exists); move back in time; use the known transitions, rewards, and V(x_{k+1}) to compute Q(x_k); use Q(x_k) to compute V(x_k) and π*(x_k); repeat for k time steps, or until the values converge.

For the simple MDP the immediate rewards are known. Note: angle brackets denote an array with 0-based indexing, not a mathematical matrix, and state names correspond to the index in the array; rows are the states x = 0, 1, 2 and columns are the actions u0, ucw, uccw:

R(x_k, u_k) = ⟨ 4 2 0 ; 8 1 2 ; 1 2 1 ⟩

Start with V(x_{k+1}) = ⟨ 0 ; 0 ; 0 ⟩. Using the known transitions, rewards, and V(x_{k+1}), compute Q(x_k, u_k) = R(x_k, u_k) + V(x_{k+1}); since V(x_{k+1}) = 0, this gives Q(x_k, u_k) = R(x_k, u_k) = ⟨ 4 2 0 ; 8 1 2 ; 1 2 1 ⟩. Then compute the new V(x) and π*(x), here using max and argmax over the actions: V(x_k) = ⟨ 4 ; 8 ; 1 ⟩ and π*(x_k) = ⟨ u0 ; u0 ; u0 ⟩ at k = N.

Rinse and repeat, except that V(x_{k+1}) is now no longer 0. To find the correct V we must know what state x_{k+1} results after taking action u_k in x_k. This is also known:

adj(x_k, u_k) = ⟨ 0 1 2 ; 1 2 0 ; 2 0 1 ⟩

For example, in x = 1, taking action ucw leads to x = 2. At k = N - 1 we again compute Q(x_k, u_k) = R(x_k, u_k) + V(x_{k+1}), now looking up V at the state adj(x_k, u_k); for x = 0 this gives the row Q = ⟨ 8 10 1 ⟩, and the same computation is carried out for the other states.
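A minimal Python sketch of this backward sweep for the simple MDP, using the R and adj arrays given above. The number of sweeps is an added assumption (the slides say to repeat for k time steps, or until it converges); without discounting, the values keep growing while the greedy policy settles.

# Backward dynamic-programming sweep for the simple 3-state MDP above.
# Rows: states x = 0, 1, 2.  Columns: actions u0, ucw, uccw (0-based arrays, as on the slides).
R   = [[4, 2, 0],
       [8, 1, 2],
       [1, 2, 1]]          # immediate reward R(x, u)
adj = [[0, 1, 2],
       [1, 2, 0],
       [2, 0, 1]]          # next state after taking action u in state x
actions = ["u0", "ucw", "uccw"]

V = [0, 0, 0]              # start at the last time step: no future reward exists

for step in range(5):      # number of backward steps: an assumption for the sketch
    Q = [[R[x][u] + V[adj[x][u]] for u in range(3)] for x in range(3)]
    V = [max(Q[x]) for x in range(3)]                               # optimal value per state
    policy = [actions[Q[x].index(max(Q[x]))] for x in range(3)]     # argmax per state
    print(step, V, policy)

# The first sweep gives V = [4, 8, 1] with policy [u0, u0, u0], exactly as above; from the
# second sweep onward the greedy policy settles to [ucw, u0, uccw], i.e., states 0 and 2
# move toward the high-reward state 1, which then stays put.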
66 k=N-1 Simple MDP Known: u0 ucw uccw x=1 0 1 2 x=0 ⟨ ⟩ adj( x k , u k ) = 1 2 0 2 0 1 x=1 x=2 uccw: 0 1 ucw: 1 u0: 7 ucw: 1 uccw: 0 Example: u0: 1 – In x=1 taking action ucw u0: 2 2 uccw: 0 ucw: 0 0 leads to x=2 x=0 x=2 67 Simple MDP Use known transitions, costs, and V(xk+1) to compute Q(xk) Q( x k , u k ) = R ( x k ,u k ) + V (x k +1 ) u0 ucw uccw 4 2 0 x=0 ⟨ ⟩ R (x k ,u k ) = 8 1 2 1 2 1 x=1 x=2 68 k=N-1 Simple MDP Use known transitions, costs, and V(xk+1) to compute Q(xk) Q( x k , u k ) = R ( x k ,u k ) + V (x k +1 ) u0 ucw uccw u0 ucw uccw 4 2 0 0 1 2 x=0 ⟨ ⟩ x=0 ⟨ ⟩ R (x k ,u k ) = 8 1 2 1 2 1 x=1 x=2 adj( x k , u k ) = 1 2 0 2 0 1 x=1 x=2 69 k=N-1 Simple MDP Use known transitions, costs, and V(xk+1) to compute Q(xk) Q( x k , u k ) = R ( x k ,u k ) + V (x k +1 ) u0 ucw uccw u0 ucw uccw 4 2 0 0 1 2 x=0 ⟨ ⟩ x=0 ⟨ ⟩ R (x k ,u k ) = 8 1 2 1 2 1 x=1 x=2 adj( x k , u k ) = 1 2 0 2 0 1 x=1 x=2 4 x=0 k=N-1 ⟨⟩ V (x k +1 ) = 8 1 x=1 x=2 70 Simple MDP Use known transitions, costs, and V(xk+1) to compute Q(xk) Q( x k , u k ) = R ( x k ,u k ) + V (x k +1 ) u0 ucw uccw u0 ucw uccw 4 2 0 0 1 2 x=0 ⟨ ⟩ x=0 ⟨ ⟩ R (x k ,u k ) = 8 1 2 1 2 1 x=1 x=2 adj( x k , u k ) = 1 2 0 u0 2 0 1 ucw uccw x=1 x=2 4 x=0. 10. x=0 k=N-1 ⟨⟩ V (x k +1 ) = 8 1 x=1 x=2 ⟨ ⟩ Q( x k , u k ) =...... x=1 x=2 71 Simple MDP Use known transitions, costs, and V(xk+1) to compute Q(xk) Q( x k , u k ) = R ( x k ,u k ) + V (x k +1 ) u0 ucw uccw u0 ucw uccw 4 2 0 0 1 2 x=0 ⟨ ⟩ x=0 ⟨ ⟩ R (x k ,u k ) = 8 1 2 1 2 1 x=1 x=2 adj( x k , u k ) = 1 2 0 u0 2 0 1 ucw uccw x=1 x=2 4 x=0 8 10 1 x=0 k=N-1 ⟨⟩ V (x k +1 ) = 8 1 x=1 x=2 ⟨