Self-Driving Cars Lecture 4 - Reinforcement Learning PDF
Document Details
Eberhard Karls Universität Tübingen
Andreas Geiger
Summary
This document is a lecture on reinforcement learning and its application to self-driving. It introduces Markov decision processes, value functions and Q-learning, and extends them to deep Q-learning and deep deterministic policy gradients, with examples ranging from cart-pole balancing and robot locomotion to Atari games and lane following. It also discusses the exploration-exploitation trade-off and techniques such as experience replay and target networks.
Full Transcript
Self-Driving Cars Lecture 4 – Reinforcement Learning
Prof. Dr.-Ing. Andreas Geiger, Autonomous Vision Group, University of Tübingen / MPI-IS

Agenda
- 4.1 Markov Decision Processes
- 4.2 Bellman Optimality and Q-Learning
- 4.3 Deep Q-Learning

4.1 Markov Decision Processes

Reinforcement Learning
So far:
- Supervised learning, lots of expert demonstrations required
- Use of auxiliary, short-term loss functions
- Imitation learning: per-frame loss on the action
- Direct perception: per-frame loss on affordance indicators
Now:
- Learning of models based on the loss that we actually care about, e.g.:
  - Minimize time to the target location
  - Minimize the number of collisions
  - Minimize risk
  - Maximize comfort
  - etc.

Types of Learning
Supervised Learning:
- Dataset: {(x_i, y_i)} (x_i = data, y_i = label). Goal: learn the mapping x ↦ y
- Examples: classification, regression, imitation learning, affordance learning, etc.
Unsupervised Learning:
- Dataset: {(x_i)} (x_i = data). Goal: discover the structure underlying the data
- Examples: clustering, dimensionality reduction, feature learning, etc.
Reinforcement Learning:
- An agent interacts with an environment which provides numeric reward signals
- Goal: learn how to take actions in order to maximize reward
- Examples: learning of manipulation or control tasks (everything that interacts)

Introduction to Reinforcement Learning
[Figure: agent-environment loop with state s_t, action a_t, reward r_t and next state s_{t+1}]
- The agent observes the environment state s_t at time t
- The agent sends action a_t at time t to the environment
- The environment returns the reward r_t and its new state s_{t+1} to the agent

Introduction to Reinforcement Learning
- Goal: select actions to maximize the total future reward
- Actions may have long-term consequences
- Reward may be delayed, not instantaneous
- It may be better to sacrifice immediate reward to gain more long-term reward
- Examples:
  - Financial investment (may take months to mature)
  - Refuelling a helicopter (might prevent a crash in several hours)
  - Sacrificing a chess piece (might improve winning chances in the future)
Sutton and Barto: Reinforcement Learning: An Introduction. MIT Press, 2017.

Example: Cart Pole Balancing
- Objective: balance a pole on a moving cart
- State: angle, angular velocity, position, velocity
- Action: horizontal force applied to the cart
- Reward: 1 if the pole is upright at time t
https://gym.openai.com/envs/#classic_control
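The agent-environment loop above maps directly onto the Gym interface used by most of the examples that follow. Here is a minimal sketch of one cart-pole episode, assuming the `gymnasium` package (the maintained successor of OpenAI Gym) is installed; the random action choice is only a placeholder for a learned policy:

```python
import gymnasium as gym

# Cart-pole environment: state = (position, velocity, angle, angular velocity)
env = gym.make("CartPole-v1")

obs, info = env.reset(seed=0)
total_reward = 0.0

for t in range(500):
    action = env.action_space.sample()                        # placeholder policy: random action a_t
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward                                    # reward is 1 per step the pole stays upright
    if terminated or truncated:                               # pole fell over or time limit reached
        break

print(f"Episode return: {total_reward}")
env.close()
```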
Example: Robot Locomotion
- Objective: make the robot move forward
- State: position and angle of the joints
- Action: torques applied to the joints
- Reward: 1 if upright and moving forward
http://blog.openai.com/roboschool/
https://gym.openai.com/envs/#mujoco

Example: Atari Games
- Objective: maximize the game score
- State: raw pixels of the screen (210×160)
- Action: left, right, up, down
- Reward: score increase/decrease at time t
http://blog.openai.com/gym-retro/
https://gym.openai.com/envs/#atari

Example: Go
- Objective: win the game
- State: position of all pieces
- Action: location of the next piece
- Reward: 1 if the game is won, 0 otherwise
www.deepmind.com/research/alphago/

Example: Self-Driving
- Objective: lane following
- State: image (96×96)
- Action: acceleration, steering
- Reward: negative per frame, positive per visited tile
https://gym.openai.com/envs/CarRacing-v0/

Reinforcement Learning: Overview
How can we mathematically formalize the RL problem?

Markov Decision Process
A Markov Decision Process (MDP) models the environment and is defined by the tuple (S, A, R, P, γ) with
- S: set of possible states
- A: set of possible actions
- R(r_t | s_t, a_t): distribution of the current reward given a (state, action) pair
- P(s_{t+1} | s_t, a_t): distribution over the next state given a (state, action) pair
- γ: discount factor (determines the value of future rewards)
Almost all reinforcement learning problems can be formalized as MDPs.

Markov Decision Process
Markov property: the current state completely characterizes the state of the world.
- A state s_t is Markov if and only if P(s_{t+1} | s_t) = P(s_{t+1} | s_1, ..., s_t)
- "The future is independent of the past given the present"
- The state captures all relevant information from the history
- Once the state is known, the history may be thrown away
- The state is a sufficient statistic of the future

Markov Decision Process
Reinforcement learning loop (a small code sketch follows after the policy slides below):
- At time t = 0: the environment samples the initial state s_0 ∼ P(s_0)
- Then, for t = 0 until done:
  - The agent selects action a_t
  - The environment samples the reward r_t ∼ R(r_t | s_t, a_t)
  - The environment samples the next state s_{t+1} ∼ P(s_{t+1} | s_t, a_t)
  - The agent receives the reward r_t and the next state s_{t+1}
How do we select an action?

Policy
A policy π is a function from S to A that specifies which action to take in each state:
- A policy fully defines the behavior of an agent
- Deterministic policy: a = π(s)
- Stochastic policy: π(a|s) = P(a_t = a | s_t = s)
Remark:
- MDP policies depend only on the current state and not on the entire history
- However, the current state may include past observations

Policy
How do we learn a policy?
Imitation Learning: learn a policy from expert demonstrations
- Expert demonstrations are provided
- Supervised learning problem
Reinforcement Learning: learn a policy through trial and error
- No expert demonstrations are given
- The agent discovers by itself which actions maximize the expected future reward
- The agent interacts with the environment and obtains reward
- The agent discovers good actions and improves its policy π
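To make the MDP ingredients and the sampling loop above concrete, here is a minimal sketch on a hypothetical two-state MDP with a uniform stochastic policy; the state names, transition probabilities and rewards are invented purely for illustration:

```python
import random

# Hypothetical toy MDP (all names and numbers invented for illustration)
states = ["on_road", "off_road"]
actions = ["steer_left", "steer_right"]

# P(s_{t+1} | s_t, a_t): next-state distribution for every (state, action) pair
P = {
    ("on_road", "steer_left"):   {"on_road": 0.9, "off_road": 0.1},
    ("on_road", "steer_right"):  {"on_road": 0.7, "off_road": 0.3},
    ("off_road", "steer_left"):  {"on_road": 0.5, "off_road": 0.5},
    ("off_road", "steer_right"): {"on_road": 0.2, "off_road": 0.8},
}
# Reward depends only on the next state here (a special case of R(r_t | s_t, a_t))
R = {"on_road": 1.0, "off_road": -1.0}

def policy(s):
    """Uniform stochastic policy pi(a|s)."""
    return random.choice(actions)

s = "on_road"                 # s_0 (sampled from P(s_0) in general, fixed here)
gamma, ret = 0.9, 0.0
for t in range(20):
    a = policy(s)                                                          # agent selects a_t
    probs = P[(s, a)]
    s_next = random.choices(list(probs), weights=list(probs.values()))[0]  # s_{t+1} ~ P(.|s_t, a_t)
    r = R[s_next]                                                          # r_t
    ret += (gamma ** t) * r                                                # accumulate discounted return
    s = s_next

print("Discounted return of the random policy:", round(ret, 3))
```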
Exploration vs. Exploitation
How do we discover good actions? Answer: we need to explore the state/action space.
Thus RL combines two tasks:
- Exploration: try a novel action a in state s and observe the reward r_t
  - Discovers more information about the environment, but sacrifices total reward
  - Game-playing example: play a novel, experimental move
- Exploitation: use a previously discovered good action a
  - Exploits known information to maximize reward, but sacrifices unexplored areas
  - Game-playing example: play the move you believe is best
Trade-off: it is important to explore and exploit simultaneously.

Exploration vs. Exploitation
How to balance exploration and exploitation? The ε-greedy exploration algorithm:
- Try all possible actions with non-zero probability
- With probability ε choose an action at random (exploration)
- With probability 1 − ε choose the best action (exploitation)
- The greedy action is defined as the best action discovered so far
- ε is large initially and is gradually annealed (i.e., reduced) over time

4.2 Bellman Optimality and Q-Learning

Value Functions
How good is a state? The state-value function V^π at state s_t is the expected cumulative discounted reward (with r_t ∼ R(r_t | s_t, a_t)) when following policy π from state s_t:

V^π(s_t) = E[ r_t + γ r_{t+1} + γ² r_{t+2} + ... | s_t, π ] = E[ Σ_{k≥0} γ^k r_{t+k} | s_t, π ]

- The discount factor γ < 1 is the value of future rewards at the current time t
- It weights immediate reward higher than future reward (e.g., γ = 1/2 ⇒ γ^k = 1, 1/2, 1/4, 1/8, 1/16, ...)
- It determines the agent's far-/short-sightedness
- It avoids infinite returns in cyclic Markov processes

Value Functions
How good is a state-action pair? The action-value function Q^π at state s_t and action a_t is the expected cumulative discounted reward when taking action a_t in state s_t and then following policy π:

Q^π(s_t, a_t) = E[ Σ_{k≥0} γ^k r_{t+k} | s_t, a_t, π ]

- The discount factor γ ∈ [0, 1] is the value of future rewards at the current time t
- It weights immediate reward higher than future reward (e.g., γ = 1/2 ⇒ γ^k = 1, 1/2, 1/4, 1/8, 1/16, ...)
- It determines the agent's far-/short-sightedness
- It avoids infinite returns in cyclic Markov processes

Optimal Value Functions
The optimal state-value function V*(s_t) is the best V^π(s_t) over all policies π:

V*(s_t) = max_π V^π(s_t),  where V^π(s_t) = E[ Σ_{k≥0} γ^k r_{t+k} | s_t, π ]

The optimal action-value function Q*(s_t, a_t) is the best Q^π(s_t, a_t) over all policies π:

Q*(s_t, a_t) = max_π Q^π(s_t, a_t),  where Q^π(s_t, a_t) = E[ Σ_{k≥0} γ^k r_{t+k} | s_t, a_t, π ]

- The optimal value functions specify the best possible performance in the MDP
- However, searching over all possible policies π is computationally intractable

Optimal Policy
If Q*(s_t, a_t) were known, what would be the optimal policy?

π*(s_t) = argmax_{a′ ∈ A} Q*(s_t, a′)

- Unfortunately, searching over all possible policies π is intractable in most cases
- Thus, determining Q*(s_t, a_t) is hard in general (for most interesting problems)
- Let's have a look at a simple example where the optimal policy is easy to compute
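Combining the ε-greedy rule from above with a Q table gives the action-selection routine used later in tabular Q-Learning: with probability ε act at random, otherwise take the greedy action argmax_a Q(s, a). A minimal sketch; the Q-table entries and the annealing schedule are arbitrary illustrative choices:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """Return a random action with probability epsilon, else the greedy action argmax_a Q[(state, a)]."""
    if random.random() < epsilon:
        return random.choice(actions)                       # exploration
    return max(actions, key=lambda a: Q[(state, a)])        # exploitation (greedy action)

# Arbitrary illustrative Q table over two states and two actions
Q = {("s0", "left"): 0.1, ("s0", "right"): 0.4,
     ("s1", "left"): 0.7, ("s1", "right"): 0.2}

epsilon = 1.0                                               # large initially ...
for step in range(5):
    a = epsilon_greedy(Q, "s0", ["left", "right"], epsilon)
    print(step, a, round(epsilon, 2))
    epsilon = max(0.05, epsilon * 0.9)                      # ... and gradually annealed over time
```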
A Simple Grid World Example
[Figure: grid of states; the terminal states are the marked cells]
- Actions: 1. right, 2. left, 3. up, 4. down
- Reward: r = −1 for each transition
Objective: reach one of the terminal states (the marked cells) in the least number of actions.
- A penalty (negative reward) is given for every transition made.

A Simple Grid World Example
[Figure: random policy vs. optimal policy on the grid; for the random policy, the arrows indicate equal probability of moving in each of the directions]

Solving for the Optimal Policy: Bellman Optimality Equation
- The Bellman Optimality Equation is named after Richard Ernest Bellman, who introduced dynamic programming in 1953
- Almost any problem that can be solved using optimal control theory can also be solved via the appropriate Bellman equation
Sutton and Barto: Reinforcement Learning: An Introduction. MIT Press, 2017.

Bellman Optimality Equation
The Bellman Optimality Equation (BOE) decomposes Q* as follows:

Q*(s_t, a_t) = E[ r_t + γ r_{t+1} + γ² r_{t+2} + ... | s_t, a_t ]
             = E[ r_t + γ max_{a′ ∈ A} Q*(s_{t+1}, a′) | s_t, a_t ]   (BOE)

This recursive formulation comprises two parts:
- the current reward r_t
- the discounted optimal action-value of the successor state, γ max_{a′ ∈ A} Q*(s_{t+1}, a′)
We want to determine Q*(s_t, a_t). How can we solve the BOE?
- The BOE is non-linear (because of the max operator) ⇒ no closed-form solution
- Several iterative methods have been proposed; the most popular is Q-Learning

Proof of the Bellman Optimality Equation
Proof of the Bellman Optimality Equation for the optimal action-value function Q*:

Q*(s_t, a_t) = E[ r_t + γ r_{t+1} + γ² r_{t+2} + ... | s_t, a_t ]
             = E[ Σ_{k≥0} γ^k r_{t+k} | s_t, a_t ]
             = E[ r_t + γ Σ_{k≥0} γ^k r_{t+k+1} | s_t, a_t ]
             = E[ r_t + γ V*(s_{t+1}) | s_t, a_t ]
             = E[ r_t + γ max_{a′} Q*(s_{t+1}, a′) | s_t, a_t ]

Bellman Optimality Equation
Why is it useful to solve the BOE?
- A greedy policy which chooses the action that maximizes the optimal action-value function Q* or the optimal state-value function V* takes into account the reward consequences of all possible future behavior
- Via Q* and V*, the optimal expected long-term return is turned into a quantity that is locally and immediately available for each state / state-action pair
- For V*, a one-step-ahead search yields the optimal actions
- Q* effectively caches the results of all one-step-ahead searches
Sutton and Barto: Reinforcement Learning: An Introduction. MIT Press, 2017.

Q-Learning
Q-Learning iteratively solves for Q*,

Q*(s_t, a_t) = E[ r_t + γ max_{a′ ∈ A} Q*(s_{t+1}, a′) | s_t, a_t ],

by constructing an update sequence Q_1, Q_2, ... using a learning rate α:

Q_{i+1}(s_t, a_t) ← (1 − α) Q_i(s_t, a_t) + α ( r_t + γ max_{a′ ∈ A} Q_i(s_{t+1}, a′) )
                  = Q_i(s_t, a_t) + α ( r_t + γ max_{a′ ∈ A} Q_i(s_{t+1}, a′) − Q_i(s_t, a_t) )

where r_t + γ max_{a′} Q_i(s_{t+1}, a′) is the target, Q_i(s_t, a_t) is the prediction, and their difference is the temporal difference (TD) error.
- Q_i converges to Q* as i → ∞
Note: the policy π is learned implicitly via the Q table!

Q-Learning
Implementation (see the sketch below):
- Initialize the Q table and the initial state s_0 randomly
- Repeat:
  - Observe state s_t and choose action a_t according to the ε-greedy strategy (Q-Learning is "off-policy", as the updated policy is different from the behavior policy)
  - Observe the reward r_t and the next state s_{t+1}
  - Compute the TD error: r_t + γ max_{a′ ∈ A} Q_i(s_{t+1}, a′) − Q_i(s_t, a_t)
  - Update the Q table
What is the problem with using Q tables?
- Scalability: tables do not scale to high-dimensional state/action spaces (e.g., Go)
- Solution: use a function approximator (a neural network) to represent Q(s, a)
Watkins and Dayan: Technical Note: Q-Learning. Machine Learning, 1992.
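Putting the ε-greedy behavior policy and the TD update together, here is a minimal sketch of one tabular Q-Learning step. It assumes an environment object with the Gym-style `step` interface used earlier and a discrete action list; all hyperparameter values are illustrative:

```python
from collections import defaultdict
import random

alpha, gamma, epsilon = 0.1, 0.99, 0.1
Q = defaultdict(float)                                  # Q table: Q[(state, action)], initialized to 0

def q_learning_step(env, s, actions):
    """One interaction with the environment followed by a single Q-table update."""
    # epsilon-greedy behavior policy (off-policy: the target below is greedy w.r.t. Q)
    if random.random() < epsilon:
        a = random.choice(actions)
    else:
        a = max(actions, key=lambda x: Q[(s, x)])

    s_next, r, terminated, truncated, _ = env.step(a)   # Gym-style step, assumed available

    # TD target: bootstrap with the greedy value of the next state (no bootstrap at episode end)
    done = terminated or truncated
    target = r if done else r + gamma * max(Q[(s_next, x)] for x in actions)
    td_error = target - Q[(s, a)]                        # temporal difference error
    Q[(s, a)] += alpha * td_error                        # Q <- Q + alpha * TD error
    return s_next, done
```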
4.3 Deep Q-Learning

Deep Q-Learning
Use a deep neural network with weights θ as a function approximator to estimate Q:

Q(s, a; θ) ≈ Q*(s, a)

[Figure: two network variants — one maps a state-action pair (s, a) to a single value Q(s, a; θ); the other maps a state s to the values Q(s, a_1; θ), ..., Q(s, a_m; θ) of all m actions]
Mnih et al.: Human-level control through deep reinforcement learning. Nature, 2015.

Training the Q Network
Forward pass: the loss function is the mean squared error in Q-values,

L(θ) = E[ ( r_t + γ max_{a′} Q(s_{t+1}, a′; θ) − Q(s_t, a_t; θ) )² ],

where r_t + γ max_{a′} Q(s_{t+1}, a′; θ) is the target and Q(s_t, a_t; θ) is the prediction.
Backward pass: gradient update with respect to the Q-function parameters θ:

∇_θ L(θ) = ∇_θ E[ ( r_t + γ max_{a′} Q(s_{t+1}, a′; θ) − Q(s_t, a_t; θ) )² ]

Optimize the objective end-to-end with stochastic gradient descent (SGD) using ∇_θ L(θ).

Experience Replay
To speed up training we would like to train on mini-batches:
- Problem: learning from consecutive samples is inefficient
- Reason: strong correlations between consecutive samples
Experience replay stores the agent's experiences at each time step:
- Continually update a replay memory D with new experiences e_t = (s_t, a_t, r_t, s_{t+1})
- Train on samples (s_t, a_t, r_t, s_{t+1}) ∼ U(D) drawn uniformly at random from D
- Breaks the correlations between samples
- Improves data efficiency, as each sample can be used multiple times
In practice, a circular replay memory of finite size is used.

Fixed Q Targets
Problem: non-stationary targets
- As the policy changes, so do our targets r_t + γ max_{a′} Q(s_{t+1}, a′; θ)
- This may lead to oscillation or divergence
Solution: use fixed Q targets to stabilize training
- A target network Q with weights θ⁻ is used to generate the targets:

L(θ) = E_{(s_t, a_t, r_t, s_{t+1}) ∼ U(D)}[ ( r_t + γ max_{a′} Q(s_{t+1}, a′; θ⁻) − Q(s_t, a_t; θ) )² ]

- The target network is only updated every C steps by cloning the Q-network
- Effect: reduces oscillation of the policy by adding a delay

Putting It Together
Deep Q-Learning using experience replay and fixed Q targets (a code sketch follows after the case study below):
- Take action a_t according to the ε-greedy policy
- Store the transition (s_t, a_t, r_t, s_{t+1}) in the replay memory D
- Sample a random mini-batch of transitions (s_t, a_t, r_t, s_{t+1}) from D
- Compute the Q targets using the old parameters θ⁻
- Optimize the MSE between the Q targets and the Q-network predictions,

L(θ) = E_{(s_t, a_t, r_t, s_{t+1}) ∼ D}[ ( r_t + γ max_{a′} Q(s_{t+1}, a′; θ⁻) − Q(s_t, a_t; θ) )² ],

using stochastic gradient descent.

Case Study: Playing Atari Games
[Figure: the agent interacting with several Atari games as environments]
Objective: complete the game with the highest score.

Case Study: Playing Atari Games
Q(s, a; θ): a neural network with weights θ
- Input: an 84 × 84 × 4 stack of the last 4 frames (after grayscale conversion, downsampling and cropping)
- Layers: 16 8×8 conv filters (stride 2) → 32 4×4 conv filters (stride 2) → FC-256 → FC-Out (Q values)
- Output: Q values for all (4 to 18) Atari actions (efficient: a single forward pass computes Q for all actions)

Case Study: Playing Atari Games
[Figure: results across Atari games]
Mnih et al.: Human-level control through deep reinforcement learning. Nature, 2015.
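The sketch below puts the pieces together: a convolutional Q-network with the layer sizes listed on the slide and one mini-batch update against fixed targets from a target network. It assumes PyTorch; the replay memory, ε-greedy acting and the periodic cloning of θ⁻ are left out, and the batch is assumed to arrive as tensors (with `done` as a 0/1 float mask):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """84x84x4 frame stack -> Q values for all actions (layer sizes as listed on the slide)."""
    def __init__(self, num_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 18 * 18, 256), nn.ReLU(),   # 18x18 is the spatial size after the two convs
            nn.Linear(256, num_actions),               # one Q value per action in a single forward pass
        )

    def forward(self, s):
        return self.net(s)

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One mini-batch update with fixed Q targets generated by the target network (weights theta^-)."""
    s, a, r, s_next, done = batch                                   # sampled uniformly from replay memory D
    q_pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)          # prediction Q(s_t, a_t; theta)
    with torch.no_grad():                                           # targets do not propagate gradients
        q_target = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    loss = nn.functional.mse_loss(q_pred, q_target)                 # MSE between prediction and target
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```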
Deep Q-Learning Shortcomings
Deep Q-Learning suffers from several shortcomings:
- Long training times
- Uniform sampling from the replay buffer ⇒ all transitions are treated as equally important
- Simplistic exploration strategy
- The action space is limited to a discrete set of actions (otherwise, expensive test-time optimization is required)
Various improvements over the original algorithm have been explored.

Deep Deterministic Policy Gradients
DDPG addresses the problem of continuous action spaces.
Problem: finding a continuous action requires optimization at every time step.
Solution: use two networks, an actor (deterministic policy) and a critic.
[Figure: the actor µ(s; θ^µ) maps the state s to an action a = µ(s; θ^µ); the critic Q(s, a; θ^Q) maps the state-action pair to a value]

Deep Deterministic Policy Gradients
- The actor network with weights θ^µ estimates the agent's deterministic policy µ(s; θ^µ)
- The deterministic policy µ(·) is updated in the direction that most improves Q
- Apply the chain rule to the expected return (this is the policy gradient):

∇_{θ^µ} E_{(s_t, a_t, r_t, s_{t+1}) ∼ D}[ Q(s_t, µ(s_t; θ^µ); θ^Q) ] = E[ ∇_{a_t} Q(s_t, a_t; θ^Q) ∇_{θ^µ} µ(s_t; θ^µ) ]

- The critic estimates the value of the current policy, Q(s, a; θ^Q)
- It is learned using the Bellman Optimality Equation, as in Q-Learning:

∇_{θ^Q} E_{(s_t, a_t, r_t, s_{t+1}) ∼ D}[ ( r_t + γ Q(s_{t+1}, µ(s_{t+1}; θ^{µ⁻}); θ^{Q⁻}) − Q(s_t, a_t; θ^Q) )² ]

- Remark: no maximization over actions is required, as this step is now learned via µ(·)

Deep Deterministic Policy Gradients
Experience replay and target networks are again used to stabilize training:
- The replay memory D stores transition tuples (s_t, a_t, r_t, s_{t+1})
- The target networks are updated using "soft" target updates
- Weights are not copied directly but adapted slowly:

θ^{Q⁻} ← τ θ^Q + (1 − τ) θ^{Q⁻}
θ^{µ⁻} ← τ θ^µ + (1 − τ) θ^{µ⁻}

where 0 < τ ≪ 1 controls the trade-off between speed and stability of learning.
Exploration is performed by adding noise N to the policy: µ(s; θ^µ) + N.
Lillicrap et al.: Continuous Control with Deep Reinforcement Learning. ICLR, 2016.

Prioritized Experience Replay
Prioritize experience so that important transitions are replayed more frequently:
- The priority δ is measured by the magnitude of the temporal difference (TD) error,

δ = r_t + γ max_{a′} Q(s_{t+1}, a′; θ^{Q⁻}) − Q(s_t, a_t; θ^Q)

- The TD error measures how "surprising" or unexpected a transition is
- Stochastic prioritization avoids overfitting due to a lack of diversity
- Enables a learning speed-up by a factor of 2 on Atari benchmarks
Schaul et al.: Prioritized Experience Replay. ICLR, 2016.

Learning to Drive in a Day
Real-world RL demo by Wayve:
- Deep Deterministic Policy Gradients with Prioritized Experience Replay
- Input: a single monocular image
- Action: steering and speed
- Reward: distance traveled without the safety driver taking control (requires no maps / localization)
- 4 conv layers, 2 FC layers
- Only 35 training episodes
Kendall, Hawke, Janz, Mazur, Reda, Allen, Lam, Bewley and Shah: Learning to Drive in a Day. ICRA, 2019.
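Returning to the DDPG-style soft target updates described above (θ⁻ ← τθ + (1 − τ)θ⁻), here is a minimal PyTorch sketch; the actor and critic networks themselves are assumed to be defined elsewhere:

```python
import torch

@torch.no_grad()
def soft_update(target_net, online_net, tau=0.001):
    """theta_target <- tau * theta_online + (1 - tau) * theta_target, applied parameter-wise."""
    for t_param, o_param in zip(target_net.parameters(), online_net.parameters()):
        t_param.mul_(1.0 - tau).add_(tau * o_param)

# Called after every gradient step on the online networks (networks assumed to exist):
# soft_update(critic_target, critic, tau=0.001)
# soft_update(actor_target, actor, tau=0.001)
```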
Other Flavors of Deep RL

Asynchronous Deep Reinforcement Learning
Execute multiple agents in separate environment instances:
- Each agent interacts with its own copy of the environment and collects experience
- Agents may use different exploration policies to maximize experience diversity
- Experience is not stored but used directly to update a shared global model
- Stabilizes training in a similar way to experience replay by decorrelating samples
- Leads to a reduction in training time that is roughly linear in the number of parallel agents
Mnih et al.: Asynchronous Methods for Deep Reinforcement Learning. ICML, 2016.

Bootstrapped DQN
Bootstrapping for efficient exploration:
- Approximate a distribution over Q values via K bootstrapped "heads"
- At the start of each episode, a single head Q_k is selected uniformly at random
- After training, all heads can be combined into a single ensemble policy
[Figure: a shared network body θ_shared feeding K separate heads Q_1, ..., Q_K with weights θ^{Q_1}, ..., θ^{Q_K}]
Osband et al.: Deep Exploration via Bootstrapped DQN. NIPS, 2016.

Double Q-Learning
- Decouple the Q function used for selection from the one used for evaluation of actions to avoid overestimation of Q and stabilize training.
Targets:

DQN:        r_t + γ max_{a′} Q(s_{t+1}, a′; θ⁻)
Double DQN: r_t + γ Q(s_{t+1}, argmax_{a′} Q(s_{t+1}, a′; θ); θ⁻)

- The online network with weights θ is used to determine the greedy policy
- The target network with weights θ⁻ is used to determine the corresponding action value
- Improves performance on Atari benchmarks
van Hasselt et al.: Deep Reinforcement Learning with Double Q-learning. AAAI, 2016.

Deep Recurrent Q-Learning
Add recurrency to a deep Q-network to handle partial observability of states:
- The fully-connected layer is replaced with a recurrent LSTM layer
- Layers: 16 8×8 conv filters (stride 2) → 32 4×4 conv filters (stride 2) → LSTM → FC-Out (Q values)
Hausknecht and Stone: Deep Recurrent Q-Learning for Partially Observable MDPs. AAAI, 2015.

Faulty Reward Functions
https://blog.openai.com/faulty-reward-functions/

Summary
- Reinforcement learning learns through interaction with the environment
- The environment is typically modeled as a Markov Decision Process
- The goal of RL is to maximize the expected future reward
- Reinforcement learning requires trading off exploration and exploitation
- Q-Learning iteratively solves for the optimal action-value function
- The policy is learned implicitly via the Q table
- Deep Q-Learning scales to continuous/high-dimensional state spaces
- Deep Deterministic Policy Gradients scales to continuous action spaces
- Experience replay and target networks are necessary to stabilize training