Reinforcement Learning
Summary
This document discusses reinforcement learning, a framework for problem solving and multi-step planning. It explores model-free strategies, which rely on cached knowledge of which actions have led to rewards, and model-based strategies, which involve explicitly planning actions to achieve goals. It also covers Q-learning and its application to tic-tac-toe, heuristic search, and the obstacles to using these strategies efficiently and effectively.
Full Transcript
Reminders: Sign in to AttendanceRadar. Quiz: Reinforcement Learning.

What kinds of strategies can people or AIs use to plan actions?
Ø Unsupervised learning: detecting patterns in the world without a specific goal
Ø Supervised learning: being taught the correct response to a stimulus
Ø What about learning to solve a problem or puzzle that requires making a sequence of actions?

“Problem Solving”
Ø In Cognitive Science, the mind “solving a problem” usually means that:
  Ø There are one or more “goal”/“reward” states that we want the world to be in
  Ø It will require multiple steps/actions to get from where we are to where we want to be
  Ø It isn’t obvious what the right next step is
Ø To decide on a next action, we need to either:
  Ø Use prior experience to choose an action that has tended to eventually get us to the goal (“model-free”)
  Ø Explicitly make a multi-step plan (“model-based”)

Reinforcement Learning (RL)
Ø Emerged in the 1970s as two lines of research merged:
  Ø Psychological theories of learning (classical conditioning)
  Ø Control theory (from mechanical engineering)
Ø Useful for thinking about an agent that is making repeated decisions in an environment to achieve goals
Ø RL algorithms can be practically useful (for AI systems) and also useful as explanations of human/animal behavior

[Figure: the current state of the world branches through possible actions (e.g., placing puzzle pieces in different spots, jumping left or right) into possible next states and future states of the world, each eventually yielding a reward. Goal: maximize the overall sum of rewards.]

Q(uality) Learning
Ø The Quality (Q) of an action is the sum of future rewards that we’ll get (on average) if we take that action in this state
Ø If we know the Q of every action in every state, we can make good decisions by just picking the action with the highest Quality
Ø One way to learn Q values is just through experience: when we’ve taken this action in this state in the past, how did things tend to turn out?

Tic-Tac-Toe
Ø What are the rewards? What are the states? What are the actions?
Ø Q for playing X in top-left on this board = how often have I won in the past starting with X in top-left?
Ø Q for playing X in center on this board = how often have I won in the past putting X in the center in this situation? [Figure: a partially played board with one X and one O]
Ø For state = starting board, which action has a higher Q value?

Creating a model
Ø This approach of just recording experiences to learn Q values is model-free – it doesn’t require us to know how our actions take us to new states
Ø But if we have a model of how our actions impact the state, we can make a plan for how to get into a “good” state rather than using trial and error

[Figure: two radio stations, ‘70s Gold Station and Today’s Hits, with a 20% and an 80% chance, respectively, of playing a song the listener likes.]

Model-free learner: Choosing the ‘70s Gold station turned out great, choose it again!
Model-based learner: I should choose Today’s Hits because that is more likely to play Taylor Swift, which is what I liked.
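To make the contrast concrete, here is a minimal Python sketch of the two learners in the radio-station example. This is illustrative code, not material from the lecture: the station names, the reward convention (1 when a liked song plays, 0 otherwise), and all function names are assumptions; only the 20%/80% play probabilities follow the figure above.

```python
import random

STATIONS = ["70s Gold", "Todays Hits"]

# --- Model-free: cache the average reward each action has produced so far ---
q_values = {s: 0.0 for s in STATIONS}   # cached Q estimates, one per action
counts = {s: 0 for s in STATIONS}

def model_free_choice():
    # Pick whichever action has worked out best in past experience.
    return max(STATIONS, key=lambda s: q_values[s])

def model_free_update(station, reward):
    # Incremental running average of the rewards observed for this action.
    counts[station] += 1
    q_values[station] += (reward - q_values[station]) / counts[station]

# --- Model-based: use a model of the world to predict outcomes before acting ---
# Hypothetical model: probability that each station plays a song the listener likes.
play_prob = {"70s Gold": 0.2, "Todays Hits": 0.8}

def model_based_choice():
    # Expected reward is P(liked song) * 1, so plan by maximizing it directly.
    return max(STATIONS, key=lambda s: play_prob[s])

# One simulated listening session with each strategy; the environment samples a song.
for pick in (model_free_choice(), model_based_choice()):
    reward = 1 if random.random() < play_prob[pick] else 0
    model_free_update(pick, reward)   # experience also refines the cached Q values
    print(pick, reward)
print(q_values)
```

The design point this sketch illustrates: the model-free learner only improves its cached Q values by actually trying stations, while the model-based learner can pick Today’s Hits immediately because its model already predicts the outcome.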
Ø Model-based systems can predict which actions are high-quality even for new states they have never experienced before
Ø Model-based systems can use new information to update their models and then update their plans

Ø Model-free: learn purely from experience which actions end up working out well
  Ø Doesn’t require knowing how states change over time or how actions impact states
  Ø Very fast to make decisions
Ø Model-based: use knowledge about the domain to mentally simulate possible action choices
  Ø Can estimate the quality of novel actions
  Ø Can flexibly update decisions if there is a change in the way the world works
Ø Most humans and AIs use heuristic search:
  Ø Use experience to guess which actions will have high quality for the current state (model-free)
  Ø Improve these guesses using some planning (model-based)

Solving real problems
Ø We have multiple strategies that will let us make optimal decisions!
Ø Why can these strategies be hard to use?
  Ø State spaces can be enormous and/or continuous
  Ø The set of actions can be enormous and/or continuous
  Ø Rewards may be many steps away
  Ø Learning Q values or a model of the world may require a huge number of attempts

Expertise in problem solving
Ø An expert (human or AI) in a domain is able to:
  Ø Identify the most important aspects of a state
  Ø Estimate Q without rolling out future possibilities (even for novel states)
  Ø Rely on cached, automatic sequences of actions rather than making conscious decisions

[Figure: experts showed a memory benefit only for real boards]

[Figure: an architecture that combines a model-free value estimate with some model-based lookahead. An “embedding” function extracts relevant features of the state, a value function estimates Q, and a dynamics function serves as an approximate model. A minimal code sketch of this idea appears after the reminders below.]

Summary
Ø We can use the framework of Reinforcement Learning to describe different strategies for problem solving and multi-step planning
Ø Model-free strategies: use cached knowledge about which actions led to goals/reward
Ø Model-based strategies: explicitly plan out the actions that will take you to goals/reward

Reminders
Ø Paper #2 due tomorrow night (11:59pm)
Ø Paper #3 proposal due Nov 26th, paper due Dec 9th
Ø Final:
  Ø Same format as the midterm
  Ø On the second half of the semester only
  Ø Choose between:
    Ø Monday Dec 9th, 2:40-3:55pm, in class (here, 304 Barnard Hall)
    Ø Wednesday Dec 18th, 1-4pm (here, 304 Barnard Hall)
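As noted above, here is a minimal sketch of the idea behind that combined architecture: a cached, model-free value estimate evaluated at the leaves of a short model-based lookahead. The toy domain, function names, and lookahead depth are illustrative assumptions, not material from the lecture.

```python
# Heuristic search sketch: model-based lookahead that bottoms out in a
# model-free (cached) value estimate.

def lookahead_value(state, actions, dynamics, value, depth):
    """Roll the model forward `depth` steps, then fall back on the cached value."""
    if depth == 0:
        return value(state)                      # model-free estimate at the frontier
    best = float("-inf")
    for a in actions:
        next_state, reward = dynamics(state, a)  # model-based: simulate the action
        best = max(best, reward + lookahead_value(next_state, actions, dynamics, value, depth - 1))
    return best

def choose_action(state, actions, dynamics, value, depth=2):
    """Pick the action whose simulated outcome looks best under the combined estimate."""
    def score(a):
        next_state, reward = dynamics(state, a)
        return reward + lookahead_value(next_state, actions, dynamics, value, depth - 1)
    return max(actions, key=score)

# Toy domain: states are positions on a line, actions step left (-1) or right (+1),
# and reaching position 3 yields a reward of 1.
def dynamics(state, action):
    pos = state + action
    return pos, (1.0 if pos == 3 else 0.0)

def value(state):
    return -abs(3 - state)   # hypothetical cached estimate: closer to the goal is better

print(choose_action(0, [-1, +1], dynamics, value, depth=2))   # prints 1 (step toward the goal)
```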