Introduction to Cognitive Science
Columbia University
Summary
This document contains lecture notes from an introduction to cognitive science course. It covers different types of learning, including unsupervised, supervised, and reinforcement learning, and introduces the concept of planning actions.
Full Transcript
[Auto-generated transcript. Edits may have been applied for clarity.] All right, we're going to get started for today. Over the last few lectures, we've been talking about ways that people make decisions and act in the world: how you might make judgments by weighing different probabilistic signals, or how you might take actions in the physical world. Today we're going to talk about another area of study within cognitive science, which is how people make plans of action. Specifically, I'll be talking about a framework called reinforcement learning, which is about how people make multi-step plans for the future.

So far in this class we've talked about unsupervised learning. Unsupervised learning covers cases where there's no particular supervision, meaning the cognitive agent isn't being told what goals it's supposed to have for its perceptions or its actions. Even without a given goal, you can still learn: you can try to detect interesting patterns in your environment. We've also talked about supervised learning. These are cases where you get some kind of feedback signal: you make a decision and you immediately find out whether it was correct or incorrect. For example, if you're building a system in your brain that can detect a certain kind of object, you might be getting feedback right away about whether you're detecting that object well or not. Supervised learning is almost always one-step learning: you make a decision, you immediately find out if it was right or wrong, and you can make some change to the system to do better in the future.

What we're going to talk about today are cases where we're trying to learn how to solve some kind of problem or puzzle, and it requires making a sequence of actions. The important thing for reinforcement learning is that we often don't get feedback right away. We don't get feedback for every action we take; we might not find out until much later whether our choices were right or wrong. When we talk about problem solving in cognitive science, this is usually the kind of thing we mean. Saying that the mind, or a cognitive agent, is trying to solve a problem means that the agent is trying to get the world into a certain state. There's some goal state, or maybe multiple states, carrying some kind of reward that we're trying to reach, and those rewards are not immediately accessible to us; they're multiple steps into the future. To get those rewards, we have to carry out a sequence of actions, and we might not reach the goal until we get to the end of that chain. The challenge is that it's not obvious what the right next step is, and we're not getting supervision telling us whether our first step was correct. When we take that first step, we don't actually know if it was right until we get to the end.

In this kind of framework, there are two general categories of strategies you could use, and we'll talk today about what they mean. One category is that, based on your prior experience, you try to pick the next step that tended to work out well for you in the past.
This is called a model-free strategy, and we'll talk about what that means. You don't have any specific plan about how your action is going to impact the world; you just have prior experience that, when you were faced with this kind of problem in the past, these are the actions that tended to work out well and these are the actions that tended not to. The alternative is that you could actually make a multi-step plan, if you have some kind of model of the situation you're in: some understanding of how the world works in this context, and some idea of what will happen when you take actions. Then you can make plans even if you've never experienced this particular situation before. This kind of model-based approach is basically thinking forward into the future: if I take this step, and then this step, and then this step, I think that will get me where I need to go. Both of these kinds of strategies fall under the heading of reinforcement learning.

Reinforcement learning emerged as a topic of study both from psychology, where people were thinking about how behaviors, including conditioned behaviors, are learned, and from engineering, where people were thinking about how to control complicated systems. A lot of it came out of problems like controlling power plants, where you have to constantly take actions as the controller of a complicated system, and it's not obvious whether you're making the right decisions. In cognitive science, we usually use this toolbox for thinking about how an agent learns to make better decisions and better plans. There's a big section of reinforcement learning that is really about practical applications, people interested in building machine learning models that implement reinforcement learning, but here we're mainly interested in reinforcement learning as a model of how human behavior, or other kinds of cognitive behavior, might work. We're thinking of these frameworks as strategies that people or other cognitive agents might use to solve problems in their environment.

First I'm going to go through the terminology of how we frame problems in this reinforcement learning format. Imagine we have some puzzle we're trying to solve. Here is one of those tangram puzzles where you're given differently colored pieces, and your goal is to figure out a way to move the pieces around to fit them into a particular silhouette. A lot of you have probably seen these; they're sold as kids' puzzles, although some of them can be pretty hard for adults as well. The idea is that we want some kind of plan for moving these pieces into the silhouette so that it looks like this house. We don't have to use all the pieces. This has the setup of a reinforcement learning problem: we have some goal we're trying to reach, and we know roughly what it will look like when we meet it. When the pieces are arranged so that they make the shape of the house, we've achieved our goal. But it's really not obvious what the first step should be. So the idea is that we have some current state of the world.
We start having put no pieces on the board yet; we just have the silhouette we're trying to fill. The state of the world describes what is currently happening: where things are, the current state of affairs around us. Then we can think about different actions we could take. One option is to take the green square and stick it down in the corner. That's a possible action, and if we took it, it would lead to the next state where the green piece is on top of the house. Another action is to take the orange piece and stick it over here, or to take the reddish piece and put it in the middle. So from our current state, we have actions that could take us to any of these next states. None of these next states is the goal state, so there's no way to solve this puzzle in one move, and that's fine. If we take an action, it brings us to a new state; from that new state we can take more actions, and more actions from there. Our hope is that there's some chain of actions that eventually leads us to a solution. Here's one solution to this problem: a sequence of piece placements that gets us to that final state. In this kind of problem solving, the only reward is basically at the end: getting to the goal state is good, and not getting there is bad. We're not getting any feedback in the middle about whether the pieces are in the right spots. So when we're coming up with a strategy for these kinds of problems, we need to think about which moves are going to get us to the goal, and we'll talk about different strategies for doing that.

We often use a spatial analogy here. The problem isn't actually unfolding in space; the picture is just saying, here is the state of the world right now, and here are different worlds we could get to. But you can think of it like a map: this is the world we're in right now, and we can take actions to get to these other possible states of the world. We're trying to find a path through these states to the goal we want. In this particular example, the reward comes only at the very end, when the pieces line up in the shape of a house. But you could also imagine cases where you get rewards along the way. Take a video game. The current state of the world is basically whatever is on the screen: where is your character, what items do you have, and so on. Then you can take different actions. This is the very beginning of Super Mario Bros., and you could choose to jump; jumping here would get you a coin, which would be good, because you'd get some reward, some points, right away. You could move to the left, which would change the world by moving your character to the left. Or you could move to the right, which would be a bad decision: you're going to lose a life because you're going to get bitten. So there are different choices we could make, and we could try to pick the next choice that gives us the biggest reward.
But that's not always the best idea. In this particular case, if you try to grab the coin right away, you end up landing right on the Goomba and getting eaten. The best action here is actually not to be greedy and grab rewards right away; you want some kind of plan, a sequence of moves that both gets you the coin and helps you avoid the enemy. That's the flavor of reinforcement learning: we're thinking about multi-step plans whose goal is to collect as many good things as possible, whether those good things come only at the very end or along the way, while avoiding bad stuff, and we're thinking multiple steps into the future.

The general setup is that there's some current state of the world. That could be the current position of your body, if you're taking motor actions, or it could be something much more abstract, like the state of a puzzle. Then we can take different actions. When we take those actions, the world changes in some way, and we might get some kind of reward, although in a lot of states we won't get any explicit reward. We keep making decisions, we keep changing the world, and usually our goal is just to collect as much reward as possible. We're trying to figure out a good sequence of actions that results in us getting a lot of reward, and those rewards might be many steps away from where we are right now.

As I mentioned, this is both a tool we use to analyze behavior and a very practically useful thing, so there's a ton of research into reinforcement learning. For example, the way Google runs its data centers: a data center has a lot of control knobs that have to be constantly adjusted, things like how much energy to spend cooling the servers in a room, or how to divert electrical power between servers. They actually have an AI reinforcement learning system that does this. The system looks at the state of the data center: which computers are currently being used, what the weather forecast is for the next 24 hours, and so on. All of that information goes into the system, which comes up with a plan for adjusting things like the cooling and energy systems, feeds it back in, and is constantly being checked. The system is trying to do something like minimize energy use in the data center. Doing that requires figuring out: should I turn some computers on or off? Should I ramp the cooling system up or down? If it's hot in the data center but the temperature is predicted to fall, maybe I shouldn't turn on the cooling because the problem will resolve itself anyway. Should I turn off computers that aren't being used, or do we predict that lots of people are about to log on? So there's a complicated system that, again, is trying to think multiple steps into the future to make these plans.
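This general setup, a state, an action, a possible reward, and a new state, is often written as an interaction loop between an agent and an environment. Here is a minimal sketch of that loop; the environment interface (`reset`, `step`) and the dictionary state with an `"available_actions"` entry are hypothetical stand-ins for illustration, not anything described in the lecture.

```python
import random

def run_episode(env, choose_action):
    """Play one episode: observe the state, pick an action, collect reward, repeat."""
    state = env.reset()                 # the current state of the world
    total_reward = 0.0
    done = False
    while not done:
        action = choose_action(state)             # decide what to do in this state
        state, reward, done = env.step(action)    # the world changes; we may get a reward
        total_reward += reward                    # the goal: collect as much reward as possible
    return total_reward

def random_agent(state):
    """A stand-in agent with no strategy yet: it just acts at random."""
    return random.choice(state["available_actions"])
```

Everything that follows in the lecture is about replacing `random_agent` with something smarter.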
Okay, so how do we actually do this? This is a very generic framework that lots of games, puzzles, and real-world problems fit into. What strategy can we use to pick the next action so that we meet the overall goal, which is maximizing our total sum of rewards?

One very simple approach is called Q-learning. The Q stands for quality, and the idea is that every action has a quality: how much reward we would expect to get, on average, if we take that particular action in this particular state. For any situation we find ourselves in, what we'd like to know is the quality of each of the actions we could take. The quality of an action tells us: if I take this action, here's how much future reward I expect to get. If we knew those Q values, the decision would be easy; we'd just pick the action with the highest Q value. The important thing is that the Q value is not saying how much reward you'll get right away. It's saying how much reward you'll get in the long run, until the end of the game, or over the next year, if you take this action right now. I might take an action that even hurts me right now, an action that costs me money, because I think that in the long run it will end up maximizing my rewards. So Q tells us: if I take this action in this state, what is the quality of that action, meaning how much future reward am I going to get?

But how do we compute these Q values? One way is just through experience, and this works especially well for situations that you encounter a lot. You could simply keep track: here are all the times I've been in this situation, here are the different actions I've taken in the past, and here is how things turned out for me on average when I took each one.

Let's think about how this might work in a simple game like tic-tac-toe. If we have a reinforcement learning agent that's trying to play tic-tac-toe, one thing we could ask is: what are the rewards in this state space? What's rewarding in the game of tic-tac-toe? Winning. In tic-tac-toe there are no rewards except when the game ends: you get a positive reward if you got three in a row, and a negative reward if the other player got three in a row. There's no other point system; you don't get any reward for putting pieces in particular spots. You just get the final reward depending on whether you win or not. We can also think about what the states are in tic-tac-toe. If I wanted to know my current state, what would I have to know? What X's and O's are currently on the board. We'd maybe also want to know whose turn it is, although you can figure that out from the X's and O's already on the board. So the state is the current state of play: what does the board look like right now? And the actions in tic-tac-toe are the possible moves I could make: I can put my X or my O into any of the available spaces.
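With states, actions, and rewards in hand, the quality value described above can be written down formally. This is the standard definition from reinforcement learning, where s is the situation you are in, a is one of the actions available to you, and the r's are the rewards you go on to receive; the second expression is just "pick the action with the highest Q value":

```latex
Q(s,a) \;=\; \mathbb{E}\!\left[\, r_t + r_{t+1} + r_{t+2} + \cdots \;\middle|\; s_t = s,\ a_t = a \,\right],
\qquad
a^{*} \;=\; \arg\max_{a} Q(s,a)
```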
When we're playing tic-tac-toe, our goal is to get ourselves into a winning state where we have three of our pieces in a row. You can never do that on the first turn, but with a sequence of moves you can get there. If we want to solve this using Q-learning, what we'd do is build a big table. We could take all of our own tic-tac-toe experiences, or maybe a big database of lots of people's tic-tac-toe games, and then ask: in this particular state, the empty board at the beginning of the game, what is the Q value for each possible action? One possible action on the empty board is to put an X up in the top left. The quality value for that action is: when we took the action of putting an X in that corner, what rewards did it lead to on average? In this case, the only reward is winning or losing, so we're basically asking: when I start the game with that move, how do things tend to turn out? How often did I win the game when this was my first move?

There are a couple of important things to highlight here. One is that we're doing this based entirely on previous experience. The only way to do it is if you've played tic-tac-toe, or at least watched someone else play it, a bunch of times; we're just looking at a past average of how things worked out. Another important thing is that there's no multi-step strategy involved. We're not thinking, this corner move is good because it prevents the other player from doing something. You can compute this even if you don't understand the rules of tic-tac-toe at all. If you're watching someone play the game without knowing the rules, you can still compute this Q value: when I see a player make this move at the beginning of the game, they usually win, or they usually lose. So the Q value is just saying, on average, did this decision work out well for me or not?

Now, to really play according to this quality-learning strategy, we would need to know the quality of every possible move, all nine possible moves on the first board, so that we could pick the one with the highest quality, meaning the one that most often leads to winning. And we'd need these Q values for every possible state of the board. If we find ourselves in a situation where the board looks a particular way, what we'd like to know is: when the board looked like this in the past, how often did I win if I played each of these squares? When I played an X in the center from this position, how often did that lead to me winning? In this example there are seven possible moves left, and we'd like to know the quality of each of them. Again, we can do this if we have a big database of tic-tac-toe games. There aren't that many possible situations in tic-tac-toe, especially if you use the fact that the board is rotationally symmetric, so positions that are just rotations of each other can be treated as the same. There aren't that many situations you can get into.
It's on the scale of tens to hundreds of meaningfully different situations. So if you have lots of tic-tac-toe games, you can just make a table of these Q values that says, in this situation, these moves tend to lead to wins and these tend not to. This Q-learning is an example of a model-free approach: we're not using any knowledge about how the world of the game works, and we're not planning into the future at all. We're just saying which actions tended to work out well for me in the past. You could do this even if you don't know how the game works. At a minimum you have to know what actions you can take, and what a winning state is, but you don't really have to know anything else about the game.

For games more complicated than tic-tac-toe, computing these Q values gets tricky, because you need to compute them for every possible state of the board. For more complicated games that's hard, but you can at least still do it for early positions in the game. For chess, for example, we could compute the Q values for different possible first moves. White plays first, and here are two moves you could make at the beginning of a game. If we want to figure out which is the better move, one thing we could do is use our deep knowledge of chess. But if we want a model-free approach, this simple Q-learning approach, the way we'd figure out which is the better move is to go to a big database of chess games and ask: people who played this first move, how often did they win? What were the outcomes? People who played this other first move, what were the outcomes? You could make a decision based just on which of these opening moves tends to work out well for people in general. For anyone who plays chess: which move would you rather make here, the one on the left or the one on the right? Yeah, that's right. The left move is one of the worst first moves you can make. The way these numbers are actually computed is a little more complicated, but we can pretend they're basically Q values: across all the games in our database, people who started with this move lost about 49% of the time, won about 32% of the time, and the rest were draws. Whereas with this other opener as white, most games that start this way are won by white. So here's a way you could play chess even if you don't know any of the rules or any of the strategy: if you had these Q values available to you at every possible board state, you could be a really good chess player. At every board state, you'd know from a big database of games which moves you could make and how well each one tends to work out for people, what kinds of rewards they get in the future. Here, the rewards are basically the game outcomes.
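To make "pick the highest-quality move" concrete, here is a minimal sketch of playing greedily from a Q table. The move names and most of the numbers are made up for illustration (only the 32% win / 49% loss figure comes from the lecture), and the scoring convention of win = +1, draw = 0, loss = -1 is an assumption about how outcomes would be turned into a single number.

```python
# A Q table maps (state, action) pairs to an average long-run outcome.
# Values are illustrative, scoring win = +1, draw = 0, loss = -1.
q_table = {
    ("start", "f3"): 0.32 - 0.49,   # about -0.17: a poor opening
    ("start", "e4"): 0.45 - 0.30,   # about +0.15: a common, solid opening
    ("start", "d4"): 0.44 - 0.31,
}

def greedy_move(state, legal_actions, q_table, default=0.0):
    """Pick the action with the highest estimated quality in this state."""
    return max(legal_actions,
               key=lambda a: q_table.get((state, a), default))

print(greedy_move("start", ["f3", "e4", "d4"], q_table))  # -> "e4"
```

Untried actions fall back to a default value here; how to handle moves you have never seen is exactly where this approach starts to break down.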
Of course, this approach doesn't really work for chess after the first few moves. The problem is that there are many, many possible board states in chess, so even with a really big database of games, you're very quickly going to find yourself in situations that have never happened before. We'll talk in a bit about what you would do then.

I was going to give one more example of Q-learning. Just to make sure you really have no knowledge about the world, I made my own little game. It's not that exciting, but it is a game you don't know the rules to, which is useful for this demo. In this game there are three buttons; the first two are red and the last one is green. You can push any of the buttons, and we're going to try out our Q-learning-based strategy on this game. Right now we have no idea what's going on, so we're just going to pick randomly for a bit and see what we can learn from the outcomes. Let's push button number one. Button number one is now pushed in and lit up. We didn't get any score for that, no reward; our score is still zero. Let's try pushing button number three. Okay, the game ended when we pushed button number three, and our final score is negative one. Now that we've gotten to the end of the game, we can update our Q estimates, our quality estimates. Here's what a Q-learning table would look like. In this game, we started with all the buttons not pushed, we chose to push button number one, and at the end of the game that led to a final score of negative one. So what this Q-learning table is telling us is that, in our experience so far, when our first move in this game is to push button number one, we end up with a score of negative one at the end. Then, in the state where the first button was pushed, we pushed button number three, and again that ended with a final score of negative one. All the other entries are just question marks: we haven't tried those actions yet, so we don't know their quality. That's how we update the table. It's just bookkeeping. We're saying: these are the choices we made in this game, the button-one action in this state and the button-three action in that state, and the final outcome of the game was negative one.

All right, let's play again. Anybody? What button should we push? All right, two, so let's push button number two. We got a score of one. What should we do next? One? Okay. And then our only remaining choice is three, and our final score there was two. So now we update our table: in the first state, when no buttons were pushed, we pushed button number two and ended up with a final score of two in this game. Again, the important thing about the quality score is that it's not about the reward you get for pushing that button; it's about the final score we get in the long term when we take that action in this state. You can see we're starting to fill in our table, starting to have some experience. From our experience so far, it definitely seems like pushing button two was a better first choice than button one, but a lot of this table is still pretty uncertain. So we can keep doing this. We could try a different sequence, say one, then two, then three. That resulted in a final score of one.
And again, we're updating our table. You can see this value changed, because we've now had two experiences of pushing button number one at the beginning: one led to an outcome of negative one, the other to an outcome of one, so on average the reward we got for that action was zero. We're just keeping track of what's happened so far. We're not really trying to learn the rules of the game, although these rules are simple enough that you could probably figure them out with a few more examples. We're just building this table of things we've tried. We could try pushing button number three at the beginning; that doesn't work well, so that action is going to have a low quality. Pushing button number three first just doesn't work. Let's see, here's something we haven't tried yet: two and then three. That gave us a score of zero. So even once this table is filled in, the values can keep changing as we try different things, because each entry is telling us how different combinations turn out on average. If we do this enough times, we eventually get quality scores telling us that, of all the things we tried in this first state, the best choice seems to be button number two; that tended to work out best for us in general. That takes us into this next state, where the best thing to do is push button number one, and then we push button number three. So the table is basically showing us, on average, what the outcomes look like for each of these combinations. If I did this enough times, you could try to figure out the rule for how the points work, but you don't really have to. The model-free approach is just saying: of the different actions I took in this first state, which ones tended to work out well and which ones didn't? Any questions about this Q-learning approach?
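The bookkeeping in that demo can be written in a few lines. This is a minimal sketch under the same assumptions as the demo: the only feedback is the final score of a game, and the quality of a (state, action) pair is just the average final score of the games in which that pair was tried. The state labels and the `episode` list of (state, action) pairs are shorthand invented for the sketch; no environment is shown.

```python
from collections import defaultdict

# For each (state, action) pair, keep the running total of final scores
# and the number of games in which we tried it.
totals = defaultdict(float)
counts = defaultdict(int)

def update_q_estimates(episode, final_score):
    """episode: list of (state, action) pairs chosen during one game."""
    for state, action in episode:
        totals[(state, action)] += final_score
        counts[(state, action)] += 1

def q_value(state, action):
    """Average final score when this action was taken in this state (None if untried)."""
    n = counts[(state, action)]
    return totals[(state, action)] / n if n else None

# The first two games played above:
update_q_estimates([("no buttons pushed", "push 1"), ("1 pushed", "push 3")], final_score=-1)
update_q_estimates([("no buttons pushed", "push 2"), ("2 pushed", "push 1"), ("1,2 pushed", "push 3")], final_score=2)
print(q_value("no buttons pushed", "push 1"))  # -1.0
print(q_value("no buttons pushed", "push 2"))  #  2.0
```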
Okay, so as I mentioned, Q-learning works great, especially for simple problems or for situations you encounter a lot. For opening moves in chess, we really can compute these Q values of which moves tend to lead to wins and which tend to lead to losses or ties. But this becomes a problem when we find ourselves in situations that are new; then we can't fall back on Q-learning, because we don't have enough experience. For games like chess, that happens pretty quickly. Once you're ten moves in, you're very likely in a state that you've never experienced before.

So what we can do instead is use a model. Instead of a model-free approach, we can use a model-based approach, which is really more like what we'd call planning. Instead of just noting which actions tended to work out well in the past, without thinking much about how they actually impacted the world, here we try to think: if I take this action, here is how the world is going to change, or at least here's my guess about what will happen. And then, what could I do from that next state, and the state after that? This is making a plan with a model of the world: I have some idea of what will happen when I take an action, some idea of whether I'm going to get rewards or not, and some idea of what kind of sequence might get me to my goal.

This is often studied in the lab using the kind of task I'll give an example of here. Imagine you have two possible radio stations you could turn on; maybe you're in a new place and haven't listened to either before. You could turn on the 70s Gold station or Today's Hits. The 70s Gold station usually plays music from the 70s, like Fleetwood Mac; Today's Hits usually plays current pop. But that's only most of the time: about 20% of the time each station plays something else, so Today's Hits sometimes plays older music, and 70s Gold sometimes plays newer music. Now let's say you turn on the 70s Gold station and Taylor Swift is playing, maybe a new song you hadn't heard, and you love it. This outcome was very rewarding; you're very happy with how things turned out. The question is: tomorrow, when you go to turn on the radio, which station should you choose? A model-free learner would say: yesterday we turned on the 70s Gold station and that worked out really well for us, so we should do it again; that action led to a good reward. A model-based learner, though, who knows the structure of the world, knows that hearing Taylor Swift on the 70s Gold station is actually pretty unlikely. That was a fluke; it's not what the station usually plays. So if your goal is to hear Taylor Swift, you should turn on Today's Hits. That's making more of a plan: if I turn on Today's Hits, what is going to happen in the world? What we expect is that a Taylor Swift song, or something similar, will start playing, and if that's my goal, that's the action I should take. This is a setup where the two strategies make opposite predictions, which makes it a good way of testing the extent to which people are using model-based versus model-free learning. You can have cases where, even though a certain action worked out well for you in the past, it's not the best decision right now if you know the model of the world.
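Using the numbers from the example (each station plays its usual genre about 80% of the time and something else about 20% of the time), and coding "hear the song you wanted" as a reward of 1 and anything else as 0 (that coding is an assumption made for the illustration), the model-based values of the two choices work out as:

```latex
V(\text{Today's Hits}) \;=\; 0.8 \times 1 + 0.2 \times 0 \;=\; 0.8,
\qquad
V(\text{70s Gold}) \;=\; 0.2 \times 1 + 0.8 \times 0 \;=\; 0.2
```

A model-free learner, by contrast, would nudge its estimate for 70s Gold upward after yesterday's lucky outcome and keep choosing it, which is exactly how the two strategies come apart in this task.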
Another example of this: my office is over in Schermerhorn Hall, and one way I can get over here is by going out the front door and walking this way through campus. But the gate over by Earl Hall has been getting closed a lot lately, and if someone tells me it's closed, then I'm not going to leave through the front door of Schermerhorn, even though that worked out well for me the past few times. Instead I'll sneak out the back and go around the other way. The fact that I can change my plan just by hearing that information is a property of model-based systems: I have a map of the campus in my head, some idea of what will happen if I take an action like walking out the front door and what happens next. Even though walking out the front door worked well in the past, I've learned new information, I've updated my model, and I now know it's a bad move to make today. A model-free learner, by contrast, would just blindly walk out the front door, because that has worked out well in the past. The model-based learner is thinking multiple steps ahead: here are the sequences of things I could do; will that sequence actually give me the result I want? A big advantage of using a model is that you can incorporate new information about the world even if you haven't experienced it yourself. If someone tells you something, you can fold it into your model and use it to make better decisions. A model is also really useful in situations you haven't been in before: a model-based planner can figure out good moves to make even in a completely new situation.

In computer science, a few years ago, when people were really into these reinforcement learning problems, there was an interesting challenge: build a reinforcement learning agent that can play a bunch of old Atari games. For most Atari games it's pretty easy to get started, in the sense that if you just take random actions you'll get some points and can figure out what the right thing to do is. But there are some games, like this one platforming game in the challenge, where part of the game requires you to go grab a key off to the side, which opens a door in another room if you're carrying it. The problem for model-free learners, which basically just try random stuff and see what happens, is that beating this game is almost impossible for them: the chance of accidentally picking up the key and carrying it to the door is essentially zero. The point is that for challenges where the right thing to do is very non-obvious, it may be hard to make any progress at all unless you have some kind of model. A lot of these systems operate directly on the pixels; they don't even know what a key is. The reason a human can do this is that you see the key and think, that would probably open a door somewhere. That's you using your model of the world, your knowledge of what keys usually do. It doesn't have to be what happens in this particular video game; that key could be an enemy that hurts you. But you're using your model to make informed guesses about what probably happens here, so you don't have to rely purely on having seen someone play this game before.

So, these two different strategies: I've been talking up the advantages of the model-based one, but they both have their pluses and minuses. The model-free approach, and quality learning is an example of a model-free approach, just uses your past experience of which actions tended to work out well in the long run. A nice feature is that you don't have to build a model at all, so for complicated situations where the true model might be really difficult to understand or discover, you can skip that entirely. We also talked about this a little in the last class: at least one philosophy, especially in robotics, says maybe we should avoid building models and just figure out actions that tend to work out well. And this approach is also really fast.
If you've already computed a quality table of which actions are good in which situations, you can make a really fast decision: I've been in this situation many times before, and here's the action that has tended to work out well. There's no planning about the future. For the very first move in chess, for example, a computer system can decide pretty much instantly, because it has pre-computed values for what the good moves are. The model-based approaches, where you actually try to simulate what the impact of your choice will be, involve thinking multiple steps into the future. The advantages are that you can estimate values for new actions and make plans even in situations you've never been in, and that if you learn something about how the world has changed, you can quickly update your model. If I learn that a particular gate is closed, I don't have to walk all the way there and see it for myself; if someone tells me, I can just make a different plan. It's much easier to incorporate new information when you have a model you can update.

So what do people do in real life? People use both of these strategies, and it depends on the kind of choice you're making. Are you making a very quick choice in a very familiar environment? Then you're probably doing something pretty model-free. Are you making a complicated, maybe high-stakes choice in a new situation? Then you're probably doing something more model-based. I certainly find myself doing both. I moved the drawer in my kitchen where my utensils are, and a year later, if I'm distracted, I will still go to the wrong drawer. I still have this model-free habit: if you need a spoon, this is the direction you go. If I stop and think, using my model, I know it's the wrong drawer, but I have this very strong pull toward an action that worked out well for me in the past.

People also use combinations of these strategies, and the term heuristic search usually refers to combining them. The idea is that you do something model-free to guess which actions might be good: based on your past experience in this situation, or a similar one, these are moves that have worked out well in general. You generate a few of these candidate actions, and then you use your model to think carefully about just that small set. The term heuristic means a rule of thumb: you have some idea of which actions probably make sense in this situation, and then you use your model to actually play them out into the future. If you're playing a game like chess, you're doing something like this. In a given position, you can pretty quickly get an idea of which moves probably make sense and which ones definitely don't. You can use general rules like: near the beginning of the game, you almost never want to bring your king out to the middle of the board; that's almost never the right answer, so maybe we don't even consider it.
But among the options that do sometimes work out well, we can then think carefully: if I make this move, what is going to happen to the state of the game? Is that going to be good or bad for me? And we can think farther into the future from there. So you do a mix of the two: you quickly generate some plausible moves, and then you use your model to simulate them out farther.

The strategies we've talked about are, in some sense, guaranteed to work for any problem. The Q-learning strategy of figuring out which actions tend to work out well is, in principle, the answer: with enough experience you could use it to make optimal decisions. But in practice it's really hard to use these strategies in real life. One problem you have to deal with as a cognitive agent is that the state space, the number of possible configurations of the world, can be really, really big, or even continuous, meaning effectively infinite. If I want to move my arm, I can move it an inch, or half an inch, or a quarter of an inch; I have a continuous set of choices of how to move it, so there's effectively an infinite number of situations I could be planning about. Or maybe it's not infinite, but it's still enormous. In chess there's a finite number of possible boards, but it's very, very large; the number of possible games of chess is substantially more than the number of atoms in the universe. There's just no way we're going to build a Q-learning table for that. The other problem is that the set of possible actions can also be really big, or continuous. In most board games the set of actions is pretty small; even in chess there are usually only a few dozen legal moves in a given position. But in real-life situations it can be huge. If your choice is how to spend the hour after class ends today, there's a very large, in some sense almost continuous, number of things you could do. If you're trying to plan in a very general way, it gets really complicated to estimate the quality of all of those possibilities.

Another big challenge in reinforcement learning problems is that the rewards or goals you're trying to reach might be really far away from where you are right now. That makes model-based planning really difficult: if you're trying to figure out how to get from here to your goal state, it might be hard to simulate even one possible path that gets you all the way there. So in practice, when we compute things like qualities, we usually don't think about the sum of all possible rewards into the future; that's really difficult to keep track of. Instead, we keep track of rewards up to some time limit in the future, or we use a kind of soft horizon, where things that are farther and farther in the future count less and less.
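The soft-horizon idea is usually written with a discount factor gamma between 0 and 1, so that a reward k steps in the future is weighted by gamma to the power k. This is the standard formulation rather than anything specific to this lecture:

```latex
G_t \;=\; r_t + \gamma\, r_{t+1} + \gamma^{2} r_{t+2} + \cdots \;=\; \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k},
\qquad 0 \le \gamma < 1
```

With gamma close to 1 you care almost as much about distant rewards as immediate ones; with gamma small, only the next few steps matter, which is effectively the cutoff just described.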
You more or less have to do something like this just to be able to make reasonable decisions at all. If you're playing a game with many possible moves, it's going to be really difficult to simulate all the way to the end, but you can think some number of moves into the future and then cut it off: any rewards past that point you just have to estimate rather than actually plan out. Because of this, and other issues, learning these Q values or learning a model of the world can take a lot of attempts. For a lot of these systems, the game engines for chess, and now for Go and other games, part of the way they're trained is by playing against themselves for a really, really long time. They get much, much more experience than a human ever does, in terms of the number of games played, and that makes some of these questions a bit easier.

In cognitive science, if we're interested in how humans learn to do something, we have to think about how humans actually deal with these problems. That's a lot of what you're doing in cognitive science if you're using reinforcement learning. If you have a continuous set of possible actions, how does a human actually think about them as different options? If rewards are really far away, what do you do? What people often do is, again, some combination of model-based and model-free strategies: you plan out some number of steps into the future, and then you make an estimate of whether that seems like a good state to get to or not. Even if you can't see your way all the way to the goal, you can think, this seems like a good start, this seems like it's heading toward the goal; you come up with estimates of whether you think things will work out. You can also think about cases where your Q-learning table, before you've filled it in, isn't just question marks: you can make intelligent guesses about which things are more or less likely to make sense. Some of those guesses come from real experience, and some are just biases you have. In general, people might have a bias that if you're not sure what to do, you should do nothing and see what happens next; that's a reasonable rule that works well in a lot of situations. There's a lot of work in developmental psychology on how these kinds of rules change from young kids to older adults. Really young kids, for example, might often be more model-free, or they might be model-based but with really simple models. There's a version of this task that you give to kids where, instead of radio stations, there are rocket ships that can fly to different planets, and you can measure the extent to which kids choose the rocket ship that worked out well for them in the past versus making a plan about which planet to visit.
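Coming back to the "plan a few steps ahead, then estimate how good the resulting state looks" strategy from a moment ago, here is a minimal sketch of that idea as a depth-limited search. Everything here is a hypothetical illustration: it assumes a deterministic model object with `actions(state)` and `step(state, action) -> (next_state, reward)` methods, and `estimate_value(state)` stands in for the quick "does this look like a good state?" judgment.

```python
def plan_value(state, model, estimate_value, depth):
    """Value of a state if we plan `depth` steps ahead, then fall back on a quick estimate."""
    if depth == 0:
        return estimate_value(state)   # can't see to the goal from here; just guess how good this looks
    values = []
    for action in model.actions(state):
        next_state, reward = model.step(state, action)   # what we think would happen
        values.append(reward + plan_value(next_state, model, estimate_value, depth - 1))
    return max(values) if values else estimate_value(state)

def best_action(state, model, estimate_value, depth=3):
    """Pick the action whose simulated future (up to `depth` steps) looks best."""
    def score(action):
        next_state, reward = model.step(state, action)
        return reward + plan_value(next_state, model, estimate_value, depth - 1)
    return max(model.actions(state), key=score)
```

The depth limit plays the role of the horizon described above: beyond it, the planner stops simulating and trusts its rough estimate instead.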
Another thing we're interested in in cognitive science is what happens when you're an expert in a particular domain. How does expertise change the way you solve problems? What does it mean in terms of this framework? It potentially means a lot of things, and it may depend on the particular problem, but one thing is that experts tend to be much better at figuring out which aspects of a state are important to pay attention to. The state itself may have lots of pieces, but some of them might not matter. In tic-tac-toe, for example, if you're familiar with the game, you know that rotating the board makes no difference. An expert in tic-tac-toe, and it's not that hard to become one, would know that the quality of all the corner moves at the beginning has to be the same: there's no way the top-left corner and the top-right corner could have different quality values at the start, because they're symmetric in terms of the game. An expert knows which elements of the state are important and which aren't. For your first move in tic-tac-toe, there are theoretically nine actions you could take, but only three meaningfully different things you can do: go in the center, go in a corner, or go on the middle of a side. It doesn't matter which corner you pick. So an expert is aware of which aspects of the state matter and which can safely be ignored. Being sensitive to the key aspects of a state that are going to matter for reward is something an expert can do.

An expert might also be able to estimate action qualities without doing any model-based planning, even for novel situations. I could show you a particular position in a game you're familiar with, and you might be able to say, this seems like a good situation or this seems like a bad situation, even if you've never been in it before and without simulating much of the future. You may be able to recognize particular features of a state: if your king is currently in check in chess, that's probably a bad sign for you, regardless of whatever else is going on. So you can make quick estimates of the quality of an action, of which moves seem like good moves in this situation, without relying on prior experience of that exact state and without actually thinking about the future. You recognize that, even though you've never been in this particular state, you've been in similar states and you know how to deal with them.

An expert might also, in a sense, not make decisions at all. Especially in sports or other motor-action domains, an expert might execute a sequence of actions that in the moment is neither model-free nor model-based: they're not making any decisions. They've been in a situation so many times that they execute an entire sequence of motor actions without even really paying attention to the state of the world after they start the sequence.
They just have an automatic sequence that they know works really well. Think of a talented musician: if you ask them to play a familiar piece, they're probably not making any kind of conscious plan about how they're moving their fingers on the piano. They're relying on an entirely pre-planned sequence. They're not trying to figure out the highest-quality move, and they're not really planning whether moving a finger here will put the hand in the right position; they have a memorized sequence of motor commands that they created at some point in the past and are just replaying now. That's how you can get really fast, and really accurate, at a lot of tasks: by replaying previously made plans.

A really interesting study about this, mentioned in the reading on expertise, looked at expert chess players, grandmasters and other high-level players. They were given a task: try to memorize the positions of chess pieces on a board. The board on the left is a real game state, a position that actually came up during a real game. The one in the middle is a randomized version, with the pieces moved to random positions; that board did not occur during a real game. They gave chess experts and novices a bunch of boards to memorize: they'd show a board, then give them a blank board, and the person had to put the pieces back in the right spots. It's a pretty hard task, and what they found is that the experts are better at it. One explanation might be that being good at chess just means you have a really good memory in general, or that people with good memories like chess, or that playing chess improves your memory. But that didn't seem to be it, because on the random board positions, like the one in the middle, the experts were not much better than the novices. Where the experts were better was on real board positions from real games; those they could reconstruct better than a novice. Chase and Simon's interpretation was that when the expert looks at the board, they're not really keeping track of the individual pieces. They're keeping track of something more like the current state of the game: right now this knight is threatening this rook, most of the black pieces are on the right side of the board, and the white pieces are down here. They have some internal representation of the game that is really sensitive to the rules and to the things that matter. It's not that they're simply good at memorizing piece locations, because when the pieces are in random spots, they can't do it.
A question from the class: is it possible that this is a case of retrieval, but very fragmented, where they're recognizing that this is a valid position for this piece and that piece, and retrieving memories of when they've seen those individual pieces before? Yes, I think that's the way to think about it. Chase and Simon referred to this as chunking, meaning the expert recognizes meaningful chunks they've seen before: a particular configuration where these four or five pieces sit in a meaningful, understandable arrangement relative to each other, with a certain tension between them, which ones can attack which. So it could be recognition. It doesn't have to be that they've seen that exact position before, but it at least fits some template they're familiar with, and if you have these templates for how to parse the board, you can do that kind of retrieval. In some of our experiments we've also shown people other kinds of video games that they're either familiar or unfamiliar with: if you're watching a game you don't know, it's really difficult to see what the meaningful chunks are and how to remember them.

So an important aspect of expertise is that it lets people filter the world: these are the things to pay attention to, these are the things to ignore. This is often a big problem for people learning how to drive. When you first start driving, there's a lot happening on the road: lots of signs, people, cars. It can be really nerve-wracking, because it's hard to know which parts of the state are relevant for your current actions. Once you're a more experienced driver, you have a better sense that a lot of what's happening in the other lane isn't relevant to you right now and doesn't need to be tracked; there are really only a few things you need to watch out for. Or you can flag that something really unusual is happening right now, slow down, and maybe do some model-based planning to figure out how to deal with it.

So there's a lot of work on expertise in cognitive science. We have some work in my lab as well, where we teach people a game adapted from one developed at NYU called Four in a Row. It's like tic-tac-toe, but you have to get four in a row instead of three, and it's played on a bigger board. It's a good game for this because it's actually kind of hard when you start playing; you're pretty bad at it at first, but you can become an expert over the course of a few hours. We've looked at what happens as you become more of an expert in the game: how it changes the way you remember sequences of moves, and we've shown that it changes where you look on the board. If we have someone watch two other people playing a game, we can measure their eye movements and get an estimate of how good a player they are just from where their eyes move during the game.
You can see this in other domains too; people have also looked at it in things like basketball games. If you are a very experienced basketball viewer, your pattern of eye movements on the screen will tend to look quite different from a novice viewer's. And again, this is because you have a better understanding of what the relevant pieces of the state are and you can make predictions about what's going to come up next. In this case you're just watching; you're not really taking actions at all, but you still have a better understanding of the current state of the game and what's going to happen next.

So as I mentioned, there's been a lot of interest over the past ten years in building artificial versions of these reinforcement learning systems. One of the big breakthroughs was in 2016, a system called AlphaGo. This system plays the game of Go, where you place black and white stones on a big grid. Go was for a long time considered borderline unsolvable by AI. We've had human-level or better chess algorithms for a number of decades now, but Go is not easily solvable with the same approaches used for chess, in part because there are far, far more possible board positions. What they showed here is that you can build a reinforcement learning system for it. It made use of human game play and of knowledge about the rules and strategies of the game, and it was able to get up to the level of an expert human player.

Since then, this team and others have worked on extensions of the method, for example versions of this reinforcement learning model that don't use any human data or domain knowledge. The AlphaGo Zero model was able to learn Go just by knowing the rules of the game. It was not given any examples of anyone actually playing; it was just given the rules: here are the allowable moves, here is how scoring works. Then it basically played itself millions of times and came up with an understanding of what the good strategies in this game are. These models have since been extended, so there is a single model called AlphaZero where you can put in any set of rules and it will try to figure out a good strategy for that set of rules. The same model can learn to play Go and chess and other games. This is what we've used in my lab for the Four in a Row project: we can put in the rules for Four in a Row, and variants of the game with different rules, and generate AI players that play according to any rule set. You don't need to give it examples of what good and bad play looks like; it figures out the strategies on its own.

The most recent version of this model, MuZero, which came out a few years ago, can also work on Atari games. It doesn't even need to be told the rules. All it gets is a picture of the screen and the current score, and that's it. It's not even told what the rules of the game are; it just has to try taking actions and figure out what's going on. And it can learn to play a whole suite of Atari games as well.
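To give a feel for what "learning just from the rules by playing itself" means, here is a toy self-play learner in Python. The game is a tiny Nim variant (take one to three stones; whoever takes the last stone wins), standing in for Go only because it is small enough to fit here; the epsilon-greedy policy, the tabular Q values, and all the constants are placeholder choices of mine, not anything from the AlphaGo Zero paper.

```python
import random
from collections import defaultdict

# Toy stand-in for "learning from self-play given only the rules".
# The game: a pile of stones, players alternate taking 1-3, whoever takes
# the last stone wins. Not Go; the point is just the shape of the loop.

ACTIONS = (1, 2, 3)
Q = defaultdict(float)        # Q[(stones_left, action)] = value for the player about to move
EPSILON, ALPHA = 0.1, 0.1

def choose(stones, greedy=False):
    legal = [a for a in ACTIONS if a <= stones]
    if not greedy and random.random() < EPSILON:
        return random.choice(legal)       # occasional exploration
    return max(legal, key=lambda a: Q[(stones, a)])

def self_play_episode(start=21):
    history, stones = [], start
    while stones > 0:
        action = choose(stones)
        history.append((stones, action))
        stones -= action
    # Whoever moved last took the last stone and wins; walk the game backwards,
    # flipping the reward's sign because the players alternate.
    reward = 1.0
    for state_action in reversed(history):
        Q[state_action] += ALPHA * (reward - Q[state_action])
        reward = -reward

for _ in range(50_000):
    self_play_episode()

# The learned greedy policy should leave the opponent a multiple of 4 stones.
print([choose(s, greedy=True) for s in (5, 6, 7, 9)])   # expect roughly [1, 2, 3, 1]
```

After enough self-play games, the greedy policy rediscovers the standard strategy of leaving the opponent a multiple of four stones, even though it was never shown an example of good play.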
The general structure of these models is outside the scope of this class, but some pieces of it are relevant to the things we've talked about today. There are a few things these models try to learn. One is what's called an embedding function: how you go from the current input, like looking at the board or looking at the game, to the currently relevant state features. You try to learn that when the world looks like this, these are the things I should be paying attention to. In this video game right now, Mario's position matters a lot and where the enemies are matters a lot, while the clouds in the sky don't matter at all; they can't interact with you. So you need to learn this embedding function. It's an expertise function of sorts: what are the relevant things in the world? That's what you use to create your state, the state variable s.

These systems also learn what's called a value function, which is basically estimating these Q values. Given a state and an action, it tries to estimate how good that action would be in that state. It's not doing any simulations; it's just a function, one of these neural networks, where you push in the current state and an action and it tells you whether it thinks that action would be good or bad.

And then it also tries to learn what's called a dynamics function, which tries to learn a model of what would happen if you took a given action: how would it change the state of the world? For these more recent systems, this is not something that's given to the model; it's something it has to learn. In this particular game, what happens if I make this move? It can try to make an estimate. It might not be able to tell you exactly what the next state will be, both because the model might have mistakes in it, and because for a lot of these games there's genuine uncertainty. If you're playing chess against another person, when you make a move, you're not totally sure what the person is going to do next. You can make a guess, but there's some uncertainty about what the state of the board will be by the time it gets back to you. So there's some uncertainty in these models as well, where they try to estimate the possible states that could come up next.

So the model is trying to learn all three of these things: how do I define the current state in a meaningful way, how do I estimate these Q values, and how do I build a model of the dynamics? The reason it does both Q learning and model-based learning is that it combines the two. It does something very similar to the heuristic search we described before: for a given state, it uses the Q estimate function to say, okay, here are the actions that are probably good, and here are the ones that are probably bad, so let's not look at those at all. For the ones that are probably good, it then uses its approximate model of the world to roll out into the future: here is the sequence of things that would probably happen if I did this.
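Here is a rough sketch, again in Python, of how those three learned pieces fit together at planning time. In the real systems each of these functions is a trained neural network and the search is a full Monte Carlo tree search; here they are hand-written toy functions over a one-dimensional "move toward the goal" world, with invented names, just to show the interfaces and the prune-then-roll-out structure.

```python
import itertools

# Toy sketch of the three learned components and how they combine at planning time.
# The "world" is one-dimensional and the functions are hand-written rather than learned.

ACTIONS = (-1, 0, +1)    # move left, stay, move right
GOAL = 5                 # toy goal location

def embed(observation):
    # Embedding function: raw observation -> internal state.
    # In a real system this would pick out the relevant features of, say, a screen image;
    # here the observation already is the state, so it passes through unchanged.
    return observation

def value(state):
    # Value function: a fast, simulation-free guess of how good a state is.
    return -abs(GOAL - state)

def q_estimate(state, action):
    # Q-style estimate: how good does this action look, without any rollout?
    return value(state + action)

def dynamics(state, action):
    # Dynamics function: a (possibly imperfect) model of where this action leads.
    return state + action

def plan(state, depth=3, top_k=2):
    """Prune with the fast Q estimates, then roll the model forward and score the leaves."""
    candidates = sorted(ACTIONS, key=lambda a: q_estimate(state, a))[-top_k:]
    best_action, best_score = None, float("-inf")
    for first_action in candidates:
        for rest in itertools.product(ACTIONS, repeat=depth - 1):
            s = state
            for a in (first_action,) + rest:
                s = dynamics(s, a)       # imagined rollout, no real actions taken
            if value(s) > best_score:
                best_score, best_action = value(s), first_action
    return best_action

print(plan(embed(0)))    # the plan heads toward the goal, so this prints 1
```

The structural point is the division of labor: the fast Q-style estimates decide which actions are worth considering at all, and the learned dynamics model is only rolled forward for those few candidates.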
Again, there's some uncertainty there, both because there may be errors in the model and because there's uncertainty about what's going to happen in the world. But if you have these two pieces, you can start doing a very good job of creating these plans. I think these models are doing something at least somewhat similar to how we think humans solve complex problems: you're trying to learn what the relevant pieces of the state are, you're trying to learn the approximate quality of different actions, and then you don't just blindly do the action that intuitively seems best. You think about it: if I take this action, what's going to happen in the future? This combination of approaches is really powerful, at least if you have enough time to do these kinds of rollouts and you are familiar enough with the game to know the model pretty well.

So these reinforcement learning agents can now be applied to most computer-based kinds of situations. Interestingly, the one domain where these RL algorithms still struggle a lot is robotics. Like we talked about last week, doing this kind of planning in the real world is quite a bit more difficult, for a lot of reasons. First of all, the actions a robot can take are a lot more complicated. The model of the world also gets a lot more complicated: if you're playing chess, there's no chance that a gust of wind is going to come and knock over half the pieces, but that's a very real thing that could happen in the world. The space of things that could happen is much bigger. And in the real world there are usually some pretty tight time constraints. If you want to walk like a human, it's actually impossible to walk super slowly like a human does; you have to move at a certain speed because you're essentially falling onto your next foot, and you can't go halfway and then stop to think about what you're going to do next. That means your planning has to be fast enough to actually work in the real world, whereas in most of these games you can take ten seconds to think about your next move.

People are still working on ways to do this. The most promising approach right now is to build little simulators of the world, so the model learns inside a simulation that's reasonably accurate to how the robot actually works, and then you test it out on the real robot. Google's robotics division actually has a whole warehouse of these robots, which is a little bit freaky to watch, where all the robots are running this training algorithm, doing things that seem insanely simple to most people, like picking up a block on a table. That is quite a hard reinforcement learning planning problem: finding the right sequence of actions that lets you pick up an arbitrary block when you don't know ahead of time exactly what its shape is or exactly how heavy it is.

Okay, so we talked today about this framework of reinforcement learning. In this framework we have states of the world: there are current states and there are goal states.
Our goal is to figure out strategies for getting good rewards, or for getting to some final state whose reward is really good. There are model-free approaches, where we just use our past experience, either things that have happened to us or things we've observed in other people, about which actions in this situation tend to work out well in the long run. Or we can build an internal model of the world and use it to explicitly plan out future actions.

All right. Before I go, just a few reminders, because there are a lot of things coming up for the class. Paper number two is due tomorrow night. Because of the Thanksgiving holiday, the paper three timeline is a bit different from the other papers: the proposal, which is just the one-paragraph thing, is due next week, and the full paper is due on the last day of class. We've also gotten questions about the final. It will be the same format as the midterm, so multiple choice, with about the same number of questions. Let me just finish up, one second, question in the back, because there is an important choice you have to make.

The important choice is this: the final has now officially been scheduled, which is good. It's going to be here in this room, 1 to 4 p.m. on December 18th. That is the official time for the final, and it's the recommended time to come take it. If you don't want to stay on campus until the 18th, the other option is to take the final during the last day of class, on December 9th. If you don't choose to take the final then, the last day of class is a review session, just like we had for the midterm. So if you're feeling confident and want to get off campus earlier, you can come here on December 9th and take the final. If you would like to take it at the regular time, we'll instead have a Zoom review session during class on the 9th; we'll review material there, and you can obviously also use the finals period itself to review. You do get different amounts of time: again, it's the same length as the midterm, and we don't intend for it to take you more than about an hour, but if you come at the official time you have the full three hours if you need them. Any questions about the final? Yes, a question over here. We'll see everybody else next time.