Episode 28: Reinforcement Learning and Q-Learning
- Links to this episode: Spotify / Apple Podcasts
- This transcript was generated with AI using PodcastTranscriptor.
- Unofficial AI-generated transcripts. These may contain mistakes. Please check against the actual podcast.
- Speakers are denoted as color names.
Transcript
[00:00:00] Blue: A warning for Theory of Anything Podcast audio listeners. This episode is mathematically really quite heavy. The reason why is because I'm going into the theory of reinforcement learning, which requires me to go through the proofs and the mathematics to explain how it works, et cetera. So the audio version is going to be a little bit harder to follow. In fact, I cut out the proof in the middle and just summarize it in the audio version, so you don't get into it as deeply as you would if you go to the YouTube version, where the visuals are available and it's easier to follow. This episode in general may benefit from actually going to the YouTube channel and viewing it rather than just listening to it. But you should still be able to get the gist of what we're talking about and how the theory works just by listening to the episode. All right, welcome to the Theory of Anything Podcast. How's it going, guys? Great, Bruce. Happy to be here. Yeah, going great. So today we're going to talk about reinforcement learning. And both of you have seen this presentation at one point or another. But I have to warn the people who listen to this podcast, versus the ones that see it on YouTube: this one's got some important visuals. So if you really want to understand what we're talking about, you're going to have to go to the YouTube version so that you can watch the visuals. But I am going to try to still describe things in such a way that even if you're listening to it, you can get the gist of the theory that is being explained.
[00:01:30] Blue: And I feel like this is a really important theory to understand based on future podcasts that we're going to be doing, such as an upcoming podcast on animal intelligence. So understanding what we currently know about machine learning and reinforcement learning, and how it differs from what animals do, really ends up being kind of an important point. So I wanted to make sure that we did do an episode on reinforcement learning. We'll do one today, and then we'll also do the AlphaGo one that we've been talking about doing, so that we've covered the material fairly well. But I feel like the mathematics and the theory and things like that are just an important part of trying to make sense of a lot of these other theories. So I feel like this is still an important topic, and I've decided we're going to go ahead and do it, even though it may require some people going to the YouTube version. So reinforcement learning is a kind of machine learning. And if you can see the screen here from the recording, I've got a picture of Lee Sedol playing AlphaGo. And this is for those of you who've seen the movie. Have you both seen AlphaGo, the movie?
[00:02:47] Unknown: Yes.
[00:02:47] Blue: Yes. Fantastic movie. You probably recognize that image right there. So you've got Lee Sedol, who's the world champion, who is playing AlphaGo. And then you've got one of the programmers from the AlphaGo team who is making the moves that AlphaGo tells him to make. And the two of them play against each other. It's interesting. Lee Sedol keeps trying to read the guy across from him and then remembers he doesn't know anything. And so he's trying to read his face, which is a natural part of playing Go, and he can't, because the human he's playing against doesn't know what he's doing. It's actually the program that he's following that knows what it's doing. And so there's a lot of interesting things that kind of came out of this showdown between AlphaGo and Lee Sedol. And I highly recommend AlphaGo, the movie. It's free on YouTube. AlphaGo utilizes reinforcement learning, and that was the big breakthrough that allowed AlphaGo to come into existence. So why is it that reinforcement learning was different from other types of machine learning, such that it allowed us to finally make a professional-level, world-champion-level Go-playing program? We're going to talk about that today, and it's very interesting how this theory relates to playing Go. So just as a reminder, we covered some of this in our past episodes on artificial intelligence. There are roughly three kinds of machine learning: there's supervised learning, there's unsupervised learning, and there's reinforcement learning, which combines some of the attributes of the other two, but really it's kind of its own thing. So supervised learning, just real quick. Imagine that we're trying to make a program that
[00:04:44] Blue: takes a number of square feet and a number of bedrooms and predicts the sales price for a house. So the features would be the square feet and the number of bedrooms, and the sales price is what we're trying to predict. So we take real values from real life. We feed in how many square feet there were, how many bedrooms there were, and what the actual sales price was. In this case, it's going to do a simple linear regression, where it tries to fit a line to the points that exist. And the end result is that it can now make fairly good predictions. So if you take a house that's for sale, that hasn't sold yet, and you say this is the number of square feet and this is the number of bedrooms, it will give you a prediction of the sales price you should expect to sell it at, and it will be somewhat accurate. Now, in real life, you never want to do this based on just square feet and number of bedrooms. The reason why I held it to two features was because otherwise it was more than two dimensions and I wasn't able to show it on the screen. You'd probably want to put location in there, and probably a whole bunch of other features in real life. But this should illustrate exactly how supervised learning works. The key points are that a human gives it correct values, it learns a model, and then that model can be used to predict future values.
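(A minimal sketch of the house-price example above, using scikit-learn; the square footage, bedroom counts, and prices below are invented purely for illustration.)

```python
# Supervised learning: fit a line to (square feet, bedrooms) -> sale price.
# The training data here is made up; in real life these would be actual sales.
from sklearn.linear_model import LinearRegression

X = [[1500, 3], [2100, 4], [900, 2], [1800, 3]]   # features: sq ft, bedrooms
y = [250_000, 340_000, 150_000, 295_000]          # labels: actual sale prices

model = LinearRegression().fit(X, y)

# Predict the price of an unsold house from its features.
print(model.predict([[1600, 3]]))
```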
[00:06:07] Blue: OK, unsupervised learning is different from that. You've just got a whole bunch of data points, and you say, like in this case with K-means clustering, give me three clusters, and then it finds three clusters. So you can see the original data points, which are a big blob, and then you can say, give me three clusters. You have to tell it how many clusters, unfortunately. But say, give me three clusters, and then it finds three reasonable clusters out of the data there. And then, if you're trying to sell stuff, maybe these are clusters of customers that have similarities in how they purchase things and what their interests are, or something like that, which might be valuable for a recommendation engine, for example. So in this case, the key thing is that there's nothing being input from a human. There's no input of what the correct answer is. It's simply trying to find patterns in existing data as it is, without a human telling it what a correct answer is or even that there is a correct answer.
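(A similarly minimal sketch of the clustering example, again with scikit-learn; the data points here are random, and no labels or correct answers are ever given to the algorithm.)

```python
# Unsupervised learning: ask K-means for three clusters in unlabeled data.
import numpy as np
from sklearn.cluster import KMeans

points = np.random.rand(300, 2)           # 300 unlabeled 2-D data points
kmeans = KMeans(n_clusters=3, n_init=10).fit(points)

print(kmeans.cluster_centers_)            # the three cluster centers it found
print(kmeans.labels_[:10])                # cluster assignment for the first 10 points
```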
[00:07:45] Blue: Now, reinforcement learning, how does it differ from those other two more popular kinds of machine learning? The characteristics that really separate reinforcement learning from other kinds of machine learning are the existence of exploration, delayed rewards, and continuous learning. OK, so similar to supervised learning, the rewards for correct behavior work somewhat like giving a correct value, like in supervised learning, where we were giving the value of what the house actually sold for in our example. It's really kind of different, but you can see that there's something analogous about it. It's similar to what they call a loss function, where it's going to try to minimize how far off its predictions are from the correct values. It's similar to unsupervised learning in that you have no correct result to work with. It simply tries to learn to maximize the rewards that it's receiving. However, it's tempting at this point to say, oh, it's in between supervised and unsupervised learning, but there is actually a kind of machine learning called semi-supervised learning, which is literally in between supervised learning and unsupervised learning, where you give it a few labeled data points from a human and then it has to learn from data that doesn't have labels that a human put on it. So reinforcement learning isn't really in between the two. It's kind of its own thing; it just happens to have some characteristics similar to the other two. So why is it that reinforcement learning was able to beat the world Go champion when other kinds of artificial intelligence failed to do so? Now, this is an interesting thing. And I don't know how much you guys remember this... I mean, maybe I was just more into this when I was young, but I remember way back, probably as far back as the 90s, people were talking about this. On the news, you might come across something like this: they would talk about how we're very close to having an artificial intelligence algorithm that can beat the chess master, who at the time was Garry Kasparov, and they believed we were close.
[00:09:22] Blue: Right up to that point, we had never made a chess program that could beat a world champion in chess, but they felt they were close to it. And then a few years later, it happened. Deep Blue, from IBM, actually did beat the world champion chess player. Yeah, I remember that well. OK. And they used to always say, but Go... some people call it Chinese chess, but there's really nothing similar between chess and Go. It's the strategy game that's been popular in China for, I don't know, centuries, maybe even millennia; I'm not sure how far back it goes. And they used to say it would not be possible to make a program that can beat a Go master. They used to always say this would require artificial general intelligence. I don't think that term existed back then; it would require true artificial intelligence is what they used to say, because it is not something that the existing chess-playing algorithms can do well, and there was no real hope of ever scaling up what we were currently doing to be able to beat a world champion at Go. And I heard this multiple times on the news back then, and they always talked about how it would require real intelligence to beat a Go champion, that it just isn't the same as chess. So let me explain why they thought that. Now, clearly the fact that AlphaGo has now beaten the Go champion, and that happened only a decade or so later, right? It wasn't a long time.
[00:11:00] Blue: They thought it would be like hundreds of years and we would have to create AGI first. Clearly that prediction turned out to be wrong. But why were they making that prediction? For that matter, how is it that they knew they were close to beating the world chess champion, but they didn't think they were anywhere close to beating the world Go champion? Well, there's a really kind of good reason, even though it turned out to be a false reason, and I think there's something interesting to learn from this. So the algorithm that plays chess is the minimax algorithm, and we talked about that in our artificial intelligence podcast episode, right? Basically, in the computer, it simulates out every possible move, then it simulates out of each of those moves what its opponent might do if it's trying to win, and then it simulates out from each of those what it might do back. And it just tries to look as many moves forward as it can. Now, obviously this is an exponential growth of different possible moves. So even with the most powerful imaginable computer, you can't look that many moves ahead. It's just an intractable problem. So if I were trying to program this in a straight, naive way, you'd be lucky to get, say, five moves ahead, and a world champion can, in their mind, think five or more moves ahead.
[00:12:33] Blue: Now, the reason why a human can do this and a computer can't is because humans are awfully good at culling out things that don't matter, whereas the computer has to go try everything. And that turned out to be one of the main things that they did to improve chess play: they came up with algorithms that allowed the minimax algorithm to cull out moves that weren't promising, so that it could start to look ahead, say, 13 moves or something like that, and at that point no human can look ahead that far. And so now it's going to be able to beat any human. And one of the main things that allows it to do that is not just intelligently culling things out, but just having a lot of memory. And so as computers got more powerful, they would throw more and more memory on there, and they would be able to look forward further. Now, keep in mind that this is exponential growth of moves. So even though Moore's Law gives you exponential growth in the power of a computer, you only end up with linear improvements with something like the minimax algorithm, because it's an exponential problem: with exponential growth of computer power, you only get linear improvement in how far ahead you can look. Does that make sense?
[00:13:39] Green: Yeah, absolutely.
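(For anyone who wants to see the shape of the search being described, here is a rough sketch of depth-limited minimax. The game interface, legal_moves, apply_move, and evaluate, is hypothetical; evaluate stands in for the hand-coded board evaluation heuristic discussed next.)

```python
# Depth-limited minimax: search ahead a fixed number of moves, then fall back
# on a board evaluation heuristic because the game can't be searched to its end.
def minimax(board, depth, maximizing, evaluate, legal_moves, apply_move):
    moves = legal_moves(board)
    if depth == 0 or not moves:
        return evaluate(board)              # stop searching and estimate the position
    if maximizing:
        return max(minimax(apply_move(board, m), depth - 1, False,
                           evaluate, legal_moves, apply_move) for m in moves)
    else:
        return min(minimax(apply_move(board, m), depth - 1, True,
                           evaluate, legal_moves, apply_move) for m in moves)
```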
[00:13:41] Blue: So each new generation of computer can't go that much further than the previous one. In fact, maybe you'd have to double several times before you could even go one more move further with the minimax algorithm. The other thing that they did with Deep Blue is they had a really clever board evaluation algorithm that was hand-coded by a human. So what's a board evaluation algorithm? Well, since you can't play the game all the way to its end, that's just exponentially impossible, what you do is you play forward a number of moves and then you evaluate the board for each of those moves, and you say, OK, how good of a board position is this? And if you've got a good algorithm that tells you how good of a board position that is, then you don't have to go any further. You can say, oh, this is the best board position I've been able to find in my search; I'm going to make the move that gets me towards that board position. So how might you go about coding a board evaluation algorithm for chess, do you think? Come up with a really bad one, just something that gives a feel for how it might be done.
[00:14:47] Green: I don’t even know if I can come up with a bad one, Bruce.
[00:14:51] Blue: Are either of you into chess at all?
[00:14:54] Red: Not really. I’m not great.
[00:14:57] Blue: And I'm not either. I'm not into chess. I remember, though, when I was a kid, I'd buy books on chess just because I thought it was a cool game. And one of the things they teach you is to count a number of points per piece. So a pawn's worth one point. And I can't remember exactly, but it's like a bishop and a knight are worth, I want to say, three points, a rook is worth five points, and a queen is worth eight points. It's something along those lines, right? So you'd have to take eight pawns to make trading one queen be worth it. And it's kind of this rough heuristic that they use to try to evaluate if a move is worth it or not. Well, you could do something like that to evaluate a board, right? You could say, well, if I make these moves, then in the final board I count the number of points for the other side versus my side, and if I'm now ahead, that's the one I would want to make. Now, by itself, this would not be a particularly good board evaluation algorithm. It would probably work OK against an amateur. You would probably want to do more than that. You would probably want to have some way of, I don't know, evaluating the danger that your queen is in. I mean, I don't know exactly. One of the things is that to make this really work, there's a trade-off that takes place. You need your board evaluation algorithm to be really fast, because if it's slow, you can't look ahead very many moves anymore, because now you have to perform that board evaluation for every single possible move you consider.
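(A sketch of the point-counting heuristic just described, as one side's material total minus the other's. The piece values below are the rough textbook ones; the episode half-remembers the queen as eight points and the usual convention is nine, but either works for illustrating the idea.)

```python
# A deliberately crude board evaluation: just count material for each side.
PIECE_VALUES = {"pawn": 1, "knight": 3, "bishop": 3, "rook": 5, "queen": 9}

def material_score(my_pieces, opponent_pieces):
    """Positive means I'm ahead on material, negative means I'm behind."""
    mine = sum(PIECE_VALUES.get(p, 0) for p in my_pieces)
    theirs = sum(PIECE_VALUES.get(p, 0) for p in opponent_pieces)
    return mine - theirs
```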
[00:16:30] Blue: So having a quick and dirty board evaluation algorithm actually may outperform a really intelligent board evaluation algorithm. They have to come up with something that is both intelligent and fast. And that's what they did for Deep Blue. They came up with, and I don't even know what it was, but it was something very clever that allowed them to evaluate boards well while still being fast enough that it could look ahead 13, 20 moves on some search paths, until it was just able to beat any human. And this is how you would do a chess program. Now, why would this not work for Go? There are a couple of reasons why. One of them is that the number of possible moves in chess is much, much smaller than the number of possible moves in Go. So the branching factor, that's what they call it, for Go is much higher, and so it explodes exponentially much faster than it does for chess. If that were the only problem, I don't know if that would be a problem on its own. The second problem is that there's no really well-known board evaluation algorithm. So unlike chess, where you can count the number of points for the pieces or something like that, where there are well-known heuristics that are easily computable, Go masters use their intuition. They look at the move, and their intuition, after playing hundreds or thousands or whatever games, they've built an intuition that this board position is better than this one, even if they can't always explain it. It's often inexplicit knowledge; they can't explain why that board position is better.
[00:18:11] Blue: And the reason why that's necessary is because Go can switch really fast. You can have a good board position and then some other move can wreck it. And so it's unlike chess, where you're clearly ahead or behind based on which pieces you have. There's just nothing like that in Go that you can hold on to. And so what Go masters have is basically just human intuition that you build up over time. And that was why they thought Go would require artificial general intelligence: because they thought, how in the world do you program intuition into a computer? Is that even possible, to program intuition into a computer?
[00:18:49] Green: We don't even really know what intuition is, right? I mean, that's an interesting word, because if you asked someone to describe intuition and what it means for us, we don't really know that.
[00:19:05] Blue: So how would you try to describe intuition? Take a stab at describing intuition.
[00:19:09] Green: I would say it's a gut feeling that can't be validated by any facts.
[00:19:19] Blue: OK, I think a lot of people would probably agree with you on that. Do you have any difference of opinion there, Tracy?
[00:19:25] Red: No, actually, I think that was a great way to explain it. I would have to agree.
[00:19:34] Blue: So here's the big secret that I don't think people understood back in the 90s. Carlos Perez is an AI and machine learning researcher that I'm connected to on Twitter, and we talk once in a while. He has a book called Artificial Intuition. His argument is that machine learning isn't artificial intelligence, it's artificial intuition. I think that's exactly accurate. Machine learning, as we learned from our past podcasts, we're not quite sure how it comes up with what it does. It creates this function. It's heuristics. It doesn't necessarily have a good explanation for how it does what it does. But it somehow figures out how to do a computation that can, say, recognize a face with some high degree of accuracy. Or, for that matter, you can use it to build an artificial intuition for what is a good board position in Go. And that is what they did. So here's the difference between AlphaGo and Deep Blue. Deep Blue is what's known as artificial intelligence. There are no learning algorithms in Deep Blue. You can let Deep Blue play itself as many times as you want, and it will never get better at playing, because all it's doing is a minimax algorithm with a clever human-built board evaluation algorithm. It never changes or updates or gets better at what it's doing. AlphaGo's board evaluation algorithm wasn't coded by a human. It was machine learning, specifically reinforcement learning. And it simply developed, exactly like a human, an intuition that if I make this move, the board will look like this, and that's a better board position than if I made this move. We'll see how it works, but it's based on machine learning. And that's what machine learning does.
[00:21:26] Blue: It comes up with heuristics that work like this, which are very similar to the way people describe human intuition. In fact, I've wondered if human beings, when we talk about intuition, have something like machine learning algorithms that we utilize; that may not be true, by the way. But there seems to be something similar about the way we use our intuitions and the way machine learning works. We know there are still some really big differences, too. With human intuition, you think about the number of games that a Go master plays to gain that intuition, and it's far fewer games than the billions of games that reinforcement learning has to play to get an equivalent intuition. And we know humans also have explanatory knowledge, whereas machine learning doesn't. So undoubtedly there's a difference there, where what a human is doing is actually a combination of intuition and explanatory knowledge, where they can explain some things for themselves and some things they can't. So they can learn faster than a computer can because of that. And I'm not even sure what it is I just said. Like, what I said is an accurate statement, but what is explanatory knowledge and how does it interact with intuition? No idea. That is a huge mystery at this point.
[00:22:43] Green: Well, and you said something, you said humans use their intuition. I don't think that's quite it... you know, people don't think, I'm going to use my intuition right now. True. If anything, it's kind of the opposite. The intuition happens and people decide whether or not to pay attention to it. Yeah. Or whether they're going to revert back to only dealing with the observable, or what we conceive of as being, you know, reality, versus, oh my gosh, my gut says I should do this contrary thing.
[00:23:21] Blue: Yes. By the way, have you guys ever heard of The Second Brain, the book The Second Brain? No. I haven't read the book, but I had a friend who told me about it. It's a book I probably should read. People don't know this, but the digestive tract has almost as many neurons as the brain. So the digestive tract actually is a second brain of sorts. And so the idea that you go with your gut might be literally true in terms of how intuition works.
[00:23:52] Red: All right, right. The second brain.
[00:23:56] Green: That's certainly true when I'm hungry. That's a potentially interesting topic for a future podcast.
[00:24:04] Blue: Yeah, we'll have to read that book and then talk about it. I should probably note that it might also not be true. I would like to see what evidence he comes up with and why he's making that argument, or if this is just one of those things where he's come to something that's interesting but there's no real truth to it, which happens all the time. But I just thought I'd throw that in there, just kind of an interesting side note. So this is what separates AlphaGo from Deep Blue. AlphaGo is machine learning. It learns. It learns by playing itself. Initially, it learned by looking at human games. They would take champion human games and put them into AlphaGo, and it would learn from those games. They later got rid of that and just let it play itself, so that it was learning from the ground up how to play Go well. And the final version of AlphaGo, the one that is used today that can beat anybody, is the version that just plays itself. It doesn't even use input from human champion games anymore. And that machine learning component, the board evaluation algorithm, this artificial intuition that it uses, is so good that AlphaGo can be changed to look ahead only one move and it can still play at professional levels. It probably couldn't beat the world champion that way, but it can play at professional levels because the board evaluation algorithm is just that good.
[00:25:33] Green: Wow, that’s pretty impressive.
[00:25:35] Blue: Yeah, it is. So OK, now how is it that reinforcement learning is able to do this? Now we're going to talk about the actual theory. To understand reinforcement learning, we have to talk about something called the Markov decision process. And now we really are going to be digging into outright theory. This is the graduate-level stuff that they taught me in school. OK, so we first want to define what the Markov decision process is. Let me give you kind of a quick layman's overview and then we'll get into the specifics. The idea is that you take the state of the world and you don't look at its history. You just take the state of the world as it is now. And the theory of Markov decision processes is that you don't need any history to be able to tell what the next state of the world should be. Now, is this actually true? Well, we know that the laws of physics, at least the current way we think of the laws of physics, ignoring constructor theory for a moment, the prevailing conception is that you've got these particles, they've got an initial condition, and they're going to be moving. If you knew every particle in the universe and you knew its momentum and so forth, then you would know what the next state of the universe is going to be. This is Laplace's demon. This is something that's very old in philosophy. So at least our current kind of Newtonian physics, and even general relativity, or even quantum physics for that matter, the way we conceive of physics today is actually like a Markov decision process. So that's at least somewhat promising
[00:27:14] Blue: as to why Markov decision processes might be valuable. Now, in real life, though, even if in principle every problem could be put into a Markov decision process, trying to actually do that in real life is hard. And there are many problems that would be very tough to solve as a Markov decision process as of today. If you think about it, though, let's say that there was something important about your history that you wanted to track. What you can do is promote it to be part of the current state of the world. There's no reason why you couldn't always do that. So if you need to know what was true three moves ago, you simply make that piece of information from three moves ago part of your current state of the world as well. So in some ways, this is why we suspect everything can be put into a Markov decision process. It's just very difficult. It may even be intractable in many cases. But that's why it became of interest. Bellman, the famous Bellman... I don't know if you guys have heard of him or not, but his name comes up all over the place. He's the one who figured out a lot of these things and the theory on how to solve the Markov decision process. Markov is a famous scientist, also from Russia. So the key thing about the Markov decision process is the Markov assumption, which is that you're only using the current state, none of the previous states. OK.
[00:28:43] Blue: And then they would define that as S, which would be the set of states the agent can be in. And it would actually be the state of the whole world, or at least the whole world as far as the agent understands it. I'll explain the difference between those in a second. A is the set of actions the agent can take. And then at time T, the agent senses the current state, S sub T. So S is a state, and S sub T would be the state at time T. Does that make sense? Nothing too fancy here yet, really, despite the notation.
[00:29:15] Unknown: Yeah.
[00:29:16] Blue: And then the agent takes action A sub T, which would just be the action it takes at time step T. OK. That represents the state of the world and the possible actions the agent can take. The environment then gives reward R sub T, which would be the reward at time T, in response, and then moves the agent to state S sub T plus one. So you're at time T, the state of the world is S sub T, you take action A sub T, you receive reward R sub T, and then the state of the world is now S sub T plus one, which just means the next time step. OK. For the sake of making this a little more concrete: you're a robot, and you've decided you're going to take the action to move north. So the state of the world then gets updated to where the robot is now further north than it was before. OK. Fairly simple, actually. The environment then could be thought of as containing two functions. One would be what we call the state transition function and one would be what we call the reward function. So the state transition function, I've got a little... I don't know why they always use Greek characters, but we've got this little Greek character there. That's just a function; we can call it the state transition function instead. You pass it two parameters, the state of the world at time T and which action you're taking at time T, and it returns the new state of the world, whatever that is. OK.
[00:30:44] Blue: And then you've got this other function, the reward function, where you pass in the state of the world at time T and the action you take at time T, the same two parameters, and it returns whatever reward the environment gives back to you. OK. So for the sake of argument, we have the robot. It wants to move north, and when it does, it reaches its goal. So it is now in the goal state and it receives the reward for being in the goal state. Does that make sense? Yeah, absolutely. OK. All right. So that's the Markov decision process in a nutshell. So let's actually do a concrete example now. It's going to be helpful to have the YouTube version to see this; I'll try to describe it, though. So we've got a maze. I call it a maze, but it's not really a maze. What it really is, is a three-by-three grid with no walls at all. You start in the lower left-hand corner and your goal is to get to the upper right-hand corner. So how would you get there? Well, one way: you might move up twice and then move to the right twice, or you might move to the right twice and then move up twice. OK, that's all we're asking our robot to do, to figure out that one of those two paths is the shortest path. It can even get to the goal a bit slower if it just wanders around until it happens to hit the goal, right? And then it would still receive the reward of 100.
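(A minimal sketch of the agent-environment loop just described. The delta and reward arguments stand in for the environment's two functions, and the policy is whatever rule the agent uses to pick actions; all of the names here are placeholders for illustration, not anything from the episode's slides.)

```python
# The Markov decision process loop: at each time step t the agent sees state
# s_t, takes action a_t, receives reward r_t, and the environment moves it to
# s_{t+1}. The agent never needs the history, only the current state.
def run_episode(start_state, policy, delta, reward, goal_state, max_steps=100):
    s = start_state
    total = 0
    for t in range(max_steps):
        a = policy(s)          # agent picks an action in state s_t
        r = reward(s, a)       # environment hands back reward r_t
        s = delta(s, a)        # environment moves the agent to s_{t+1}
        total += r
        if s == goal_state:
            break
    return total
```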
[00:32:07] Blue: So any path to that goal will work, although, as you'll see, we'll have a preference for how we make it find the shortest path. So how many possible states are there in this world? This is called a grid world. How many possible states are there in this world?
[00:32:23] Green: Um, let's see... eight? Twenty-eight?
[00:32:28] Blue: OK, tell me how. I actually want to hear your thinking. How would you go about counting states? And by the way, there's actually more than one possible correct answer. There's one answer that's particularly simple, though. So describe your thinking on this. How would you come up with the number of states that this world has?
[00:32:48] Green: You count the boxes.
[00:32:51] Blue: OK, that is by far the simplest way to do it.
[00:32:54] Red: Yeah, that's what I was saying. Just count the boxes. How many boxes are there? Except for start. So that's how you get eight: except for start.
[00:32:57] Blue: We would actually want to count start also. OK, so it would be nine. In fact, you may not count goal. So maybe it would be eight if you don't count goal, but it depends on how you implement it. How many possible actions are there for our agent here? And again, there's more than one possible correct answer here.
[00:33:16] Red: It seems like it could be limitless because it could double back on itself forever before it gets to the goal. I’m thinking.
[00:33:23] Blue: OK, let’s make this a little simpler. If you’re in any one given state, what are the possible actions?
[00:33:29] Green: You could move forward or you could move back.
[00:33:29] Blue: OK, so that's two. It seems like you're counting rotation, then. So there would have to be some way to rotate too, if you can only move forward and back, correct? So that you can move up, to the left, to the right.
[00:33:43] Green: Oh, that's a good point. You would need... a step could be to turn one direction or turn the other direction.
[00:33:49] Blue: Let's simplify this. You don't need to rotate. You can just simply move from one square to one that's next to it. How many possible moves are there now?
[00:34:01] Green: I don't know exactly what you mean by next.
[00:34:08] Blue: So if you're at the start, how many squares are there next to you?
[00:34:12] Green: There's two squares. Well, three squares if you count the... well, two squares. Yeah, two that are directly next to you.
[00:34:20] Blue: And then a third one if you count the diagonal. OK, so why is it that there are only three possible moves, or only two if we don't allow diagonal moves? It's because the start square is at the bottom corner. OK, let's say you were in the middle square. How many possible moves are there?
[00:34:40] Green: One, two, three, four. And we're not counting the diagonals?
[00:34:44] Blue: Give me both answers. They're both good answers.
[00:34:48] Green: Four if you don't count the diagonals. And if you do...
[00:34:53] Blue: Yeah. OK. OK, so for the sake of making this very simple, we're going to not count diagonals, although that could be legitimate. There's no reason why we shouldn't count the diagonals. It's just that we're going to happen to program it so that this agent only has four possible moves. So there are nine possible states and four possible moves. And in fact, we're going to make this even simpler. If you're in the start state, as you pointed out, if you don't count diagonals, there are only two possible moves. We're going to still allow it to move in the two illegal directions; it's just that the transition function is going to leave you in the start state. So you can move down from the start state and you just wind up back in the start state again, or you can move to the left in the start state and you just wind back up in the start state again. Therefore, every single square has four possible moves. It just simplifies the programming. Does that make sense?
[00:35:40] Green: Yeah, it does.
[00:35:42] Blue: OK. So here is how we might do it. I didn't show the ones that go off the screen; just imagine they're there. But here is the maze, and here are the four possible moves from each square. OK, so this is our states and actions put into a graphical format. Now you've got nine states, you've got four actions, and I'm showing you what it looks like on an actual grid world. How might you code that? Well, you could do something like this. You could create a table in memory that lists out every single combination of states and actions. So state one, go right; state one, go down; state two, go right; state two, go left; et cetera. Right. You'd want to do all four. I had to fit this on the screen, so I cut out the ones that would just loop back to where you were starting. And then notice that I'm showing that if you're in state one and you go right, you wind up in a new state, state two. That's just a table: state equals one, action equals right, new state equals two. And this table represents the state transition function. It is the state transition function. Can you see that that is the case? Yes. Yeah. OK. So based on that, how might we make a very simple reward function? Remember that state three, the upper right-hand corner, is the goal. What might a reward function look like? Use a table, just like we did for the state transition function.
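(Here is roughly what that table looks like in code, assuming the state numbering used in the episode: states 1 through 3 across the top row, with state 3 as the goal, and states 7 through 9 across the bottom row, with state 7 as the start. Moves that would run off the grid just leave you where you are, as described a moment ago.)

```python
# Build the state transition function as a lookup table for the 3x3 grid world.
def build_transition_table():
    delta = {}
    for s in range(1, 10):
        row, col = divmod(s - 1, 3)                 # state 1 is top-left, 9 is bottom-right
        for action, (dr, dc) in {"up": (-1, 0), "down": (1, 0),
                                 "left": (0, -1), "right": (0, 1)}.items():
            r2, c2 = row + dr, col + dc
            if 0 <= r2 < 3 and 0 <= c2 < 3:
                delta[(s, action)] = r2 * 3 + c2 + 1
            else:
                delta[(s, action)] = s              # bump off the edge: stay put
    return delta

delta = build_transition_table()
print(delta[(1, "right")])   # state 1, go right -> state 2, matching the table on screen
```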
[00:37:15] Green: This is a lot harder than I thought.
[00:37:17] Red: I didn't think I was going to have to think today. I want to look at the formula again so I can try to figure it out... it was R of S T and A T. Well...
[00:37:26] Blue: The formula is very simple. You score 100 points if you make it to state three. Otherwise, you don't. All right, I'll give it to you. How about that? So you just list the states, and if you're in state three, you get 100. Otherwise, you score a zero. Does that make sense? Is that exactly equivalent to scoring 100 if you make it to state three?
[00:37:46] Red: I guess what I'm not getting is state three... we're implying that it's still 100, based on the previous screen where it was saying state three is just arbitrarily 100?
[00:37:56] Blue: Yeah, that's the goal. You're trying to get from state seven, the bottom left-hand corner, to state three, the upper right-hand corner. So you score 100 by getting to state three. So this is the reward function. It's literally that simple. It just lists out the nine states, and it says that for each state you get a reward of zero, except for state three, where you get a reward of 100. Now, in real life, this isn't the best way to do a reward function. You'd probably want to do a reward of negative one for every state except for state three, which would still be 100. The reason why you'd want to do that is because you want it to feel a little bit of a penalty for not having arrived at the goal yet, so that it has a natural incentive to try to find the quickest path to the goal. So in terms of just smart programming and making it so that the agent learns quicker, giving it a negative one reward for every state except the goal state just makes sense, because it's going to learn faster that way. For our purposes, though, that complicates the math, so I'm going to still use zero. Zero would still work in real life; it would just learn slower. But I wanted to note that you'd probably want a slight negative penalty for any state that isn't the goal state. All right. How would you solve this problem? Well, what you really want is to create an optimal policy. So what is an optimal policy? Well, I've got a picture of what one looks like up on the screen.
[00:39:27] Blue: Basically, what this is, is the three-by-three grid, and no matter what state you're in, it tells you which direction to move so that it is the quickest way to the goal from that state. OK, so whatever state you find yourself in, you always know what the best possible move is. Now, in this case, there are actually two best possible moves in some states. It doesn't matter; we just want it to know at least one of the two most optimal moves. OK, so can you see that that is an optimal policy? It knows every single state in the world and it knows exactly which is the best action in that state to try to get to the goal as quickly as possible.
[00:40:07] Red: Yeah, look at that.
[00:40:08] Blue: Yeah. Yeah. OK. The way you would create this in real life is using something called dynamic programming. Now, it's interesting. Dynamic programming sounds like it's some sort of computer programming, but the term dynamic programming predates electronic computers. Back in the early days, a lot of people don't know this, but the term computer originally referred to a bunch of people, usually women, who would do computations. They would all stand around with note cards, and they would pass the note cards to each other, and they would have a rule that they were following, and they would do the computation based on that rule. They would do it very quickly and then pass the card to the next person. Basically, it was a human computer. And this is how people did computations back before electronic computers existed. These women would get really good at it, and they could very quickly pass the cards to each other and do their computations. Each of them does one simple computation, passes the card to the next person, and they could perform all sorts of computations quite quickly by doing that. And you would end up with basically a program, where you set up each of the women to have the one step they need to do. Maybe it's like an if-then statement or something like that. Each of them would perform that statement, and they would do it very quickly, and you'd end up with the result you needed at the very end. And that's where the term computer comes from: these women who were computers.
[00:41:42] Blue: And when they started to invent the digital computer, the electronic computer, it was an analogy to what these women did: now you have this device, this machine, that can perform the same types of computations that these women could. So they started calling them electronic computers or digital computers, until eventually the term only meant that, because there was no longer a need for this profession anymore. It disappeared. And once that disappeared, the term came to mean only the electronic type of computer. OK, the term dynamic programming comes from the era prior to the electronic computer. So the programming involved would be with these human computers that they would use. Does that make sense?
[00:42:28] Green: Yeah, yeah, it does.
[00:42:30] Blue: By the way, Hidden Figures, the movie, I believe that's what they actually were. The women in that movie were the computers, and they show kind of the introduction of the electronic computer, and they would then become the programmers of the electronic computers. It's kind of a cool movie, by the way, for anyone that's interested in that. So dynamic programming, then. Without getting into it too deeply, I'm not going to describe to you exactly what dynamic programming is. Basically, what it is is a general technique that allows you to take one big problem and split it up into a bunch of little problems. The Fibonacci numbers are an example of dynamic programming. If I want to solve for the fifth Fibonacci number, this is from a past podcast, then I solve first for what the first two Fibonacci numbers are, which is just axiomatically given. And then I solve for what the third one is, and then the fourth one, and then the fifth one, and I save off the answers for all the previous ones so that I've got them in memory and available when I get to the fifth one. That's a very basic sort of dynamic programming. In this case, you would basically do that sort of iterative process: you would create one policy after another and then try to improve it until you couldn't improve it anymore. There's actually something called policy iteration and something called value iteration. They get the same result, but they do it in slightly different ways. Both of those are types of dynamic programming. And the end result of either is ultimately an optimal policy.
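(The Fibonacci example mentioned above, as a few lines of code: solve the small subproblems first, save the answers, and reuse them for the next one.)

```python
# A very basic sort of dynamic programming: build up Fibonacci numbers from
# the ones already computed, keeping every earlier answer in memory.
def fib(n):
    answers = [0, 1]                 # the first two are given axiomatically
    for i in range(2, n + 1):
        answers.append(answers[i - 1] + answers[i - 2])   # reuse saved results
    return answers[n]

print(fib(5))   # the fifth Fibonacci number, built up from the earlier ones
```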
[00:44:04] Blue: Now, to be able to do dynamic programming, though, to solve the Markov decision process, you have to know the reward function and the state transition function, which in this case we do, because I just showed you what they look like for this grid world. And if you have that state transition function and that reward function, you simply use them to do dynamic programming to solve the Markov decision process. And the net result is an optimal policy, exactly like what you see on the screen. And now your agent knows exactly what to do in every possible state it might find itself in, and it always makes the optimal move. Does that make sense?
[00:44:38] Red: Yes.
[00:44:39] Blue: OK. Now, the optimal policy would be based on a value function, which I've now shown on the screen. It's the same three-by-three world, but what we're doing is assigning a value to each state. Basically, the goal state is still 100, and every state next to the goal state is a discounted version of it. So 90 points for being on one of the two squares that's right next to the goal state. And then a square that's next to one of those squares is a little less than that, so 81 for each of those. You can see I'm taking 90 percent each time, right? Until finally you get back to the start state, which gets the lowest value, 65.61, because you've taken 90 percent of 90 percent of 90 percent. And that is what the value function looks like. OK, and basically dynamic programming, if you're doing value iteration, solves for that, and then it creates a policy out of it. The other way, policy iteration, works directly with the policy, but it's basically the same thing. How might you translate between the value function and the policy, the optimal value function and the optimal policy? It's fairly simple, and there's nothing magic about it. What would be a very simple way of doing that?
[00:45:59] Green: I have no idea.
[00:46:02] Blue: Basically, you just move towards the state next to you that's higher than you. So if I'm in the middle square on the top row, so my value is 90, the arrow in my policy would be the one that moves to a square that's higher than me. Well, which one is that?
[00:46:21] Red: The 100.
[00:46:22] Blue: That's right. So notice that if you put an arrow from the 90 to the 100, it's exactly the same as the arrow in the optimal policy. OK, and that's true for all of them. Look at every single one of these. Imagine yourself in that square and then put an arrow that goes to one of the squares that's higher than it, and it will exactly match the optimal policy that I drew. As I mentioned, some of these could have a second optimal policy that's just as optimal; you just have to pick one of them. Can you see how to translate between the two?
[00:46:51] Red: Um, yeah. You mean, like, the 81 should have, to me, two arrows, one up and one right, because the next nearest values are both 90? OK, but you just pick whichever direction it goes?
[00:47:03] Blue: You just have to pick one. So I picked up, but I could just as well have picked to the right. So let's do an example. Take a look at the lower right-hand corner. What's the value in the lower right-hand corner? 81. OK, where would you draw the arrow to? We're not allowing diagonals, so there's only two squares next to it: the one above it and the one to the left of it. Which direction would you draw the arrow in the optimal policy?
[00:47:33] Red: Upwards, to the box above it. That's the higher value, right?
[00:47:37] Blue: That's right. Now, you can do that for every single one of those squares. All you have to do is check each square next to it and then draw the arrow towards whichever one is the largest. And if you have two that are the same, then you just pick one of them. Does that make sense?
[00:47:52] Green: Yes.
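(Putting the last few minutes together, here is a sketch of value iteration on the three-by-three grid world, plus the greedy read-off of a policy. It follows the numbers in the episode: the goal, state 3, is pinned at 100, and every other state converges to 0.9 times the value of its best neighbor, with the policy just pointing each state at its highest-valued neighbor.)

```python
# Value iteration on the 3x3 grid world, then a greedy policy read off of it.
GOAL, GAMMA = 3, 0.9
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def delta(s, action):
    row, col = divmod(s - 1, 3)
    dr, dc = MOVES[action]
    r2, c2 = row + dr, col + dc
    if 0 <= r2 < 3 and 0 <= c2 < 3:
        return r2 * 3 + c2 + 1
    return s                                    # illegal move: stay put

V = {s: 0.0 for s in range(1, 10)}
V[GOAL] = 100.0
for _ in range(50):                             # sweep until the values settle
    for s in range(1, 10):
        if s != GOAL:
            V[s] = GAMMA * max(V[delta(s, a)] for a in MOVES)

# Greedy policy: in each state, move toward the highest-valued neighbor.
policy = {s: max(MOVES, key=lambda a: V[delta(s, a)]) for s in range(1, 10) if s != GOAL}

print(V[7])        # start state: roughly 65.61, as in the episode
print(policy[7])   # picks "up" here; "right" is equally good
```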
[00:47:52] Blue: OK, here's the problem, though. This is easy so far. If we know the state transition function, then we can use dynamic programming to solve for an optimal policy, and then our agent will be able to perform optimally. But what if we don't know the transition function, or, for that matter, what if we don't know the reward function? Now, in this grid world, I made the world myself. I came up with the grid, I came up with what the states are, I came up with what the actions were. So of course I know what the transition function and the reward function are. But in real life, if I'm trying to, let's say, make a stock-buying program, I don't know what the transition function is. The world contains a transition function, but I don't have any idea what it is. For that matter, I may know what the rewards are; for a stock-buying program, I know I want to make more money and I don't want to lose money. But I don't know what the reward function is. I don't know, if I make this purchase, what my reward is going to be. After I do it, though, I can look at how much money I made or lost, so I can know what the reward was for that move, for that action. But I don't have any idea what the transition function is. What actually causes the stock market to go up or down? Nobody knows, right?
[00:49:12] Blue: So if reinforcement learning had to use dynamic programming, then you could only use it on Markov decision processes where we happened to know the state transition function and the reward function, which is basically never, except in little toy problems like the one I'm demonstrating. This would make solving a Markov decision process not particularly useful. OK. So what we really need is some way to solve the Markov decision process where we don't know what the state transition function is or what the reward function is. Now, when I originally gave this presentation, I did a little aside about David Deutsch. Since people listening to this podcast are going to know more than this, I can just summarize it really quickly. Basically, the real world is computable, right? As per the podcasts on the Church-Turing thesis and the Turing principle, the whole world is computable. So we basically know from that theory that everything has a state transition function and everything has a reward function. We just don't know what they are, but they exist. Right. And that all came from Alan Turing. Based on that, we have this idea of the Turing principle: all of reality can be simulated on a universal computer. Because we know that it must exist, there must be some way to actually solve this problem. Just how do we do it? Well, this is where the clever stuff comes in.
[00:50:50] Red: I need a second cup of coffee for this.
[00:50:54] Blue: Now, right here, I've written out the mathematics. So this probably makes no sense to you. It looks like Greek.
[00:51:00] Green: This math hurts my brain just even.
[00:51:03] Blue: So I'm actually going to take the time to take you through the mathematics. And I don't really expect you to understand it exactly; I'm trying to just get you to get the gist of it. Roger Penrose, one of my favorite authors, he's a world-class physics and mathematics person, right? And he talks about how when he reads math formulas in papers and such, he doesn't know what they mean either, and he just tries to initially get the gist of what they mean. It's a lot easier to get the gist than it is to understand exactly what it's saying. But the gist is beautiful. There's this beautiful theorem that I'm going to try to go through here. And if anyone later on, maybe someone in the OMSS program, is using this, this will explain how reinforcement learning actually works and why it works. OK, so let me take you through each of these parts. Let's start with the value, reward, and transition functions. The reward function is the actual rewards; we already know how to do that. It's the grid with 100 on the goal state, which is state three, and everything else is zero. The value function is what I already showed you: it's the same grid, but with the discounted value of being close to the goal. So the upper right-hand corner is 100 because it's the goal. The two next to it are 90. The three next to the 90s are 81s. The two next to the 81s are 72.9s. The start state, which is the furthest away from the goal, is 65.61.
[00:52:36] Blue: That's just 90 percent off each time, taking a discount. If you're familiar with economics, you already understand the idea of net present value and the idea of discounting value that's in the future. Having a million dollars today is very different from having a million dollars ten years from now. So they do a net present value calculation to say how much a million dollars in the future is worth today, and you use a discount factor. It's the exact same idea. So the transition function is just a list of which actions move you from one state to another. The Q function... I've got to define the Q function. The Q function is like a combination of the states and the actions. So the only real difference here is that instead of assigning the value of 100 to the goal, I'm going to assign a value of 100 to the move that moves you into the goal state. For those who are watching the YouTube version, you can see that the two squares next to the goal state each have an arrow that goes to the goal state, and that arrow is now worth 100. Instead of being in the state, state three, taking the action that gets you to state three gives you the 100 instead. Do you see the difference there? It's basically the same thing. OK, by doing that, the Q function looks slightly different. I'm now going to call state three, the upper right-hand corner, the goal, and I'm going to assign a value of 100 to the move that comes into it. And then to get an optimal value function, you basically just take the max.
[00:54:12] Blue: So I've got a bunch of arrows, which are actions I can take from that state, and I take whichever action gives me the highest value. So it assigns 100 for being next to the goal state in this case. There's no real difference; mathematically, it works out to be exactly the same as the way we were doing it before. It's just a little bit easier to calculate by putting it into the Q function, where we're keeping track of states and actions together. OK, does that make sense? That's what a Q function is, and what an optimal value function is. And then, finally, we have the optimal policy, which can be obtained from the optimal value function in the way we just talked about. And now we can get into the actual proof. In the audio version of this podcast, I cut out the proof that then followed. If you really are interested in this subject, you may want to visit our YouTube channel, where you can find the proof written out in full, and it should be a lot easier to understand than if I just gave you the audio version of it. Let me explain the basics, though, of what I went over. There's a series of equations, most of which are just kind of patently obvious. They simply define things. They define what a value function is; they define what a policy is, in fairly intuitive ways. When you take them all together and then use those functions to substitute into each other, you end up with the Q function defined in terms of itself. The idea is that the Q function is a value for a state and an action inside your world.
[00:55:42] Blue: And it defines it in terms of the reward that you receive, plus a discount of all future Q values that come from the state that you've then moved into, discounted by the number of states it is away from you. And it does that in terms of the Q function itself. By putting this together into a Q function that's recursively defined in terms of itself, it actually sets us up so that we're able to create a function, program it in Python or something like that, where the agent, the robot or whatever it is that you're trying to train, simply takes rewards from the real world and uses them to slowly converge to the correct Q function, which is a table of states and actions: the very states that exist in the world and the actions that you can take from each state. Now, by doing this, it will actually over time, in fact in the limit at infinity, converge exactly to the right values. This can be proven mathematically. In real life, you'd never go to infinity, but you can get to the point where the Q values you come up with are fairly close to what they would be if you were doing dynamic programming, which gives you an exact answer. Dynamic programming, though, requires you to have the transition function and the reward function. In this case, you don't need either of those. So by defining it in this way, you're getting rid of the need to have the transition function or the reward function.
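(For readers who want the recursion spelled out: in the standard deterministic Q-learning setup, writing the transition function as δ and the reward function as r, the relationship the proof arrives at can be written as

$$Q(s, a) = r(s, a) + \gamma \, \max_{a'} Q\big(\delta(s, a), a'\big)$$

from which the optimal value function is $V^*(s) = \max_a Q(s, a)$ and the optimal policy is $\pi^*(s) = \arg\max_a Q(s, a)$. This is the usual textbook form of the equation, not a transcription of the episode's slides.)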
[00:57:12] Blue: Now, of course, the whole goal was to get rid of the transition function and the reward function, because in the real world, we never know what those are. And because we don't know what those are, we can't use dynamic programming to solve a Markov decision process. By moving to Q learning, since we got rid of those functions and we can approximate things in a different way, we're now able to essentially solve these problems without having to know what the transition function and the reward function of the world are. And what you come up with ultimately is this approximation of the Q table that you need. And once you have that, it's very easy to convert it into a policy for your agent that tells it, for every single state, here's the correct action to take. This is basically how reinforcement learning works, and this is how AlphaGo works. This is how anything that deals with reinforcement learning works. The rewards simply perpetuate out from the goal state, let's say a win in AlphaGo, until it figures out which board states are the most likely to lead to a win in the future. So this is what I went through mathematically; I showed how it worked. The end result is exactly what we wanted. It's a general purpose learner: once you give it the states, set up the agent to pick up the rewards, and define what the rewards are that come from the real world, it will then simply converge towards the right function that it needs. Now back to the rest of the program. This is the beautiful thing about it. It doesn't matter if you don't quite get it. That's good then.
[00:58:52] Red: Thank you for saying that.
[00:58:54] Blue: The important thing here is that we no longer have to know the transition function. It is possible to approximate the Q function without ever knowing the transition function or the reward function, so long as you can receive rewards in real life and so long as you do these iterative updates that I talked about, where you take the old value and throw a little extra value on top, so you slowly move towards the right answer. The easiest way to explain this: whatever state you're in in the Q function, say you're in state S and you're taking action A. You know, based on your current Q function, that the best action from there leads to some other state. So you're looking one move ahead, you're applying a discount to that one move ahead, and then you're adding on whatever the reward is. You simply take 20 percent of that new estimate and 80 percent of what the value used to be, and you assign that as the new value. What's going to happen is that the values in the Q table will slowly spread out from the goal until they hit the start state. I'll show you an actual example of this now. I'm just kind of trying to show how all of these are really the same thing done in different formats. Right. That's so you can see where the reds and the greens and the purples match. Now, remember, what we care about is not the Q function. It's the optimal policy. But from the Q function you can get an optimal policy using the techniques that we've already talked about.
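Written out, that blend is just the standard tabular Q-learning update, with a learning rate of 0.2 standing in for the 20/80 split described here and 0.9 as the discount:

$$Q(s, a) \;\leftarrow\; (1 - \alpha)\, Q(s, a) \;+\; \alpha \bigl(r + \gamma \max_{a'} Q(s', a')\bigr), \qquad \alpha = 0.2, \quad \gamma = 0.9$$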
[01:00:25] Blue: So if we can iteratively come up with the Q function, we can then use that to convert to the optimal policy and we're done. Right. The end result of this is that we now have a tractable way to just use natural rewards to find an optimal policy. And remember, everything, under the Turing principle, is a function. So theoretically, this is a general learning algorithm. Now, you've got to be careful. We're not talking about AGI, but theoretically, this is an algorithm that can learn any real world function, so long as you can figure out how to define it in terms of a Markov decision process and you can define rewards. Then this algorithm will work on it. It may not work well. It may not be tractable. There's a lot of caveats there. Now, this is why people got so excited about reinforcement learning, even though it really isn't the most useful technique at this point. There was a researcher in reinforcement learning who said, people always ask me if reinforcement learning would be an appropriate approach for their current problem, and I always answer no, it's not, and I know there's a 70 percent chance I'm right. That's because our level of understanding of reinforcement learning is still so primitive that it doesn't work as well as just regular supervised learning for the most part, right? It's much easier to define problems and solutions to problems in terms of supervised learning than in terms of reinforcement learning. The difference is that reinforcement learning can be applied to any problem, and supervised learning really can't. It always requires that you have a human that has labeled things, whereas this gets away from that.
[01:02:08] Blue: Because of that, you've got, like, Antonio Gulli from Deep Learning with Keras, where he says reinforcement learning seems to be a general learning algorithm that can be applied to a variety of environments, and may even be the first step to a general artificial intelligence. Well, I don't agree it's a first step to general artificial intelligence. But it is very exciting. It is a general purpose machine learning algorithm. It's the only one we really know about, okay? And the reason why it works is because of everything I just showed you mathematically: the fact that we can drop this transition function out, we can estimate the Q values, and then from there make an optimal policy. So this is a very exciting area for research, regardless of whether it's really a path to artificial general intelligence or not. Since machine learning is about modeling reality's hidden functions, we now have a general way to do that. So now let's talk about how this is gonna work in real life. We're actually almost done, but I wanna make sure we get in this real life example; sorry, it'll make a little bit more sense if I show you how the values slowly perpetuate outwards. So we've defined our state transition function, which you can see here, and we have a reward function. We've got our Q learning algorithm, which we implement using code something like this. You don't need to follow what the code is, but I put this code in terms of Python, where I basically take 20% of one value, 80% of what it used to be, and you slowly converge towards it. And we've got our states and actions, which are defined as this grid here.
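For audio listeners who can't see the on-screen code, here is a minimal sketch of what a tabular Q-learning loop like that might look like in Python. The grid size, state numbering, and helper names (ACTIONS, step) are illustrative assumptions, not the exact code shown on screen; only the 80/20 blend and the 0.9 discount come from the episode.

```python
import random

# Illustrative grid world (not the exact one shown on screen):
# a 3x4 grid, state 0 is the start, state 3 (upper right corner) is the goal.
N_STATES = 12
START, GOAL = 0, 3
ACTIONS = ["up", "down", "left", "right"]
ALPHA, GAMMA = 0.2, 0.9   # 20% new estimate / 80% old value, and a 0.9 discount

def step(state, action):
    """Hypothetical transition: returns (next_state, reward)."""
    row, col = divmod(state, 4)
    if action == "up":    row = max(row - 1, 0)
    if action == "down":  row = min(row + 1, 2)
    if action == "left":  col = max(col - 1, 0)
    if action == "right": col = min(col + 1, 3)
    nxt = row * 4 + col
    return nxt, (100 if nxt == GOAL else 0)   # moving into the goal pays 100

# The Q table: one value per (state, action) pair, all starting at zero.
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

for episode in range(5000):          # "let this robot run a whole bunch of times"
    s = START
    while s != GOAL:
        a = random.choice(ACTIONS)   # pure exploration; epsilon-greedy comes later
        s2, r = step(s, a)
        best_next = max(Q[(s2, a2)] for a2 in ACTIONS)
        # The update from the episode: 80% of the old value, 20% of the new estimate.
        Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * (r + GAMMA * best_next)
        s = s2

# The optimal policy is just the best-valued action in each state.
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES)}
print(policy)
```

Run enough episodes and the entries near state 3 converge toward 100, which is exactly the spreading-out behavior described next.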
[01:03:42] Blue: So let's start with our little robot here that exists inside of this grid world. It starts in the start state and we want it to find the goal. Since the Q table that it's using initially starts off with all zeros, it's basically just making random moves. So it moves all over the place. And just by chance, it happens to make it to state two, which is a state right next to the goal state. And then by chance, it happens to move into the goal state. So I'm gonna go back one. Up to this point, that formula I've been using, because the Q values are all zero, always returns zero. So I'm doing the mathematics here, as you can see, but it's updating zero with zero. Nothing's getting updated because we haven't reached the goal state yet. By chance, we now hit the goal state, okay? And we receive a reward of 100. How does the math work? Following the formula that I've listed out, you take 80% of the current Q value, which is 80% of zero, and you take 20% of the reward plus the Q value one step forward into the future, which in this case is also zero. So what you're really gonna end up with is 20% of the reward. So we work the math, and what we end up with is that the Q value for state two, taking right, is now going to equal 20. So right now the Q value for two, right, is zero; it's gonna be updated to 20, okay? Now let's say we do another run. You gotta let this robot run a whole bunch of times for this algorithm to work.
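Working that first nonzero update out explicitly, with the 80/20 blend and the 0.9 discount from the episode (the goal state itself has no future value, so its max Q is zero):

$$Q(2, \text{right}) \;\leftarrow\; 0.8 \cdot 0 \;+\; 0.2 \cdot (100 + 0.9 \cdot 0) \;=\; 20$$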
[01:05:14] Blue: And it somehow just happens to wind up in state one at some point, and it happens to move into state two. So again, we run the formula: you take 80% of the old value, basically just running these numbers here; it doesn't really matter too much. What you're gonna end up with is 20% of 90% of the 20 that's in state two. So 3.6; we update our Q table to have a 3.6 at state one, if you take a right. Now remember, think about this in terms of the value function. The value function is just the value of the best action in that state. So state one now has a value of 3.6 and state two now has a value of 20. You can see how the rewards from state three are starting to spread into our little maze, okay? Does that make sense? And even if you don't follow the math, does it make sense how it spreads and why it spreads? Yeah, it makes sense why it spreads. Okay, this is reinforcement learning. Reinforcement learning is nothing but this, okay? You're counting on the fact that as you explore that maze, and as it starts to find the rewards, the rewards are going to, based on the formulas that we've been playing with, naturally expand outward from the reward states, from the goal state, until eventually they reach our start state. At which point, we now know what the right moves are, okay? Oh, by the way, do you see that natural spread as intuition of some kind? Well, you'll see, you'll see, okay? Not yet, but it's kind of like that.
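That 3.6 comes from the same update formula: the move from state one to state two pays no immediate reward, but state two's best action is now worth 20, which gets discounted by 0.9 and blended in at 20 percent:

$$Q(1, \text{right}) \;\leftarrow\; 0.8 \cdot 0 \;+\; 0.2 \cdot (0 + 0.9 \cdot 20) \;=\; 3.6$$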
[01:06:57] Blue: Our robot's now in state two. When it's in state two, it now knows that the goal state is to the right, because it found it from state two last time. So it now knows: pick the move to the right. It doesn't have to make a random move anymore. It gets a score of 100, we run the math again, and the math works out like this: it now updates from 20, in state two move right, to 36, in state two move right. It's now increased. It now has a better understanding that if I'm in state two and I take the right move, that's worth more than 20. It's got to be something higher than 20. It moves it to 36. If you run this enough times, it will literally converge to exactly the values where it's 100 in state two and 100 in state six. So that it knows: if I'm in state two, I move right, and if I'm in state six, I move up, to get into the goal state. It will converge towards the correct Q values over time. Now, there's no guarantee it'll find the exact right ones. The guarantee only exists at infinity, right? I haven't talked about explore-exploit yet; I'll explain that in just a second. But if you do explore-exploit and you run this an infinite number of times, mathematically it's guaranteed to converge to the right Q values. Now, obviously you don't ever actually do anything an infinite number of times.
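And the 36 from that second visit is the same arithmetic again, now starting from the 20 that's already in the table:

$$Q(2, \text{right}) \;\leftarrow\; 0.8 \cdot 20 \;+\; 0.2 \cdot (100 + 0.9 \cdot 0) \;=\; 16 + 20 \;=\; 36$$

Each visit blends in another 20 percent of the true value of 100, which is why repeated runs push this entry closer and closer to 100.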
[01:08:30] Blue: But as it turns out, if you just do this a whole bunch of times, it will converge to something close to the correct Q values. Okay, does that make sense? Do you follow what I'm saying there? Yes. So this is an example where they're not quite right. I mean, I'm kind of throwing values in there. They're close to right, but they're not quite right. But those values, as they converge towards the correct Q values, will be good enough to make a near optimal policy. Okay. Now, there is one problem, though, that we still have to overcome. This is the explore-exploit trade-off. Since we're originally just moving that robot randomly, since the Q values all start off as zero, it's possible the robot, like in a worst case scenario, we could imagine a robot that happens to move from the start state to here to here, and it makes like an S. It takes the longest possible path to the goal state. And let's just say by chance, it keeps doing that. Okay. What would happen is that the Q values would spread out from the goal, but they would follow only the path that the robot had followed. So the values would spread, and you'd end up with a very non-optimal policy. The robot would always follow exactly the path it followed before, this long winding S path. So how do you keep that from happening? Okay. Well, basically what you do is you don't follow your current approximation of the optimal policy every single time. Some percentage of the time, you just take a random move. And by doing that, it gives it a chance to explore whether there's a shorter path.
[01:10:05] Blue: And again, at infinity, it will eventually try every single possible action, and it will eventually converge to the correct Q values. So that's why you have to have the explore-exploit trade-off. You can't do without it. The guarantees of converging to the correct Q values only happen if you also make random moves every so often during the training phase. What you would typically do, though, is what they call epsilon-greedy. So basically you always take the best move, except some fraction epsilon of the time, that's the e in e-greedy, the epsilon value, and you decay that. So maybe initially you tell it epsilon is 90%. So 90% of the time you make a random move. Then maybe on the next generation, you make it 89% of the time a random move. And then you just let it keep decaying until finally, after thousands of generations, maybe you're only making a random move 0.01% of the time. That's still sufficient. If you're decaying the occurrence of random moves like that, the robot will learn an optimal policy, or something close to an optimal policy, and yet it will always have a little bit of a chance of making a random move, so it can continue to improve things. By the way, this is Popperian, right? It's fallibilism. It's the fact that since there's no guarantee it'll find an optimal policy, you have to always be open to the possibility that there might be an improvement. That's the random part of the explore-exploit trade-off. And at infinity, yes, it's guaranteed to get an optimal policy; that's exactly equivalent to David Deutsch's All Problems Are Soluble, okay? It's a soluble problem, but only at infinity.
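As a rough sketch of that epsilon-greedy selection with decay, in the same Python style as the earlier snippet (the decay rate and floor value here are illustrative choices, not numbers from the episode):

```python
import random

def choose_action(Q, state, actions, epsilon):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if random.random() < epsilon:
        return random.choice(actions)                   # explore: random move
    return max(actions, key=lambda a: Q[(state, a)])    # exploit: best known move

# Tiny illustrative Q table, just to show the call.
actions = ["up", "down", "left", "right"]
Q = {(0, a): 0.0 for a in actions}

epsilon = 0.90                                   # start almost fully random
for episode in range(10_000):
    a = choose_action(Q, 0, actions, epsilon)    # in a real run this happens every step
    epsilon = max(epsilon * 0.999, 0.0001)       # decay, but never quite reach zero
```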
[01:11:54] Blue: And as long as you have an infinite horizon, you can continue to make improvements, but it will always require some diversity of thought and will always require trying things different than what our current best theories say. And there’s a very cool critical rationalist explanation for exactly what we’re seeing here in Reinforcement Learning. All right, any questions about this? I don’t wanna move on unless this makes at least a little bit of sense.
[01:12:21] Green: It makes a little bit of sense.
[01:12:23] Green: It does make a little bit of sense.
[01:12:25] Blue: So here is a grid world I actually ran. The ones are walls, the two is the start state, the three is the goal state, and the fives are pits that you fall into and you lose. I ran reinforcement learning on it, and you can see that this is the optimal policy it came up with. Imagine these little signs as arrows, okay? And you can see the pits and you can see the walls. Now look at this carefully; this is pretty close to an optimal policy. So if this is the start state, it's gonna try to move you this way, this way up here; it's trying to stay away. It's not taking the fastest path to the goal state because it's trying to stay away from the pits. By the way, there's a random chance you won't move in the direction you're trying to. So if it was deterministic, it would simply go here and go straight to the goal state. But because that gives you a chance of losing the game, instead it comes up with a policy where it moves you around the walls this way and then takes you to the goal state, so that it never even gets close to the pits, okay? That's exactly what you would expect it to come up with. Now, you can see that it's not quite optimal. Notice how that state right there is going down and then it goes up. That's not very useful, right? There's not much chance you'll ever get to that state. And that's exactly why it has a non-optimal policy at that state: it didn't get to explore that state very much.
[01:13:47] Blue: Because it normally just followed the optimal policy and it just had no need to go there. So it might be off slightly, but it's getting really dang close to the optimal policy, okay? Now here's a real-life example of stock trading. So I've got a stock that I'm trading. I've got my portfolio, I've got a benchmark, and an attempt to do manual trades that didn't work that well. Basically, I ran this through the exact same reinforcement learning algorithm that I just used for that grid world. The only thing that changed was I had to change what the state of the world looked like, okay? And it came up with a bunch of trades, and those trades made it so that it made a whole bunch of money. Okay, that's basically what I'm showing here. You can see it's trading all the time, though, right? How do you actually do that? Well, I have to come up with some way to represent the state of the world as discrete values that the Q table can use. The way I would do that is maybe I'd use a signal like a Bollinger Band, and I would turn it into a discrete signal instead, and my state of the world would be based on what these discrete signals for the Bollinger Bands are. How well would this work in real life? It really wouldn't work that well, but this gives you a feel for how you would actually translate the world into a set of discrete states that can then be used for Q values. The final result: using a future period that wasn't part of the training set, it still returned pretty good results.
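As a hedged sketch of the kind of discretization being described, here is one way you might turn a Bollinger Band signal into a handful of discrete states for a Q table. The thresholds, window, and pandas usage are illustrative assumptions, not the actual code behind these results:

```python
import pandas as pd

def bollinger_state(prices: pd.Series, window: int = 20) -> pd.Series:
    """Map a continuous price series onto a few discrete states based on
    where each price sits relative to its Bollinger Bands."""
    mid = prices.rolling(window).mean()
    std = prices.rolling(window).std()
    upper, lower = mid + 2 * std, mid - 2 * std

    def label(price, lo, m, hi):
        if price > hi:
            return 3        # above the upper band
        if price > m:
            return 2        # between the middle and upper band
        if price > lo:
            return 1        # between the lower and middle band
        return 0            # at or below the lower band

    return pd.Series(
        [label(p, lo, m, hi) for p, lo, m, hi in zip(prices, lower, mid, upper)],
        index=prices.index,
    )
```

Those integer states, possibly combined with whether you currently hold the stock, would then index the same kind of Q table used in the grid world, with daily returns playing the role of rewards.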
[01:15:15] Blue: I did find, though, that it only really returned good results if the future and the past were similar. So in this case, I was using JP Morgan right after the crash in 2008. Its future and its past were very similar, so it came up with a fairly good policy on when to buy and sell. I tried it on Microsoft, where it crashed hugely in 2008 and then just took off. It still made money when I did that, but you could make more money by just buying Microsoft and holding it. So it's not going to give you great results if the future and the past don't happen to match. And there's no guarantee the future and the past will match, but if they do, it will come up with a policy that works better than probably anything that you can come up with. This is where it becomes like intuition, Tracy, where it seems like it's intuitively knowing when to buy and sell. But what it's really doing is treating the world as an MDP, a Markov decision process, and then letting those values perpetuate backwards until it finds what the good times to buy and sell are, based on this Bollinger Band signal, and then it ends up making money based on that. If it turns out that the world has changed and those signals are no longer valid, then it's not gonna make money, and that's just kind of the way reinforcement learning works. Okay, all right, that's it. We're actually done. I think I would add one more thing. All of this assumes that we're dealing with a discrete world. The real world isn't discrete.
[01:16:44] Blue: So there's something really cool you can do with reinforcement learning where you add a neural net to replace the Q table. Basically you say, instead of using a Q table, which has to be discrete, I'm going to use a neural net as a replacement for my Q table function, and it's going to approximate a continuous Q table, whatever that is. And it will do it. It will actually approximate a continuous Q table using a neural net, and that's what they call deep reinforcement learning. That's what AlphaGo actually used, because the state space of a Go board is so large you couldn't hold it in memory, so they approximate it using a neural net instead, and neural nets are artificial intuition. The net result is that they use reinforcement learning by playing lots and lots of games, and it continues to refine its neural net as a substitute for the Q table, and it slowly converges to something that has really strong intuitions as to what a good move is. The reason why this works so well for something like AlphaGo, though, is because it's a game, and you can let it play itself billions of times until the reinforcement learning algorithm perpetuates the values outwards and it comes up with really good ones. And this is really the main thing that we have to understand: while this is our single best general purpose problem solving algorithm in machine learning, it still requires billions of examples.
[01:18:18] Blue: And it still requires that you actually define what the states of the world are; it doesn't learn the states on its own. So it still needs this giant injection of knowledge from a human who can say, well, for the problem I'm solving, I'm going to use Bollinger Bands to represent my world, and here's how I'm going to do it. You still need this giant injection of knowledge, and that's the way AlphaGo works, by the way. They often kind of pitch AlphaGo as having learned how to play Go from scratch. That's not really true: the humans still had to come up with valid states that it could learn from, states that represented true knowledge about how the game is played. So there was still this injection of knowledge coming from the human. It learned a lot on its own, and the end result is that it could play better than any human had ever played, so clearly it's got some sort of knowledge that no human being has. And yet it's still this interesting mix of human knowledge and knowledge generated by the machine learning algorithm, and it's hard to really tease those two apart. All right, any questions about this? We really are done now.
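For the curious, here is a very rough sketch of the "neural net instead of a Q table" idea mentioned a moment ago, written with PyTorch as an assumed library. The layer sizes, state encoding, and training step are illustrative; AlphaGo's actual architecture and training pipeline are far more involved than this.

```python
import torch
import torch.nn as nn

# A small network mapping a state vector to one Q value per action,
# standing in for a Q table that would be far too large to store.
class QNetwork(nn.Module):
    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, state):
        return self.net(state)

q_net = QNetwork(state_dim=8, n_actions=4)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def q_update(state, action, reward, next_state, gamma=0.9):
    """One Q-learning step, nudging network weights instead of a table cell."""
    with torch.no_grad():
        target = reward + gamma * q_net(next_state).max()
    prediction = q_net(state)[action]
    loss = (prediction - target) ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Example call with made-up 8-dimensional state vectors.
s, s_next = torch.randn(8), torch.randn(8)
q_update(s, action=2, reward=0.0, next_state=s_next)
```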
[01:19:23] Green: No, no, it's a great topic. It also reminds me why I'm not a mathematician. I'm with you.
[01:19:34] Blue: Do your heads both hurt at this point?
[01:19:37] Green: Yeah, yeah, I’m going to go take a nap
[01:19:40] Blue: So when we talk about animal intelligence in the future, one of the key things I want to get across is that animals don't require any sort of world-state knowledge injected into their heads to be able to learn how to do something, and they can learn from a few thousand examples instead of a few billion examples. So, animals: this is something that David Deutsch often talks about. He talks about behavior parsing, which we'll talk about in a future episode. He points out that apes, the gorillas, when they have to learn memes, they have to watch each other hundreds of times, really not thousands but hundreds of times, before they can learn the movements and what they need to do from another gorilla. Whereas with a human, because humans can explain things, they can watch a person a couple of times, get an explanation for what the purpose is, and then they can perform these actions. So he points out how much faster humans learn than gorillas, because humans have a real, full understanding, and even the primates closest to us don't, right?
[01:20:49] Blue: That's all true, but the thing he's missing there is that our best algorithms don't even come close to how fast an ape can learn. So animals are like this in-between state, and I don't know if that's meaningful or if it's not, but animals are more general purpose learners than any sort of machine learning that exists, and they learn faster than any machine learning that exists, even our best general-purpose machine learning algorithm, which is reinforcement learning. And I think that's significant. I don't know exactly how significant, but I think it's a very interesting thing that animal intelligence seems to sit somewhere beyond our current machine learning, but quite a bit shy of what humans are capable of.
[01:21:39] Green: Interesting.
[01:21:40] Blue: Alright, thank you guys. Thank you, Bruce. Thank you. Alright, bye bye. The Theory of Anything podcast could use your help. We have a small but loyal audience, and we'd like to get the word out about the podcast to others so they can enjoy it as well. To the best of our knowledge, we're the only podcast that covers all four strands of David Deutsch's philosophy, as well as other interesting subjects. If you're enjoying this podcast, please give us a 5-star rating on Apple Podcasts. This can usually be done right inside your podcast player, or you can Google "the theory of anything podcast Apple" or something like that. Some players have their own rating system, and giving us a 5-star rating on any rating system would be helpful. If you enjoy a particular episode, please consider tweeting about us or linking to us on Facebook or other social media to help get the word out. If you are interested in financially supporting the podcast, we have two ways to do that. The first is via our podcast host site, Anchor: just go to anchor.fm slash four dash strands (F-O-U-R-S-T-R-A-N-D-S). There's a support button available that allows you to do recurring donations. If you want to make a one-time donation, go to our blog, which is four strands.org; there is a donation button there that uses PayPal.