Episode 43: Deep Reinforcement Learning
- Links to this episode: Spotify / Apple Podcasts
- This transcript was generated with AI using PodcastTranscriptor.
- Unofficial AI-generated transcripts. These may contain mistakes. Please check against the actual podcast.
- Speakers are denoted as color names.
Transcript
[00:00:07] Blue: Welcome to The Theory of Anything podcast. Hey, Cameo, how's it going?
[00:00:11] Red: It’s excellent, Bruce.
[00:00:12] Blue: How are you? Good. You know, we're going to do another machine learning episode because, you know, I geek out over machine learning.
[00:00:21] Red: Yeah, can’t get enough.
[00:00:22] Blue: We did an episode back in the 20s somewhere on reinforcement learning. And I think I mentioned in passing that I wanted to do an update covering deep reinforcement learning. So we're going to do that today. Part of this episode is sort of a Reader's Digest version of the previous episode, because I've got to remind everybody what the basics are, and I want it to be a self-contained episode. But we're going to go fast over that. It's like the first half. And then I'm going to explain how regular reinforcement learning differs from deep reinforcement learning. And then maybe at the end we will talk about how this applies to our last few episodes. How does this relate to animal intelligence, or does it not relate to animal intelligence? We can talk about that. First of all, I should mention that this is one of those episodes where, if you really want to understand it, you should go to the YouTube version, because there are visuals available. And anytime I'm doing a technical episode, the visuals actually do help. So I'm going to be talking through things with Cameo, and I'll try to describe things for the audio people, but the full version with everything you need to actually understand this is going to be on YouTube, and you can get the visuals there. Also, I have a GitHub repository that contains code, so I'll put a link to my GitHub in the show notes, and everything I'm going to show you here today comes from it. So let me actually start with an example. I'll show you one at the beginning and one at the end.
[00:02:07] Blue: OK, we're going to start with this one here. So this is, you know, PyCharm. I probably ought to switch to Visual Studio Code, because I hear that's actually better with Python than PyCharm. But it's running. It's thinking. So I've got some various examples of using deep reinforcement learning with my code. Let me pull it over here. This is the cart-pole example. Basically, you've got this platform there and two actions you can take: you can push left or right, and not just left or right, but with a certain amount of force. And you're trying to balance that pole. So that's basically the game. And I didn't program it how to do that. You just saw that it reset because the pole fell too far in one direction.
[00:03:14] Red: Sure.
[00:03:15] Blue: So it's going to try to balance that pole as long as it can, and then you get a score at the end. You can see my score down here. The first score was 274, which is pretty good. I don't know exactly on this one; 323 that time. It's a matter of how long it stays up. Once the pole is past a certain angle, it's considered a failure and you have to go to the next episode.
[00:03:39] Red: OK, so the reinforcement learner, just by trial and error, has learned to balance that pole pretty well.
[00:03:46] Blue: You can see. Yeah. So that's the type of thing we're going to be dealing with. And I would note that this is one example. I'm going to show Lunar Lander at the end, the famous game from the 70s. And both of them are using the exact same agent with only a few lines of code different. Like, I do subclasses, and there are a few lines of code different, because obviously you have to specify the differences in the environment and things like that. But it is the same agent doing both, even though they're vastly different games. This is why people tend to see reinforcement learning as a quote "general purpose learner." We've talked about how that's a thoroughly misleading thing to call it.
[00:04:28] Red: Sure.
[00:04:28] Blue: But this is an example of why it feels that way to people, right? And why it’s so popular to say, oh, reinforcement learning, that’s the path to AGI. It is not the path to AGI. Right. You kind of see why people got excited about it.
[00:04:43] Red: Sure.
[00:04:43] Blue: OK. So let's recall that there are three types of machine learning. First is supervised learning: if you're thinking of machine learning normally, you're probably thinking of supervised learning. That's the type where you actually have labels from a human.
[00:04:55] Red: OK.
[00:04:56] Blue: And expert labels. And then it's trying to figure out a function that, given that data, can get to that correct label. And over time it gets good at it, and, let's say it's a neural network, it comes up with some neural network that operates as a function calculating what the correct answer is. And it gets high enough scores that it becomes useful at some point. Unsupervised learning has no expert label. Usually that's like, you know, mining data to find clusters of users that are similar, or something along those lines. And then you have reinforcement learning, which is what we're going to talk about, which is kind of its own thing. So reinforcement learning, as we'll see, almost sounds like it's in between supervised and unsupervised learning, but really it's on a different axis altogether. So another thing I'm going to talk about is OpenAI Gym. Are you familiar with OpenAI, Cameo?
[00:05:54] Unknown: Yeah.
[00:05:56] Blue: So that would be Elon Musk's company. Well, actually, that's not true; that's just how people say it. He's just one of the people who helped initially fund it. Sure. And he's definitely the one the media has connected to it. But I don't think he was ever the primary funder. I mean, I think there was a whole bunch of funders.
[00:06:17] Red: Yeah, yeah, that's what Elon Musk does to everything. If he touches it, it becomes his, right?
[00:06:24] Blue: which in a lot of cases is a good thing. But yeah, sometimes it can be not. Yeah. So, OpenAI Gym. OpenAI is this company that was... well, keep in mind that Elon Musk is a megalomaniac, OK, pretty much by his own admission. And he very sincerely believes that it's his purpose in life to save humanity. And if you don't believe me on this, well, you should, because it's all true; just read the biography that came out about Elon Musk. That's what it really helps you understand. And in some ways it's endearing, right? You'd say that sounds negative, but it's not. I mean, this is a guy who's really serious about trying to solve humanity's problems and is actually successfully doing it, more so than most people. Yeah,
[00:07:18] Red: like what he's done with SpaceX is remarkable. Yes, remarkable. Like, if you had said 10 years ago that somebody would come and do what he has done within our space program, or outside of our space program, everyone would have told you no way. Yeah, yeah,
[00:07:36] Blue: absolutely. And it's a good example of how, I mean, you would have said, no, you have to compete with the Russians. That's a government. You can't compete with a government. And often even my libertarian friends who believe we should get rid of governments will say, oh, but the government stops you from competing with it. This is a really good example of how that just isn't true. Yes, it's hard to compete with the government. But at the end of the day, Elon Musk said, I can compete with the government, right? And then he did. And now all the governments hire him to do their space missions instead of the Russians, because he's better at it. It's a good example of how, if you can figure out the right knowledge on how to apply capitalism to a problem, it's really good at it over time. Now, of course, he had to risk his own money to do it. He had to believe in himself enough, which is where the megalomania comes in and where it's helpful. He had to believe in himself enough that he would actually put his money on the line and almost go out of business multiple times. Again, this is from the biography. And then he managed to pull it off, to get far enough that he's created the knowledge that's necessary to outcompete his competitors. Yeah. Now, if you look at his other ventures: SpaceX really exists because Elon Musk doesn't want an asteroid to hit Earth and for humanity to die. That's really why it exists. That is its primary purpose, actually.
[00:09:11] Blue: And so in the meantime, he's going to make money figuring out the technology that is necessary to colonize Mars. That's his whole business plan for SpaceX. And I know it sounds crazy, but that's what he did, and he is doing it. Tesla, you know, that exists because he wants to get us off of fossil fuels. So he decided to make electric cars so cool that everyone's going to have to compete in that space. And they are, minimally at this point, but it's more than a little obvious this is now where things are going.
[00:09:48] Red: Well, you know, I work with the biggest automaker in the US, and I can assure you, the executives are scared to death of everything Elon Musk represents and everything that Tesla does. They know that the change has happened to the core of our society around electric vehicles already, and that they are not ready.
[00:10:17] Blue: Yes, they're terrified. Yep. And they should be. And then there's OpenAI: he has this well-publicized fear that we will create AGI and it will take over the world.
[00:10:31] Unknown: Sure. Now,
[00:10:31] Blue: being far more Deutschian, I don't think that's really the threat that he's making it out to be. But he does. And so he's decided the way to deal with that problem is: I'm going to make a company that goes out and does really cutting-edge research in AI, which, as we've talked about, is actually the wrong direction for AGI, but he doesn't know that either. And I'm going to get it all published and make it public-domain-type stuff. I don't know if they really ended up doing this, but that was the original idea: so that it's all out there being criticized in the open, and that way we've got a much better chance of not having some corporation come up with an AGI that takes over the world. Instead we'll have people knowing about it, and the knowledge will grow with lots of ideas. And it's actually kind of a clever idea. Given his viewpoint, which, you know, I disagree with, you can see where he's coming from. And he's probably the only person I know of who talks about AI safety who actually had an implementable idea on how to deal with it. Yeah. Right. So this is what OpenAI is. They're obviously really just researching narrow AI. They're not really researching AGI, because nobody really knows how to do that at this point, and they don't really understand the difference. You can tell that from reading the papers. But they're doing some really amazingly cutting-edge stuff.
[00:11:57] Unknown: Right.
[00:11:57] Blue: I mean, they are definitely pushing the boundaries of machine learning in ways that people didn't think were possible. And really the only competitor in this space, money-wise and in terms of accomplishment, would be Google's DeepMind, right? That was the other big famous one.
[00:12:13] Red: Sure. So
[00:12:14] Blue: OpenAI has a thing called OpenAI Gym, where they have a bunch of little arenas, gyms, like the cart-pole that I just showed you, where you've got an agent and you've got an environment, and you can build your own reinforcement learning agent to work with it. And so that's what I'm doing. And I intend to do more of these. I've got three working right now: I've got the taxi problem and Lunar Lander working, and then I've got the cart-pole working. There's a whole bunch of others. I wanted to see how many of these I could apply the same agent to. When I say same agent, of course, I mean similar agent. But, you know, it's been fun trying to do it. And for anyone who's interested in this space, go to my GitHub that I'll link to; all the starter code you need is there. And by the way, the starter code is about 400 lines of code. It's not a lot. So this isn't super complicated stuff code-wise. It may be a little complicated to initially understand the math and things like that, but that's why I've got these videos to help you understand it. OK, so: characteristics of reinforcement learning. It's based on exploration, so trial and error, we might say, for us Popperians who are interested in that sort of thing. Delayed rewards: when you're playing Go, like AlphaGo does, you get one reward at the end, whether you won or lost. And then you have to figure out how much to credit each move during the game for that win.
[00:13:54] Blue: And that's the mathematical problem that reinforcement learning is trying to solve. If you've seen the previous episode, then you know what I'm talking about mathematically as well. And then it's got continuous learning. I've never really seen this used. I've never mentioned this before, but you can immediately see, once you understand the math and how it works, that an agent like this can continuously learn. But I've never actually seen anybody implement that. Typically, you still have a training phase, and then you use the agent at the end, and it's not learning anymore. That's definitely the way I coded mine: there's a learning mode and a non-learning mode and a perform mode. But it could in principle continuously update as the environment changes and keep learning, which is kind of a cool concept that doesn't really exist in supervised learning; there you have to kind of do it in generations instead. OK, so it's similar to supervised learning in that there are rewards for good behavior, similar to a loss function in supervised learning. It's similar to unsupervised learning in that there is no correct result to work with. There's no human expert telling you what the correct result is and giving you a ground truth. It is instead learning from the experience, basically, of trial and error: moving around and receiving rewards. So the reward signal is really what it's learning off of. Now, it's not semi-supervised learning, though. That's really a different thing altogether, which is why I say it's almost like it's on its own axis. So
[00:15:32] Blue: semi-supervised learning would be where you have a few labels, but mostly you have data without labels. And then there's this whole set of techniques where you use a combination of supervised and unsupervised learning to do semi-supervised learning. That's really obviously in between supervised and unsupervised. Reinforcement learning has some of the characteristics of both, but it's really just its own thing.
[00:15:56] Red: Right.
[00:15:57] Blue: And we talked about, with AlphaGo, why reinforcement learning was the big breakthrough that allowed us to beat the Go master. Now, if you recall from the previous episode where we talked about this, this is about the Markov decision process. So reinforcement learning is about trying to solve a Markov decision process, an MDP, which consists of a set of states the agent can be in. I call that the world space sometimes, but it probably should be called the state space. So, in the example that we'll use again, a grid of nine spaces has nine states it can be in. There's a set of actions that can be taken: up, down, left, right, maybe. And then at time T, the agent senses the current state, which we'll call S T, the state at time T, and takes an action, which we'll call A T, the action at time T. The environment gives us a reward in response and then moves the agent to a new state, which would be S T plus one, because now you're one time slice forward, if you will.
[00:17:08] Red: OK,
[00:17:09] Blue: the environment utilizes something that we call the state transition function. This little funny Greek symbol here is just the name of a function. I don't know why they decided to make it hard by using Greek symbols. You're passing in, as your parameters, S T, the state at time T, so the current state, and the action you're taking at that state, and it's going to pass you back the new state, the state at T plus one. And the environment has a reward function of some sort. That's a function where you pass in, again, the current state and the action you intend to take, and it passes back what the reward was for that. Now, in a lot of cases, the reward is zero. If you're playing a game of Go, you only get a reward at the end of the game. So the reward is zero, zero, zero, zero, zero. Sometimes people might try to train it by giving it little subgoal rewards, but that often ends up screwing up the whole system. It requires a great deal of skill and experience to figure out when giving little intermediate rewards will encourage the agent and when it's going to just be a problem. So the key assumption here is that these two functions depend only on the current state, because that's all you're passing in. There's no need to track a history of past states. The Markov decision process is inherently based on whatever the current state is.
[00:18:43] Blue: Now, this might sound really limiting, because so many of our problems in life deal with history, not just the current state of things. But if you really think about it, the laws of physics are an MDP, right? If I have a point in space and I know its current velocity and its acceleration, I don't need to know where it came from. I can tell you where it's going to be at the next time slice, you know, a second from now. I only need to know the current state to be able to give you the next state. So the entire world is really an MDP in that sense. And you can always cheat: you can simply declare that part of the state is something from your history.
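The physics example can be written out as a tiny sketch (my own illustration, not code from the episode's GitHub): the next state of a point mass is computed from its current state alone, with no history needed.

```python
# Markov property sketch: the next state of a point mass depends only
# on its current state (position, velocity), never on its history.

def step(position, velocity, acceleration, dt=1.0):
    # s_{t+1} is a function of s_t alone
    new_velocity = velocity + acceleration * dt
    new_position = position + velocity * dt
    return new_position, new_velocity
```

Calling `step` repeatedly advances the state one time slice at a time; nothing about earlier states is ever consulted, which is exactly the MDP assumption.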
[00:19:25] Red: OK.
[00:19:27] Blue: So it's not as limiting as it first sounds. However, it is often difficult to take a real-life problem and figure out how to force it into an MDP. While in theory it can always be done, in practice it may be very, very hard. But reinforcement learners are always going to be solving an MDP. Mathematically, the Bellman equations are what we use to solve MDPs. So here's an example of a Markov decision process: here's our three-by-three grid. It's a maze without walls. You start in the lower left-hand corner, and the goal is in the upper right-hand corner. There are nine states here, and there are four actions: up, down, left, right. I drew it on the page with no "down" arrow on square seven, because it's at the bottom left-hand corner. But really there would be one; it would just put you back into state seven. I just didn't want to have to draw that. So that's a really simple Markov decision process. Now, what's an optimal policy? Because what we're really going to be solving for is an optimal policy. We're going to use the Bellman equations to solve for an optimal policy. In this really simple example, this maze without walls, if you can see my screen, every single state, every single square in the maze, has an arrow that tells you: if I'm in this state, take this action. So if I'm in the lower left-hand corner, go up. Now, that would be an optimal policy, going up. Obviously, another optimal policy here would be to go to the right.
[00:21:09] Blue: But what we're looking for is the shortest path to the goal, a shortest path to the goal for each square. So that's an optimal policy. If you look at these arrows that I've got on the screen, every single one of them always sends you the quickest way towards the goal. There can be more than one optimal policy, and in this case there is. We could have had all these arrows go to the right instead, except for these going up, and that would have also worked. It's basically the same idea; it's just a transpose of what I'm showing here. And we just want an optimal policy. We're not trying to find all of them. There can often be many optimal policies. We just want one.
[00:21:51] Unknown: OK.
[00:21:53] Blue: Now, we have another concept, the optimal value function. In this case, the goal is worth 100 points. So being in a square next to the goal should be discounted to 90 percent of that, so it's worth 90. The squares next to those are discounted by another 90 percent, so they score 81, and you just keep going from there as you get further away from the goal. That's a fairly simple concept. It's really easy to see. It's based on the economic concept of discounting the future. Right.
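As a sketch of that discounting (assuming the grid is numbered 1 through 9 from the top left, the goal is state 3, and the discount factor is 0.9, matching the episode's numbers), the optimal value of each square is just the goal reward discounted once per step of distance:

```python
# Optimal value function for the 3x3 maze: 100 at the goal, 90 one
# step away, 81 two steps away, and so on (discount factor 0.9).

GAMMA = 0.9
GOAL = 3  # upper-right corner; states numbered 1..9 from the top left

def steps_to_goal(state):
    # Manhattan distance to the goal on the grid
    row, col = divmod(state - 1, 3)
    goal_row, goal_col = divmod(GOAL - 1, 3)
    return abs(row - goal_row) + abs(col - goal_col)

def optimal_value(state):
    return 100.0 * GAMMA ** steps_to_goal(state)
```

So the start square in the lower-left corner, four steps from the goal, is worth 100 times 0.9 to the fourth power, about 65.6.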
[00:22:26] Red: It’s also kind of really similar to playing warmer and colder with a kid when you’re trying to find something.
[00:22:34] Blue: Yes. Yes. OK. Now, if you had an optimal value function, you can convert that to an optimal policy. Basically, whatever square you're in, you look at what the values of the squares next to you are, and you move towards the one that's the highest. And if there's a tie, then you just pick one. So it would be easy, trivial in fact, to compute an optimal policy from an optimal value function. OK. So if we can get to an optimal value function, then we have an optimal policy. And I'm going to keep playing this trick, where if I can get to this one, then I can get to that one. That's how math works. So the Q function is the next one. The Q function is like an optimal value function, except that instead of assigning the value to the states, we assign the value to the actions in each state. Now, obviously, if you had a Q function, you could get to an optimal value function. Whatever state you're in, let's say you're in state two here, you would simply take the action that has the maximum value for that state. In this case, that's 90. So we would assign a 90 to the state. Does that make sense?
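Here's a sketch of those two conversions. The tiny Q table is made up for illustration, with state 2's best action worth 90 as in the example on screen; the other entries are assumed values, not from the episode.

```python
# Converting a Q function into a value function and a policy.
# Q maps (state, action) pairs to values; entries are illustrative.

Q = {
    (2, "right"): 90.0,   # moving right from state 2 heads to the goal
    (2, "left"): 72.9,
    (2, "down"): 72.9,
}

def value(state):
    # V(s) is the best Q value available in state s
    return max(q for (s, _), q in Q.items() if s == state)

def policy(state):
    # the optimal action is whichever one attains that maximum
    best_action, _ = max(
        ((a, q) for (s, a), q in Q.items() if s == state),
        key=lambda pair: pair[1],
    )
    return best_action
```

So `value(2)` gives 90 and `policy(2)` gives "right": once you have a Q table, both conversions are one line of max-taking each.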
[00:23:45] Red: Yeah, that makes sense. OK. So
[00:23:47] Blue: again, each of the steps individually is very simple and easy. It's very clever, in fact. I love this stuff, because you kind of see the genius of how it all fits together. So how would I actually calculate a Q function so that I can get to the value function? Well, in this case, since I made this maze myself, I know what the state transition function is. I'm making this game up, so of course I know what it is. The transition function is this table here. Basically, let's say I'm in state one and I decide to go to the right. So that says: state one, right, that's my input. The output is state two. So I've moved to state two. Makes perfect sense, right? I just put all that into a table. Sure. That table is now my state transition function. Likewise, I know what the rewards are, so I can make a reward function very easily. Basically, state three is worth 100 and everything else is worth zero. Very simple, right? You get 100 points for reaching the goal and nothing for anything else.
[00:24:58] Red: Yeah, that makes sense.
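In code, the two functions just described are literally lookup tables. This is my own sketch, not the episode's code, and only a few transition entries are written out:

```python
# State transition function and reward function for the 3x3 maze,
# written as tables. Only a few illustrative transition entries are
# shown; the full table has one entry per (state, action) pair.

TRANSITION = {
    (1, "right"): 2,  # from state 1, moving right lands in state 2
    (2, "right"): 3,  # from state 2, moving right reaches the goal
    (1, "down"): 4,
    (7, "up"): 4,
    (7, "down"): 7,   # bumping the bottom edge leaves you in place
}

# Reward function: state 3 (the goal) is worth 100, everything else 0.
REWARD = {state: 0 for state in range(1, 10)}
REWARD[3] = 100
```

Because the environment is this small and fully known, both functions fit in a handful of dictionary entries.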
[00:25:00] Blue: Now, really, I would probably want to do this where I give each step a reward of negative one. Mathematically, it works out nicely because it forces the agent to realize it shouldn't be dilly-dallying; it will learn faster because of that. They work out in the end to the same result. But, of course, we want our programs to run as fast as possible, so that is typically how we do it. However, for humans, when I'm trying to explain the math, it complicates things and makes them harder. So I'm going to go with rewards of zero, except for the goal, even though I wouldn't do that in real life. So if I have the state transition function and the reward function, which I do in this case, I can use something called dynamic programming to calculate what the optimal value function is, or what the Q function is. From there, I get the optimal value function, et cetera. In fact, I could calculate the optimal value function directly. I don't even need the Q function if I've got the state transition function and the reward function. Dynamic programming, I won't go into, but it's a little misleading as a term, because it's been around since before electronic computers. So when they say programming, they don't mean computer programming. It's similar to linear programming in that respect; that word also existed before electronic computers, so it wasn't referring to computer programs either. It's all a little bit mixed up, though, because the word computer itself actually predates electronic computers. A computer used to be one of a group of women that you paid to take a note card and do a computation for you.
[00:26:39] Blue: And then they would act like what a computer does today: each doing her little computation and passing the card to the next person. The movie Hidden Figures is about the women who held such jobs back then.
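Stepping back to the math for a moment, the dynamic programming mentioned above can be sketched as value iteration. This is my own sketch, assuming the 3x3 grid (states 1 to 9 from the top left, goal at state 3 pinned to 100) and a 0.9 discount; it sweeps the Bellman update until the values settle into the 100/90/81 pattern.

```python
# Value iteration (dynamic programming) for the 3x3 maze: with the
# transition function known, repeatedly apply the Bellman update
# until the values stop changing.

GAMMA = 0.9
GOAL = 3
ACTIONS = ("up", "down", "left", "right")

def move(state, action):
    # transition function: step on the grid, staying put at the edges
    row, col = divmod(state - 1, 3)
    dr, dc = {"up": (-1, 0), "down": (1, 0),
              "left": (0, -1), "right": (0, 1)}[action]
    new_row, new_col = row + dr, col + dc
    if 0 <= new_row < 3 and 0 <= new_col < 3:
        return new_row * 3 + new_col + 1
    return state

def value_iteration(sweeps=25):
    V = {s: 0.0 for s in range(1, 10)}
    V[GOAL] = 100.0  # the goal's value is pinned by its reward
    for _ in range(sweeps):
        for s in range(1, 10):
            if s != GOAL:
                # discounted value of the best reachable neighbor
                V[s] = GAMMA * max(V[move(s, a)] for a in ACTIONS)
    return V
```

After a few sweeps the squares next to the goal reach 90, the next ring 81, and the far corner 100 times 0.9 to the fourth power, matching the hand-drawn value function.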
[00:26:52] Blue: And they were called computers. So when the electronic computer came in, it was replacing that job. That's why they called them electronic computers. So now, here's the thing: what if I don't have the state transition function or the reward function? In real life, if I'm trying to make an agent that's going to figure out the stock market, the stock market in principle has some sort of state transition function going on. But nobody knows what it is, right? It's determined by the actions of every person in the world, and what the news is today, and all sorts of things. There's no way you could write down what that function is.
[00:27:39] Red: Right.
[00:27:43] Blue: So what do you do in that case? Since you don't have the reward function or the state transition function, you can't use dynamic programming to solve for the optimal policy. In our previous episode, I went through the math to get to this equation here, which is a certain form of the Bellman equation. I'm not going to go through that math again. You can go to the other episode if you want it. It's not in the audio version, only in the video version. But if you're interested in that, which, if you're into this, you should be, go take a look at it there. Basically, you start with the basic Bellman equation and you want to get it into this format, which is recursive. So you have this Q function. Remember, the Q function takes a state and an action as its inputs. We're saying that it's equal to the reward at that state and action, plus the discount factor times the maximum over actions of the Q function for the next state and the next action you would be taking. This, remember, is the transition function, so this is effectively the next state that you find yourself in. So this is recursive: it's the Q function in terms of itself. Because it's recursive, it allows us to play a trick, which is where the cleverness comes in. So basically, imagine starting with a table of Q values. You can represent the Q function as a table. And remember, the Q table holds the value for a state and an action together, a tuple of a state and an action. So set them all to some initial estimate.
[00:29:30] Blue: And for the sake of argument, let's say that initial estimate is zero for everything. The worst possible estimate. So it doesn't have to be a good estimate; it's just any estimate at all. That's why I'm putting estimate in scare quotes: it's not really an estimate at all. Just set it to zero. Then follow the following algorithm: select an action A and execute it. Receive immediate reward R. Observe which new state you are in. And then update the table entry as follows: you take your current Q table entry for that state and action, and you update it based on the reward you received, plus a discount of whatever is the best action in the next state. OK, so by doing this, it's possible to prove mathematically that this will converge, at infinity unfortunately, to the optimal Q function. And from there you can turn it into the optimal value function, and then you can turn that into the optimal policy, because we know how to convert between those.
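The algorithm just listed, as a minimal sketch (my own code, not the episode's GitHub version): missing table entries stand in for the zero initial "estimates," and each experienced transition overwrites one entry with the reward plus the discounted best next value.

```python
# One Q-learning update, as described: after taking action a in
# state s, receiving reward r, and landing in state s', set
#     Q(s, a) <- r + gamma * max over a' of Q(s', a')

GAMMA = 0.9
ACTIONS = ("up", "down", "left", "right")

def q_update(Q, state, action, reward, next_state):
    best_next = max(Q.get((next_state, a), 0.0) for a in ACTIONS)
    Q[(state, action)] = reward + GAMMA * best_next
    return Q[(state, action)]
```

Repeated over enough random wandering, these one-entry-at-a-time updates are what converge (in the limit) toward the optimal Q function.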
[00:30:47] Red: But the infinity’s a problem, right?
[00:30:50] Blue: Yes. So, like many things in life, think about our computational theory episodes. A lot of times, guarantees come at infinity, which we'll never reach. If you have a computation that goes to infinity, it never ends, so it's not useful to us. But a lot of times, before infinity, you can get to a useful approximation. In computational theory, you might have an algorithm that is intractable. It does end, it's not going to infinity, but it's exponential, so the sun's going to explode before the computer finishes running it, no matter how fast your computer is. But if you simply relax it a little bit... take the traveling salesman example, where you're trying to find the shortest route between all the cities so that the salesman can get home as quickly as possible. That's an exponential problem. It's basically intractable; it's impossible to solve exactly even for a fairly small number of cities. But if you instead say, look, I just want something that's close to the best, suddenly it becomes a tractable problem. So a lot of things in life are based on trying to relax the constraints slightly. There's a book called Algorithms to Live By, which I'll put in the show notes. It's an excellent book, and it talks through how the types of problems we bump into in real life, like trying to keep your desktop organized, are actually computational problems. And so there are different solutions to each computational problem, since the actual optimal solution is intractable, and it's intractable for a human too.
[00:32:41] Blue: If it's intractable for a computer, it's also intractable for a human, because that's what computational theory says has to be true. So the way you would solve it is to figure out which constraint to relax, and then there are different heuristics that you can use. Say you want to keep up to date on email: do you go through and answer all the quick ones first to try to reduce the pile? These are actually types of algorithms, like well-known sort or search algorithms, that are being used. And there are experts who teach you how to live your life, and what they're really doing is teaching you a good algorithm, OK? So that's what we're going to be doing with reinforcement learning. Since we can't run this to infinity, we're just going to run it long enough that we get a good agent that seems smart enough to us. And this turns out to be tractable, and it's possible to do it for many, many interesting problems. So we now have our mathematical approach for how we're going to do this. Let's walk through how this actually works. At this point, we have a Q function. Here's our Q table with the state-action pairs and the estimated value of zero at the moment. We have our nine states. We've got a goal, we've got a start state. We've got the actions that are available in each state. We have our Q-learning algorithm. And then down here, I've got the Python version, which is a lot easier to understand, if you're a programmer, than the way I tried to explain it.
[00:34:08] Blue: Basically you can see that you're taking the old value and weighting it to some degree based on some alpha that you set, and then you're taking the new value and weighting it by the rest, so it's one minus alpha versus alpha. So let's say 50-50: you're weighting the old value half and the new value half, okay? You can adjust what that percentage weight is, but it always has to add up to one. And then we can see that this here's the R right there. That's the reward. So here's the thing that's interesting. It's the rewards that make this work. The reason why this converges to the right answer is because the rewards will force it to over time, okay? As long as you don't move it too fast and you only update it a little bit at a time, that's called the learning rate, it will converge to the right answer over time. So let's use the example from our previous podcast. We've got our robot here and initially our Q function is all zeros. So our quote-unquote optimal value function is basically random. So our little robot is moving all around and it just by chance finds itself at state two. Now up to this point it's been updating the Q function, but because all the values are zero, the update always gets reset to zero. So there's no real updates going on. I mean, the computer is doing updates, but it just takes zero and resets it to zero, because as of yet it's never found an actual reward. Suddenly it moves from state two into state three, that's the goal, gets a hundred-point reward. Here's what happens mathematically.
[00:35:44] Blue: So we're in state two, we moved right. I used basically that formula that I showed you a moment ago on the screen. If you're on the audio version you don't need to know exactly what it is. The idea here is that this formula really works, and it gives you a value of 20, and so now you update your Q table at (two, right) to be 20. It's no longer zero, okay? Now, we know the right value there should be 90, because I just showed you it should be 90. Well, it depends on what discount factor you're using, but I'm using 0.9, so it should be 90, okay? But 20 is closer to 90 than zero, so we're headed in the right direction, okay? So now let's say that our robot's out there moving around and by chance it ends up in state one, and remember that state two now has a value of 20 in it. So by chance it moves into state two, we run the numbers again, and voila, suddenly state one to the right now gets a value of 3.6, okay? Just because that's the way the values work out. So now we update our Q table and (one, right) has a value of 3.6. So right now we've only run this twice. Normally we run it hundreds of times, but we already have an arrow in state two pointing towards state three and an arrow in state one pointing towards state two, because 3.6 is better than zero and 20 is better than zero, right? We're adding knowledge to our Q table, that's right.
[00:37:22] Blue: Okay, now when it moves into state two, it knows to go to state three. So it uses that knowledge, it moves to state three and boom, it scores another 100 points. So let's do the update again. Remember we're at 20 already, we add in the new reward of 100, you apply the factors, and now it updates to 36, which is closer to 90, which is the correct answer, okay? Now you might think that if we did this enough times it would eventually get over 90. It never does. Because of the way the math works it cannot get past 90. So it will converge to 90 at some point and then every update after that will just reset it to 90 again. That's the beauty of the math, okay? So imagine that we run this a few hundred times and we've now got Q table estimates that look something like this, okay? Where they're much closer to the right numbers. So state two's right is now at 90, its left is 78, that sounds pretty good. Its down is 80. They're not quite right, but they're starting to get into the ballpark of the correct optimal numbers, okay? This is what we're looking for: we're gonna run it enough times that over time it will converge to the right thing. Now there is a problem though that we haven't solved, and that is, let's say just by chance our robot took this path, basically an S path, the longest possible path that goes like this, and it gets into the goal. And let's say on the next one it happens to do exactly the same thing. It goes like this.
[00:39:01] Blue: And then on the next one it happens to do exactly the same thing. And it just keeps doing that just by chance. What you would end up with is a policy that was that set of arrows. Right, the very worst path. Right, so we don’t want that. So what we do is we do something called the Explore Exploit trade off, something I’ve mentioned several times in past podcasts and basically what this means is that some percentage of the time you use your knowledge to take the best path and some percentage of the time you just make a random move. And you basically reduce that over time. So maybe initially 99 % of the time you’re making a random move and maybe near the end 0.001 % of the time you’re making a random move. But you never go to zero because the guarantees of success only exist at infinity. So you always wanna make a little tiny bit of other moves just in case you’ve missed some faster path so that it has a chance of exploring it.
[00:40:02] Red: Interesting.
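The explore-exploit scheme Blue just described can be sketched in a few lines of Python. The decay rate and floor below are illustrative values, not the episode's actual settings; the key points are the coin flip between a random move and the best known move, and the fact that epsilon never decays all the way to zero.

```python
import random

# Epsilon-greedy action selection: with probability epsilon take a random
# action (explore), otherwise take the best known action from the Q table
# row for the current state (exploit). Epsilon decays over time but is
# floored above zero so the agent never stops exploring entirely.
def choose_action(q_row, epsilon, rng=random):
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))                   # explore
    return max(range(len(q_row)), key=lambda a: q_row[a])  # exploit

def decay(epsilon, rate=0.995, floor=1e-5):
    # Multiplicative decay, never reaching zero.
    return max(epsilon * rate, floor)
```

With epsilon at 0.99 the agent wanders almost at random; as training runs on, the decayed epsilon leaves just a trickle of random moves to catch any faster path it missed.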
[00:40:04] Blue: And now let's put this into Popperian terms. This whole process is multiple levels of trial and error. Okay, there's a certain amount of randomness: I set it all to zero and so it has no reason to prefer one move over the other. Maybe I set it to always move up, or maybe I set it to just pick a random move, depending on how I program this. And then on top of that I'm intentionally giving it the explore-exploit trade-off. They call it an epsilon-greedy factor, where I say, okay, my epsilon setting is 80%. So 80% of the time take a random move, and then 20% of the time use your policy and figure out what the right move is. And by doing that, I'm giving it a chance to just by trial and error explore the environment, and then my numbers will converge to good numbers because of that, okay?
[00:40:55] Red: Interesting.
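The update Blue walked through a moment ago can be reproduced with the standard Q-learning rule. A minimal sketch: a learning rate (alpha) of 0.2 and a discount factor (gamma) of 0.9 are assumed here, because those are the values that reproduce the 20, 3.6, and 36 from the walkthrough.

```python
# Q-learning update: weight the old estimate by (1 - alpha) and the new
# evidence (reward plus discounted best next-state value) by alpha.
# alpha = 0.2 and gamma = 0.9 are assumed; they reproduce the numbers
# from the episode's walkthrough.
ALPHA, GAMMA = 0.2, 0.9

def q_update(old_value, reward, best_next_value):
    return (1 - ALPHA) * old_value + ALPHA * (reward + GAMMA * best_next_value)

# First visit: state 2, move right into the goal (reward 100, terminal).
q2_right = q_update(0.0, 100.0, 0.0)       # -> 20.0
# Later: state 1, move right into state 2 (no reward, but state 2 is worth 20).
q1_right = q_update(0.0, 0.0, q2_right)    # -> 3.6
# Second visit to state 2, moving right again.
q2_right = q_update(q2_right, 100.0, 0.0)  # -> 36.0
```

Run over and over, the estimates keep being pulled toward their fixed point, and an update applied at the fixed point just returns the same value again.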
[00:40:56] Blue: So here's a real-life maze example. I still need to program this one up, but this is something I did way back in school. So in this maze here, the ones are walls, the fives are pits, the first eight is the starting point, the rest of the eights are the path to the goal, and the goal is the three, okay? And notice that it found an optimal path this way. Now why is that? There's clearly a faster path this way, okay? Well, the reason why is because this was actually a stochastic environment. So if I tell it to move up, there's a 10% chance that instead it'll move left or right. Like in real life, robots don't do exactly what you tell them to because, you know, physics and friction and things like that. So we model that by saying there's a chance that something goes wrong. And this process, an MDP, can handle that. It's not a problem. So if it took this path, there was a chance it would slip into the pit, and the pit is worth, let's say, negative 1000, okay? Well, the moment it slips into that pit, immediately this negative value starts coming out of there and propagating back numerically, okay? And the agent immediately learns, oh, I'd better go this way to avoid that possibility, okay? And it knows it can get stuck up here if it goes up into the wall, and eventually it explores its way out. And here I have the policy that my agent actually found. You have to use a little bit of imagination here, but this is a policy made of arrows. So this is an arrow going right. This is an arrow down.
[00:42:33] Blue: This is an arrow to the left. This is an arrow up. Okay, there’s the pits, there’s the walls, there’s the goal. And now look at this, you can see that this is pretty close to an optimal policy. I start here, I go there, I go there, I go there, there, there, there, there, there, there, there, there. And I go right into the goal, okay? So it did find something pretty close to an optimal policy. Now it’s not quite optimal because of like say that right there, okay? If it actually by chance slipped into there and it was following its policy all the time, it would actually get stuck in an infinite loop. Right, right. Which is one of the reasons why, well actually it wouldn’t because there’s a chance it would slip out at some point. So eventually it would. But if it wasn’t stochastic, wasn’t random, it would get into an infinite loop. So it’s not an optimal policy but it’s close to an optimal policy. In fact, it’s close enough that if I run this, that agent’s gonna look smart, right? It’s gonna look like it’s figured out how to avoid the pits and get around the walls and get to the goal, okay? I also tried this for the same class. This is a machine learning for trading at Georgia Tech. I did this for trading. So here was an example of J.P. Morgan and here’s how much money I would make if I just bought J.P. Morgan and held it. Here is me manually trying to beat the market and here is my trained learner. Now it’s cheating. It gets to see what actually makes it money and then I’m just running it on the same trained data. So
[00:44:07] Blue: of course it’s gonna do awesome. In real life, what we tried doing is we then tried using it on a different time period that hadn’t been trained on and it still made money just on as much. That’s what you’re hoping for, right? Now I have to admit, I tried this, by the way, this is the trades it made, made a ton of trades trying to beat the market. I tried this on Microsoft. So this was trained during 2008 when J.P. Morgan tanked and stayed tanked for years later, right? Microsoft tanked during 2008 and then it took off like a rocket. So I tried doing this on Microsoft and it did make money, but less than just buying Microsoft and holding it, like quite a bit less actually. Now you think about that, that kind of makes sense, right? What we’re really doing is we’re training it for an environment that’s specific to 2008 which was the big crash. So it learned how to deal with a 2008 type environment. What Microsoft did after 2008 was nothing like that. So it’s a totally different probability distribution. So of course the agent’s not gonna do well. It’s almost like trying to train the agent on one maze then give it a totally different maze, but it tries to use the same policy, okay? Right,
[00:45:24] Red: yeah, that makes sense.
[00:45:26] Blue: Okay, so there's obviously a judgment call there. Do I, as the human, think the market my agent trained on is like the market I'm currently in? And if I'm wrong, then the agent's gonna do poorly.
[00:45:42] Red: By the way, that’s why you should probably be distrustful of machine learning agents doing the market.
[00:45:48] Blue: There's clearly somebody who makes money out there doing it, but I suspect most people do not. So now let's talk about Lunar Lander. It was a famous game built for the DEC back in 1973, and then Atari published a version in 1979. And you had this little lander and you tried to fire rockets. You could fire the booster or you could fire the left or right rotation, and you would try to orient yourself and land between these two flags, okay? So this is another one of the games that's in OpenAI Gym. So I wanted to try this one. The actions for this game are: do nothing, fire the left orientation engine, fire the right orientation engine, or fire the main engine. So I've got four actions I can take. All actions are discrete, they're either off or on. By the way, a lot of this I'm taking from various papers, like "Solving the Lunar Lander Problem under Uncertainty using Reinforcement Learning" by Gadgil et al. The state for this game consists of eight values. So remember, our state in the maze was just a single value. It was just one through nine, right? An integer one through nine. In this case, the state's actually a tuple of values. It's the X position, which is a real number. The Y position, which is a real number. The X velocity and Y velocity, which are also real numbers. The angle of the lander and its angular velocity, which are real numbers. And then finally, whether the left or right leg is in contact with the ground; those are both boolean, so they're discrete, okay? Same paper cited for this. So this is the state that exists for this problem. Now we have a problem right away.
[00:47:32] Blue: And that is that these are real numbers. So everything I just taught you about reinforcement learning doesn’t apply because a queue table is always discreet. A table is by nature discreet. So how do you handle real values in a state like this? The state space is in fact infinite because of that.
[00:47:50] Red: Interesting.
[00:47:51] Blue: So there is no Q table. The Q table would be infinite, right? It just couldn't exist in real life, okay? So you cannot use Q learning to solve this problem. Well, that's not quite true. You can kind of use Q learning to solve this problem. What you can do is something called discretizing. And I actually tried doing this. So here's a Bollinger Band that I used for machine learning for trading. That was one of the signals my agent was gonna use. It was continuous also, so instead I turned it into a discretized version. And you can immediately see that this graph here looks a lot like that graph there, even though a lot of the resolution's lost, right? So that's probably gonna be good enough for an agent, to just discretize it like this. But for the Lunar Lander, it doesn't work. And here's why. I tried making a Q table with a million entries. So I've got all those values. I've got to translate them into a Q table that's got a million entries. Well, with a million entries, you would have to run that game probably billions of times to fill up all the entries. I mean, it just wouldn't work. It would not be tractable. So most of those entries stay zero no matter how many times I run it. Furthermore, even if I did somehow manage to fill up that table, it's gonna treat a lot of very different situations as if they're the same, because they just happened to fall into the same discrete area of wherever I happened to discretize it, right?
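Discretizing, and why it blows up for Lunar Lander, can be sketched like this. The bin counts and value ranges are illustrative only, not the episode's actual settings.

```python
import numpy as np

# Discretizing (binning) turns a continuous value into a bin index so it
# can be used as a Q-table coordinate. Ranges and bin counts are
# illustrative.
def discretize(value, low, high, n_bins):
    edges = np.linspace(low, high, n_bins + 1)[1:-1]  # interior bin edges
    return int(np.digitize(value, edges))

# e.g. an x position in [-1, 1] split into 10 bins:
bin_index = discretize(0.05, -1.0, 1.0, 10)

# The blow-up: just 6 continuous state variables at 10 bins each already
# gives a 10**6-entry state table, and each entry holds 4 action values.
table_entries = 10 ** 6 * 4
```

A table that size almost never gets every entry visited, which is exactly the problem described above.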
[00:49:39] Blue: It’s never gonna be a great agent if I’m trying to do desperatization like this, okay? So we have a different approach we’re gonna use, deep reinforcement learning. So here’s the intuition. Neural nets can theoretically approximate any function. My source for that is multi -layer feedforward networks are universal approximators by Kurt Hornick. He proved in that paper that neural networks, the space of neural networks is can, any function can be made in the space of neural networks, okay? This is called the universal approximation theorem, okay? So he basically proved that, to put this another way, he proved that neural networks are universal. So their equivalent to a Turing machine, they can run any algorithm is really what he’s proving, okay? So based on that, we know that neural nets are a great option for trying to approximate a function that we don’t understand. Here’s the catch though. When I say neural nets, it’s no one particular neural net. It’s the space of all possible neural nets that we’re talking about. So it’s not like a single neural net is equivalent to a Turing machine, it’s not. No single neural net is. What they really mean is you can try different neural nets and somewhere in there is gonna be a good one that’s gonna approximate your function well. You may not know what it is. It may be very hard to find, but we’ve got this intuition now based on this mathematical proof in this paper that neural nets do well at approximating functions. There’s still research going on as to why they do so well because I don’t think we fully understand why they do so well.
[00:51:21] Blue: This is part of it, the fact that it was universal, but that’s like a loan doesn’t explain it. So there’s something more going on here and I’ve seen other really interesting papers trying to explore that. One from Max Teigmark where he, I didn’t read the whole paper yet, I read the first part, but he basically shows that, he basically shows that just because of the way the laws of physics work, that the space of the most interesting problems would happen to be the computable ones that a neural network could find. I don’t remember exactly how he goes about trying to prove that, but there’s interesting research going on trying to understand neural networks better. So what if we took the Q table? Q table is a function, okay? Everything in life is a function, but the Q table is just obviously just a function. Q function is state in action, goes in, it returns the Q value. What if we just decided we’re gonna use a neural net to replace the Q table? We’re going to let it, the Q function be approximated by a neural net, okay? Would that work? Would that solve our problems? So in principle, the answer is yes. That’s exactly what neural nets are good at, is approximating a function like this that would be otherwise intractable for us. If I’m trying to make a neural net that recognizes faces, I’ve got no clue how to go about programming that directly, but I know how to create the training program that will by trial and error, find the right solution using gradient descent, right? For that neural net. It’s the same sort of idea here, okay?
[00:52:59] Blue: That we can have some faith that neural nets are capable of approximating the Q function and that it’ll do a good job of it, okay? Now, here’s the problem though. How do you train a neural net to replace the Q function? Well, normally we would use supervised learning and supervised learning, you have to have a human who labels the ground truth. Well, who’s going to do that for the Q function, right? This is a function that doesn’t have, like we don’t know what it is. That’s why we’re trying to use a neural net in the first place. That’s why we’re trying to use a learner in the first place and it’s too big, it’s an infinite space. So that’s why we’re trying to approximate with a neural net. So there’s no way a human can teach a neural net how to train using labels, using regular supervised learning, okay? We’ve got no ground truth, so that’s a problem. So let’s propose a solution. Somebody, I suspect that somebody was drunk one day, they were in the bar, they’re drunk, and they go, oh, let’s let the neural net train itself. And everybody thought that was a great idea because they were drunk. You know, I say this jokingly, but that actually happened. The guy who wrote the book, Deep Learning, I forget his name, but he invented a generative adversarial networks and he did it while he was drunk in a bar and they were having an argument about amongst a bunch of computer nerds and they were drunk. So he said, oh, I can make it work. And so he went home, like the next day, he’s like, that’s never gonna work, right? And he goes and tries it and it happens to work.
[00:54:41] Blue: And a lot of stuff in machine learning is like that. It’s more of an engineering discipline than a science, right? They just go try stuff and some stuff works and some stuff doesn’t. So the idea of learning the neural net train itself is the idea that they came up with. It sounds so completely absurd. And this is why I called this presentation bootstrapping intelligence. It’s literally that the neural net has to bootstrap itself. It has to teach itself how to approximate this function. Okay. And I can’t even believe it works at all, but it does. It actually works, you just saw the current poll, it really works in real life, okay? Now I should note that again, it’s sort of an engineering discipline. If I were to go run my program, some percentage of the time it will fail miserably to work and some percentage of the time it will work. So what I actually do is I use pseudo random numbers, I use a seed, I go try them, different seeds. That way when I find a seed that actually works, I can repeat the success over and over again, okay? So pseudo random numbers, in case you don’t know what that is, when you have a computer create a random number, obviously a computer can’t, it’s just entirely deterministic. It can’t create true random numbers. So they have something they call pseudo random numbers, which is a distribution that looks random, but it’s actually entirely deterministic and you can recreate the same set of random numbers over and over again by giving it what we call a seed, what the starting value is that it uses in its little algorithm.
[00:56:18] Blue: So pretty much all random numbers in computers are really pseudo random numbers and we don’t know of any algorithms that require random numbers. We do know some that are better with true random numbers, but none that require it. So that’s why we don’t, it doesn’t change the space computational, under computational theory. There’s no algorithms that require random numbers, therefore that doesn’t need to be part of what a computer is. And that’s why computers are not built with random numbers, either theoretically or physically, we don’t build them with random number generators. We could, like I’ve heard that in Las Vegas, that the slot machines are built with computers that have a module that uses like white noise or something as a true random number generator, so that it’s not possible to hack it and hack it. Right, so we could, we could build computers with random number generators, we just don’t, just doesn’t seem to be a need. Okay, so what are we gonna track? How are we gonna make this work? We’re gonna track the following. So we’re gonna have a tuple that contains, so remember an environment, it has basically an environment, has a state it’s in, and then you can tell it to, you can tell it what action you’re gonna take and then you can tell it, show me the next state and it moves to the next state, okay? So each time that happens, I’m going to save off the state that, and remember that was, the state is the eight values we previously talked about.
[00:57:52] Blue: We’re gonna save off what my action was, what the reward was that I received for taking that action at that state and what the next state was and if it ended the session or not, okay? And I’m gonna just save those off somewhere. And that is going to be my data set. So when I’m trying to train my neural net, I’m gonna take that giant list of my whole history of everything I’ve done in my training so far, I’m gonna randomly pick a hundred of them and that’s gonna be my mini training batch that I’m gonna use and it’s a different hundred each time, okay? So there’s an equal possibility of getting good examples and bad examples. Remember, it learns from bad examples, it learns from good examples, we want a mixture of both, that’s why we do it randomly, okay? So initially, all of them will be bad examples, it will learn to avoid bad examples, it will start to get a few good examples and then those will start to occasionally show up and it will allow the neural net to converge to good results over time. So how do we actually do this? Well, I have something called the update hue model function and I’m passing in everything that we just talked about, the state, the action, what the reward is, the new state, if it was done or not. I’m also passing in a gamma, which is just the bell equation needs these little extra formulas, they’re the hyper parameters that you can tweak, you don’t really need to know what it means in this case.
[00:59:17] Blue: And you can see with my code here in Python that I’m taking each of these and I’m putting them into a bunch of arrays that I’m then going to use and here’s the rest of the code. Basically what I’m doing is I’m saying, okay, my predicted values is equal to the model and I’m predicting on the current states, okay? So here we’re predicting an action for the state that we were just in. So remember by the time we’re getting here you’ve now moved to the next state. So this is the state that we were in when we made our move. Then we’re going to say, okay, I wanna predict using the same model on the next state that we ended up in. So that’s gonna get us our next predicted values. Remember you’re doing this on arrays, so you’re doing it on all the previous states and all the previous next states, et cetera, okay? And then I’m going to get actual values by saying take the next predicted values and I’m going to take an arg max on those. So that’s like creating the optimal value function out of the Q function. That’s kind of the same thing you’re doing there. Times by the gamma, which is the discount factor. And then we’re gonna add the rewards in from the rewards array, okay? Now remember that having the rewards like that, that’s what causes convergence over time. So it’s the same idea here as regular reinforcement learning. It’s just that we’re using the neural network’s own predicted values as the ground truth, okay?
[01:01:00] Blue: So we’re creating our supposed actuals and spare quotes, we’re creating our actual values that we’re gonna then use for training by looking at the next state we ended up in and then saying, give me a prediction of what it should have been and then we adjust it by an amount based on the rewards and over time it will converge similar to regular reinforcement learning. There’s the rewards. And then we, this here is, we tell the network to train itself. We say, okay, now given that we have, the previous state arrays, that’s the inputs and the actual values, which is the ward adjusted outputs, predictions for the next state, use that and train yourself. And over time, it magically works. I don’t know what else to say, it just, it does work. So essentially what we’re doing is we’re ripping out the queue table and replacing it with a neural network that takes the states as an input and outputs an action. Okay, now remember on the old queue value table it outputted a value that we then turned into an optimal policy, which then told us which action to take. So we’re kind of skipping a few steps. We’re saying, give me the inputs and output an action. And this effectively replaces the queue table slightly more than the queue table but effectively replaces the queue table. And that’s what the neural nets does. Now, since neural nets generalize well, we don’t have to run a billion games to try to get into every possible state, right? It’s neural nets naturally figure out over time, this state similar to that state. And it starts to assign queue values effectively to states it’s never seen before, but that are mathematically similar to it, okay?
[01:02:54] Blue: So you’ll end up eventually with, if you run it enough times, which is in this case only a few hundred times, you’ll end up with a neural net that effectively returns actions that are just pretty good, right? Now, if you recall from the AlphaGo episode, it is possible that you’ll hit some state it hasn’t seen and that the neural net hasn’t generalized on well and it will hallucinate and it’ll try to do something crazy, right? And that problem still comes up. You’re still a matter of, there’s a certain amount of luck that the Explorix Point trade -off works, it explores enough states, some amount of luck that you happen to pick a good starting point for your neural network and it trains and gets to a good spot. There’s a number of things that really boil down to luck, which is why you have to, it doesn’t always converge well. Sometimes it does, sometimes it doesn’t, you just have to kind of try it until you get a good run, okay? Plus for free, neural nets, because they can take real values in and then output an action, you handle the continuous problem for free, it’s gone. There’s no longer a problem with the continuous infinite space, okay? This is what the neural net gives us by ripping out the queue table and putting the neural net in its place. Now, here is an actual run. Oh,
[01:04:14] Red: interesting.
[01:04:15] Blue: So here’s training. Something that’s really interesting is that like the very first training run is always better than the ones that follow. So I get the feeling, this is for Lunar Lander, I get the feeling that randomly picking an action is not completely terrible for Lunar Lander. And so it initially does all random actions and it scores a negative number but it’s not too negative. And then after that, it thinks it knows something and it doesn’t really and it goes off the deep end and it starts crashing hard. And then it learns from the fact that it’s crashing hard and it starts to do better and then it kind of oscillates back and forth and over time it improves. And if you had it as a running average, it looks like this. And solving this problem is considered at 200. You can see that we’re definitely averaging around 200 at this point, okay? So here is that same agent that I’ve run now after it’s all trained, I run it a hundred times, it’s no longer taking random actions. And you can see I got one crash, even not a terrible crash, but it’s below negative 50, but everything else is positive and most of them were in the 200 range, okay? Even when it doesn’t get 200, it’s still like 150 here or something like that, right? Or a hundred. So that agent is starting to behave rather intelligently. So now, how do I do all this? If you wanna try out my code base, the code base is built using Python and Keras and TensorFlow. The main file is the Q learner interfaces file. Q learner interfaces.py. So this is all in Python.
[01:05:55] Blue: And it contains a set of interfaces for both a regular Q learning or the deep reinforcement learning. So I have the IQ model interface, which is the interface for either a QQ table or a deep deep reinforcement learning, approximation of a Q table. And then we have a learner that it goes with it. You have to, if you’re using a Q table, I have to use a Q learner for using a deep reinforcement learner approximation of it. You have to use the deep reinforcement learning agent instead, interface instead. So like 90 % of the code is all there and it’s still only about 400 lines of code. I think actually it’s a little above 400 now because I’ve changed it since I made this presentation. The code starts with a semi -good set of hyperparameters. So you don’t even initially have to set your hyperparameters. It will just use ones that I know are usually pretty okay. Then you can like kind of tweak it from there and run it over time and get better results. There’s a Q table.py table. Notice that most of these tables have like under 50 lines of code. This is the actual code for a Q table for a classical Q learner that’s built on top of the IQ model interface. The Q learner.py would be the learner that goes with it. There’s an environments.py, which is a wrapper for any environment. Now, why did I have to do that? Well, Open Gym environments. I didn’t want this agent to only be able to work with Open Gym. So I taught it to use this wrapper and then it’s very easy to wrap any Open Gym environment with this wrapper. What I mean by wrapper basically,
[01:07:27] Blue: it’s some sort of interface that says you can reset the environment, you can make a step in the environment or you can render the environment visually. Which are the three things that Open AI Gym needs to be able to do. And really you would need to be able to do for anything, any environment you wanted to create. It’s kind of the minimum necessary interface for an environment. It just simply has functions for that that you can then fill in. And by doing a subclass of it. And from there it allows you to make any environment you want and it kind of creates a standard, if you will. So then I have the DQN model.py and the DQN learner.py. Notice one of those is under 20 lines of code. This is the code that’s equivalent to the Q table and the Q learner, but for the deep version. And then finally we have the Open Gym examples.py which is the one I was running before. This just contains a bunch of examples, mostly commented out. You can comment the minute or out and try out the different examples. It has code for teaching an agent, for running the agent once it’s taught, for things like that. And the reason why I implemented like this is I really wanted to show the relationship between deep learning and regular reinforcement learning, deep reinforcement learning and regular reinforcement learning. And that it really is just a matter of you rip the one part out, the Q table and replace it with a neural network and everything else stays the same. If
[01:08:59] Blue: you're going to get into this, you would want to look at my code base, which is github.com slash BruceNielson, B-R-U-C-E-N-I-E-L-S-O-N, slash reinforcement-learning, or just go there and find it. You will find these slides there, by the way. And then you would want to look at OpenAI Gym. You can Google that, or it's gym.openai.com. I give here the link for Lunar Lander, and this is how you would go about that. And then if you really want to learn this, I would recommend the book Deep Learning with Python by François Chollet. He is the author of Keras. So TensorFlow is the deep learning library that Google uses internally and released open source. Keras was built on top of that to make it easier, because the Google version was a little harder to understand, and you were really getting into the muck of using matrices and tensors. Matrices would be 2D and tensors would be larger than 2D, but tensors are how we do machine learning. So Keras makes this super simple. In fact, let me show you how simple this actually is. Plus this book is good because it teaches you the whole deep learning field, at a very basic level of course, but it takes you through everything, and it's a short book. So I'm gonna go to the demo next, but let me show you the actual code so you can see how easy this actually is. Let's go to the deep Q model. And this code right here that I'm highlighting is the neural network code, because it uses Keras. It's only a few lines of code.
[01:10:45] Blue: Wow. It says, hey, here's this sequential model. The top layer is 150 nodes, the middle layer is 120 nodes. I got these sizes off the internet as a good setting and just started using them. You stack those together, and then the final layer is just the number of actions. That's the only thing that changes. So like the difference between the neural net I'm using for Lunar Lander and for CartPole is really just that last layer. Lunar Lander has four possible actions; CartPole has two actions. So you simply change the final layer for the number of actions. Now, if you had a difficult enough problem to solve, 150 nodes with a 120-node middle might not be enough anymore. You would have to experiment with other neural networks to find the one that works for your problem. But this one works for both of these problems. And this is it. This is nearly all the deep learning code. There's a little more: I had to write some code that created the mini-batch, and I had to write code that did the training, things like that, but there's not much. This is a fairly simple thing. So let's go ahead and reset this now, and I'm going to run Lunar Lander. Okay, it takes a second to get started up. Now, as we're starting that up, maybe let's start talking about how this might apply to what we've been talking about with animal intelligence. Okay, here's the Lunar Lander, by the way. There he goes. Yeah, I'll admit it: I picked my single best run, saved it off, and this is the best agent I was able to find. Normally it does not get this good.
[01:12:41] Blue: It usually gets competent but not great and still even this one crashes occasionally but this is a really pretty good agent. You can see it comes in for really soft landings.
[01:12:52] Red: Yeah, right.
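For listeners following along in the code, the few lines of Keras described above might look roughly like this. The 150- and 120-node hidden layers and the action-count output layer are straight from the episode; the ReLU activations, Adam optimizer, and mean-squared-error loss are assumed typical choices, not necessarily what the actual repo uses:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_q_network(state_size: int, num_actions: int) -> keras.Model:
    """Sketch of the DQN model described in the episode.

    Two hidden layers (150 and 120 units, the sizes mentioned on the show);
    the final layer has one output per action, each an estimated Q-value.
    """
    model = keras.Sequential([
        layers.Input(shape=(state_size,)),
        layers.Dense(150, activation="relu"),  # top layer: 150 nodes
        layers.Dense(120, activation="relu"),  # middle layer: 120 nodes
        layers.Dense(num_actions),             # one Q-value per action
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

# Lunar Lander: 8 state dimensions, 4 actions.
lander_net = build_q_network(8, 4)
```

Switching to CartPole really is just `build_q_network(4, 2)`: everything except the input and output sizes stays the same.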
[01:12:56] Blue: So could deep reinforcement learning maybe explain animal intelligence better than what we've talked about up to this point? I think the answer is yes, although I don't think it takes us very far. One of the things I've emphasized is that regular reinforcement learning using the Bellman equations is just not efficient enough to represent the types of things that animals can do. It needs thousands and thousands of examples to fill up that Q table before it can come up with an optimal policy. The reason you do it virtually is that a real lunar lander would have to crash hundreds or thousands of times to learn, and that costs money each time. So you don't wanna do that. And animals would all be dead by then; it wouldn't be a good strategy for them to use reinforcement learning. But in this case, I only had to train this a few hundred times. It was under a thousand; I don't remember exactly, somewhere between like 600 and 900 runs. It just depends on the run, how many times it trains. It eventually gets to the point where more training doesn't do any good anymore. But that's a much more reasonable number. And we talked about, with behavior parsing, that apes can learn by watching, which is not the same as reinforcement learning, I wanna emphasize that, but they can learn by watching a few hundred times. Maybe more, maybe fewer, because of the example of the ape that could rock the boat, the orangutan, I guess, that could rock the boat to get the water out. It probably hadn't seen that a few hundred times.
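For reference, the Bellman-style update that fills in that Q table is small enough to show in full. This is a generic textbook sketch of tabular Q-learning, not code pulled from the repo; the dictionary representation and the default alpha/gamma values are illustrative:

```python
def q_update(q, state, action, reward, next_state, actions,
             alpha=0.1, gamma=0.99):
    """One tabular Q-learning step:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))

    q is a dict mapping (state, action) pairs to estimated values;
    actions is the set of possible actions.
    """
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)

# The sample-inefficiency discussed above is visible here: every
# (state, action) pair gets its own cell, and each cell needs many
# visits before its value converges.
```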
[01:14:37] Blue: But deep reinforcement learning certainly makes this a far smaller number of trials to learn. And you don't have to fully understand the problem and the state space anymore; you still need a basic understanding, but a lot of that is kind of abstracted away now by the neural network. So I think that gets us maybe a quarter of the way further towards animal intelligence, to throw a neural network in there. But even that's a little misleading, because like I said, most of the time the neural network doesn't converge well; you just kind of do it by trial and error. Animals clearly have some sort of algorithm that works very efficiently and works most of the time, finds a good solution by trial and error with very few trials, in fact, right? Fewer than a few hundred in most cases. So I don't think even with deep reinforcement learning we can claim that we understand what is going on with animal intelligence, which is why I've kind of emphasized the gap of knowledge that exists there, the explanation gap. But it is interesting that this must be something at least a little more similar to what animals are doing. I've heard it said that neural networks are more like artificial intuition than artificial intelligence. And that's kind of true, right? They get to the point where they have this intuition: they know what to do, but nobody really understands why. So it's probably closer to artificial intuition, which is probably closer to what animals do. They gain an intuition through these trial-and-error learning systems as to what the right action is.
[01:16:20] Blue: They don't necessarily have some sort of explanatory understanding of what they're doing. There are kind of inputs and outputs, and they're reacting in some way based on reward systems that are in their valence system, the pain they feel and the pleasures they feel, things like that. And their explore-exploit trade-off comes from their curiosity. Animals are given a sort of natural curiosity, and that allows them to explore and to have this explore-exploit trade-off. And it's not an accident that there are similarities in wording between animal learning and reinforcement learning, because of course computer reinforcement learning was an idea that came from computer scientists going, hey, I wanna do something similar to animal reinforcement learning. And they ended up butchering it. They ended up making it something totally different. They used the terms wrong. Michael Littman, one of my teachers, talks about how anytime he tries to explain reinforcement learning to professors who work with animals, they can't understand him, because they cannot help but impose their own field's meanings on words that are being used in entirely different ways. Right, right. So you can see that they were trying to mimic animal learning with reinforcement learning. And this is how far we got, right? It's not great, but it's not completely terrible either, you know? We're getting something interesting, even if it's not really animal learning yet, right? So anyhow, that is our episode for today. I'll have to go back and eventually try to add more OpenAI Gym environments to this and see how far I can take it. But I found this, I was very fascinated by the idea of deep reinforcement learning.
[01:18:07] Blue: This whole idea that you can rip out the Q table and replace it with a neural network and, you know, by magic have it train itself, and it will come up with a good result. Better than regular reinforcement learning can do, in some cases.
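That "rip out the Q table, drop in a network" design can be sketched as a shared interface. The method names below (get_q_values, update) are illustrative, not necessarily the exact names in the repo's IQModel interface, but the shape of the idea is the same: two implementations, one contract, and the learner never knows which one it's talking to:

```python
from abc import ABC, abstractmethod

class IQModel(ABC):
    """Anything that can estimate and revise Q-values, tabular or neural."""

    @abstractmethod
    def get_q_values(self, state):
        """Return a list of estimated Q-values, one per action."""

    @abstractmethod
    def update(self, state, action, target):
        """Move Q(state, action) toward the given target value."""

class QTable(IQModel):
    """Classical version: exact values stored per state."""

    def __init__(self, num_actions):
        self.num_actions = num_actions
        self.table = {}

    def get_q_values(self, state):
        # Unseen states start with all-zero estimates.
        return self.table.setdefault(state, [0.0] * self.num_actions)

    def update(self, state, action, target):
        self.get_q_values(state)[action] = target

# A DQNModel(IQModel) would implement the same two methods with a neural
# network: get_q_values runs a forward pass, update does a gradient step
# on a mini-batch. The learner code calling these never has to change.
```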
[01:18:21] Red: It’s pretty fascinating and it’ll be interesting to watch how it all unfolds as we get more sophisticated because there’s obviously things going on that we don’t completely understand.
[01:18:34] Blue: Right. Yeah, you know, I would love to be involved with figuring out what animal intelligence really is, because it would be some sort of super cool algorithm that we have yet to discover, one that would be able to do things we just can't even imagine right now, right? So it would be very nice. And we know the algorithm must exist, because animals are doing it, right? Evolution found it. It can't even be that complex an algorithm, right?
[01:19:02] Red: Yeah.
[01:19:02] Blue: So it would be interesting. And it seems like one of the main things animals do that's just hard to figure out how to do well is that they can just abstract things. Like, when you're talking about animal learning, there are these kinds of starting assumptions. I'm currently reading the book Animal Learning Theory. It was a textbook recommended by Richard Byrne, since I had been reading Richard Byrne's books and we were making episodes on those. He said this book is the standard in understanding animal learning theory. So I bought that book and started reading it because I was curious. And the thing I noticed is they kind of just start with the assumption that animals can somehow generalize or abstract. And there's no discussion about it.
[01:19:50] Red: That’s a heck of an assumption though.
[01:19:52] Blue: Right, right. But we're so used to it. We know animals can do that, so it's just a given that they can do it, and that all animals can do it. But take the very fact that I can put something in an animal's food that makes it feel sick, and now it's going to know that it shouldn't eat that food again. You're never dealing with exactly the same circumstances; it somehow abstracts between them. It's just an assumption that they're able to do this, and yet they can. And that's what is actually so hard with machine learning: the amount of human-labeled ground truth we have to put together to try to make an abstraction like that. And then it easily gets defeated by adversarial examples or something like that, right? We really don't understand the idea of abstracting things very well. And animals somehow just do it, and it's amazing. And we never think about it as amazing, because we all know animals can do it; it's just a given. But that's the thing that would be interesting to try to figure out. And unfortunately, reading Animal Learning Theory, I learned lots of interesting things, but since the most interesting part is taken as a given, I don't think I can learn the thing I really need to learn, if I'm ever going to try to program it, by reading this book. So anyhow, those are my thoughts on how reinforcement learning ties into animal learning.
[01:21:18] Red: Well, fascinating subject and really a lot of interesting things happening here.
[01:21:24] Blue: Yeah, all right. Well, thank you, Cameo. See you next time, Bruce. All right, bye-bye.