OpenAI created a toolkit, OpenAI Gym, for AI enthusiasts to test their artificial intelligence on. It provides environments, such as old Atari games, to test AIs in. So I got experimenting and made this with Gym's lunar lander environment:

The AI is given sensor readings such as altitude, velocity, and angular rotation, and outputs a number that corresponds to one of four actions: activate the left side booster, activate the right side booster, turn on the main booster, or do nothing.

The AI

The algorithm I used is a neural network that learns through policy gradients. Neural networks are function approximators loosely inspired by how neurons in the human brain work. As such, they can be applied to a wide variety of problems in fields such as banking and medicine, making them all the rage in the current world of AI. Potentially the basis for superintelligent AI? Maybe...

With policy gradients, the neural network outputs a probability distribution of what it thinks the best action is (what would maximize its rewards).

How Policy Gradients Work (One Variation)

Policy gradient methods optimize the parameters of a policy through gradient descent. In this case, I used a popular variation, known as REINFORCE, that was introduced by Ronald Williams all the way back in 1992.

With this class of policy gradients, we give the neural network the sensor readings of the spacecraft, and it outputs a probability distribution over the actions, reflecting how confident it is that each one is the best (e.g. 60% confident the main booster is the best action, 20% confident the left side booster is the best action, etc.).
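As a rough illustration of that last step, here is a minimal sketch of how raw network outputs could be turned into a probability distribution. It assumes a softmax output layer, which is a common choice but not something stated above; the action ordering and the logit values are made up for the example.

```python
import math

# Hypothetical action names and ordering, for illustration only.
ACTIONS = ["do nothing", "left booster", "main booster", "right booster"]

def softmax(logits):
    """Turn raw network outputs (logits) into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits standing in for the network's final-layer output.
probs = softmax([0.1, 0.9, 2.0, 0.2])
for name, p in zip(ACTIONS, probs):
    print(f"{name}: {p:.2f}")
```

The largest logit gets the largest probability, but every action keeps a nonzero chance of being picked.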

We look at the action that the AI thinks is the best one (main booster), and compute the gradient that would make the chosen action more likely (by labeling the chosen action as the correct action and backpropagating the error).

Here’s the magic of policy gradients. Before we apply the gradient, we multiply the gradient by the reward that the AI earns for that action.

[Figure: the policy gradient process]

Think about that for a second. Say an action is a good one; then the reward is positive and large in magnitude. Multiplying the gradient by that reward scales up the push toward picking that action and makes the AI more confident that this is a good decision. If, however, the action is a bad one, then the reward is negative and, as a result, the gradient is reversed, making the AI less confident in that action and less likely to choose it in similar scenarios.
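To make the sign flip concrete, here is a small sketch under simplifying assumptions: it treats the four logits themselves as the trainable parameters instead of backpropagating through a real network, and the function names and learning rate are illustrative. For a softmax policy, the gradient of the log-probability of the chosen action with respect to the logits is the one-hot vector of that action minus the probabilities.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reward_scaled_gradient(logits, action, reward):
    """Gradient of log p(action) with respect to the logits, times the reward."""
    probs = softmax(logits)
    return [reward * ((1.0 if i == action else 0.0) - p)
            for i, p in enumerate(probs)]

def step(logits, grad, learning_rate=0.1):
    """One gradient-ascent step on the logits."""
    return [x + learning_rate * g for x, g in zip(logits, grad)]

logits = [0.0, 0.0, 0.0, 0.0]  # all four actions start equally likely (25% each)
action = 2                     # index standing in for "main booster" (arbitrary)

# Same action, same gradient direction, opposite rewards.
after_good = softmax(step(logits, reward_scaled_gradient(logits, action, +1.0)))
after_bad = softmax(step(logits, reward_scaled_gradient(logits, action, -1.0)))

print(f"after a positive reward: {after_good[action]:.3f}")
print(f"after a negative reward: {after_bad[action]:.3f}")
```

With a positive reward the chosen action ends up above 25%, and with a negative reward it drops below 25%.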

In Reality

Just one problem. The AI is still, frankly, dumb as nails. In practice, many reinforcement learning algorithms suffer from numerous problems such as the credit assignment problem (delayed rewards) and other factors arising from the complex nature of playing games in a smart and strategic way. So let’s fix that.


First, we normalize all the rewards by subtracting the mean and dividing by the standard deviation, so that large rewards don't push the weights around as violently. Imagine the AI lands and earns a reward of 100: the gradient would be massive! It would be like crushing a puppy under a metric ton of treats for successfully fetching a frisbee.

Let's avoid the bloodbath.

Furthermore, just like normalization of data helps neural networks converge, normalization of rewards makes convergence easier and less chaotic.
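Here is a minimal sketch of that normalization, using only the standard library. The function name and the sample rewards are made up, and the guard for a zero standard deviation is an addition of mine to avoid dividing by zero when every reward is identical.

```python
import statistics

def normalize_rewards(rewards):
    """Shift rewards to zero mean and scale them to unit standard deviation."""
    mean = sum(rewards) / len(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # Every reward is identical, so none is better or worse than average.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Made-up rewards: a few small step rewards and one big landing bonus.
normalized = normalize_rewards([-0.3, 1.2, -2.0, 100.0])
print([round(r, 2) for r in normalized])
```

The landing bonus still stands out after normalization, but it can no longer be a hundred times larger than everything else.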


Right now the AI only considers what is best at the current time step and not over the entire game (thinking short term instead of long term) and, as such, is likely to fall into local maxima. If, however, we add a bit of the reward it earns in the next time step, then it will think more about the future and treat future rewards as more important.

This is called discounting.

Each time we play a game, we keep track of the rewards the AI earns. After the game finishes, we work backward from the last time step to the first one, adding to each time step's reward the (already discounted) reward of the next time step multiplied by a discount rate.
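The backward pass described here can be sketched in a few lines. The function name and the three-step reward sequence are illustrative, and the discount rate of 0.8 is chosen only to keep the arithmetic easy to follow.

```python
def discount_rewards(rewards, discount_rate):
    """Walk backward through a game, folding future rewards into each step."""
    discounted = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # This step's reward plus the discounted total of everything after it.
        running = rewards[t] + discount_rate * running
        discounted[t] = running
    return discounted

discounted = discount_rewards([10, 0, -50], discount_rate=0.8)
print(discounted)  # → [-22.0, -40.0, -50.0]
```

The first action earned +10 on its own, but once the crash two steps later is folded in (10 + 0.8 × 0 + 0.8² × -50), it ends up with -22.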

[Figure: how discounting works]

Exploration vs Exploitation

Should the AI explore its choices or stick to its guns and perfect its current actions? Practice 10,000 kicks or perfect one kick? Right now, since we are labeling the AI's most confident action as the correct one and assessing it by multiplying its gradient by the reward, if the AI finds an action it really likes and that works decently, then it will continue to choose that action.

As a result, it will only go down that path; however, we need the AI to not just go down that one path (perfect that one kick) but explore a bit (try a new kick). One thing we could do is label a random action as the right one instead, calculate the gradient based on that, and multiply it by the reward.

Say the AI gives turning on the main booster 0.8 and the right side booster 0.2. Then we should randomly pick between the two, choosing the main booster with 80% probability and the right side booster with 20% probability. This gives the AI room to explore while still making its more confident decisions more likely. We can do this with TensorFlow's multinomial function.
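As a framework-free stand-in for that sampling step, here is a sketch that uses Python's `random.choices` in place of TensorFlow. The function name and the seed are illustrative.

```python
import random

def sample_action(probs, rng):
    """Pick an action index at random, weighted by the policy's probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

rng = random.Random(0)  # fixed seed so the run is repeatable
probs = [0.8, 0.2]      # main booster, right side booster
counts = [0, 0]
for _ in range(10_000):
    counts[sample_action(probs, rng)] += 1

print(counts)
```

Over many draws the main booster is chosen roughly four times as often as the side booster, yet the less favored action still gets tried regularly.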

With policy gradients, knowing how they work just isn't enough. A practical application of policy gradients requires so many other concepts jumbled together to form an AI that actually works. But such is the messiness of everyday life. However, with that over, we can actually start making the AI a reality!

The Whole Process

Let’s look specifically at the process we use to train the policy:

  1. Initialize a 4-layer neural network with random weights
  2. Play a couple of games, each time recording the rewards and the gradients
  3. Apply discounting and normalization to the rewards. (I found having a discount rate of 0.95 is too low and the AI becomes stuck in a local maximum, hovering several pixels above the ground, unwilling to face the prospect of crashing. So I used 0.99 instead, forcing the AI to face the end game: land or crash.)
  4. Multiply the rewards with the gradients
  5. Average all the gradients and apply the single averaged gradient to the neural network through gradient descent, in this case with the Adam optimizer
  6. Repeat steps 2 through 5
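The loop above can be sketched end to end on a toy problem. This is not the lunar lander setup: it assumes a made-up one-step "game" with two actions where action 0 always earns 1 and action 1 earns 0, a pair of logits standing in for the 4-layer network, and plain gradient ascent standing in for Adam. Because each game lasts a single step, discounting has nothing to do here and only normalization is applied.

```python
import math
import random

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def normalize(values):
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std if std > 0 else 0.0 for v in values]

def train(iterations=300, games_per_update=10, learning_rate=0.1, seed=0):
    rng = random.Random(seed)
    logits = [0.0, 0.0]                     # step 1: stand-in for network weights
    for _ in range(iterations):             # step 6: repeat
        rewards, grads = [], []
        for _ in range(games_per_update):   # step 2: play games, record results
            probs = softmax(logits)
            action = rng.choices([0, 1], weights=probs, k=1)[0]
            rewards.append(1.0 if action == 0 else 0.0)  # toy reward
            # Gradient of log p(action) with respect to the logits.
            grads.append([(1.0 if i == action else 0.0) - p
                          for i, p in enumerate(probs)])
        scores = normalize(rewards)         # step 3: normalize the rewards
        for i in range(len(logits)):        # steps 4 and 5: weight, average, apply
            mean_grad = sum(s * g[i] for s, g in zip(scores, grads)) / len(grads)
            logits[i] += learning_rate * mean_grad
    return softmax(logits)

final_probs = train()
print([round(p, 3) for p in final_probs])
```

After training, the policy puts most of its probability on the rewarded action, which is the same mechanism the lander relies on, just at a much smaller scale.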

In Conclusion

Here we see some of the power of neural networks used in conjunction with policy gradients. Using only the raw data from the game, such as linear velocity, and with no understanding of rockets or gravity, neural networks can actually learn to land spacecraft. I had friends try playing the lunar lander game, but their lunar lander tipped over or smashed into the ground every single time. That the AI learned to land at all is incredible, and it shows how powerful these algorithms can be.

Of course, the full code is on GitHub. I am still very much a beginner at reinforcement learning and based much of the code on the book Hands-On Machine Learning with Scikit-Learn and TensorFlow, which is a great resource for learning machine learning and AI. Hope you enjoyed this post!