Reinforcement Learning in Python

Labelling large amounts of data is a major bottleneck in traditional supervised machine learning. Even when enough data is available, a supervised ML model still needs direct human supervision. What if artificial intelligence could be trained without these requirements?

The best-known example is AlphaGo, created by Google DeepMind. The program learned its game strategies using a combination of deep learning and reinforcement learning. As a result of this training, in 2016 AlphaGo won a five-game match of Go (one of the most challenging games, requiring deep logical thinking) against Lee Sedol, one of the strongest players in the world, by 4 games to 1. Lee Sedol was quite surprised by the moves the program made. He expected them to follow from the calculation of various probabilities, but during the match the program played creatively, in ways he struggled to answer. He found the match increasingly difficult, which illustrates the level the program had reached.

Reinforcement learning removes the need to label a large volume of data. Instead, the model learns by trial and error, in a way that resembles how humans learn. In simple terms, this learning process rests on two entities: the agent and the environment.

The cycle looks like this: the agent observes the current state and sends an action to the environment. The environment, in turn, returns a new state and a reward that reflects how good the action was. By maximizing the total reward, the model learns to improve its behaviour.
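As a minimal sketch of this cycle, the Python snippet below runs one episode in a toy environment. The `CorridorEnv` class, its `reset` and `step` methods, and the reward of 1.0 for reaching the last cell are assumptions made for illustration; they do not come from the article or from any particular library.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: a corridor of cells 0..4 with the goal at cell 4."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        # Start every episode at the left end of the corridor.
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (left) or +1 (right); the walls clamp the position.
        self.state = min(max(self.state + action, 0), self.length - 1)
        done = self.state == self.length - 1
        # The environment, not the agent, decides the reward.
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def run_episode(env, policy, max_steps=100):
    """Run the state -> action -> reward cycle until the goal or the step cap."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                  # the agent chooses an action
        state, reward, done = env.step(action)  # the environment responds
        total_reward += reward
        if done:
            break
    return total_reward


def random_policy(state):
    # An untrained agent: it ignores the state and moves at random.
    return random.choice([-1, 1])


print(run_episode(CorridorEnv(), random_policy))
```

The policy is passed in as a plain function so that a learning agent could later replace `random_policy` without touching the loop.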

Environment and Agents

The agent must make decisions based on previous experience. The environment is everything the agent cannot control directly but can interact with. The environment changes when the agent acts. It is also worth noting that the logic used to calculate the rewards is part of the environment.


The key task of the agent is to gain the maximum reward. The reward takes the form of a score that varies with how well the agent performs. The agent estimates the outcome of each available action and then chooses the one that should bring the maximum reward. The simpler the reward system, the less control there is over the agent's behaviour. However, one question arises: how can an agent predict the reward it will receive for its actions?

Episodic Play

The agent needs a function that estimates the expected future reward for each state within a single episode. These estimates are adjusted as the agent acts, until they converge on the “true” rewards for each state, as set by the environment.

There are many different ways for an algorithm to estimate the reward, and the choice depends mainly on the environment and its complexity. One example is deep Q-learning, which DeepMind used to play Atari games: a neural network predicts the expected reward of each action. (AlphaGo itself combined policy and value networks with Monte Carlo tree search.) This prediction is trained on a random sample of the agent's past experience.
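The random sampling of past experience mentioned above is commonly implemented as an "experience replay" buffer. The sketch below shows only the buffer, not the neural network; the `ReplayBuffer` class, its method names, and the tuple layout are hypothetical choices for illustration.

```python
import random
from collections import deque


class ReplayBuffer:
    """Hypothetical fixed-size memory of past (state, action, reward, next_state, done) steps."""

    def __init__(self, capacity=1000):
        # A deque with maxlen silently drops the oldest entry once it is full.
        self.memory = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Draw distinct transitions uniformly at random, without replacement.
        size = min(batch_size, len(self.memory))
        return random.sample(list(self.memory), size)

    def __len__(self):
        return len(self.memory)


buffer = ReplayBuffer(capacity=3)
for step in range(5):
    # Made-up transitions along a corridor, purely to fill the buffer.
    buffer.add(state=step, action=1, reward=0.0, next_state=step + 1, done=False)

batch = buffer.sample(2)
print(len(buffer), len(batch))
```

Sampling at random, rather than replaying steps in order, breaks up the correlation between consecutive steps, which is the usual motivation for this design.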

In practice, the estimate for a given state is updated by adding a certain percentage of the difference between the target (actual) reward and the initial (expected) one. Those who are familiar with classical machine learning models will recognise this percentage as the learning rate. The higher the percentage, the faster the model approaches the target reward, but the more likely it is to overshoot the true value.
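The update described above can be written as a one-line function. This is a minimal sketch; the name `update_value` and the example numbers are assumptions chosen for illustration.

```python
def update_value(expected, target, learning_rate):
    """Move the expected reward a fraction of the way towards the target reward."""
    return expected + learning_rate * (target - expected)


value = 0.0
for _ in range(3):
    # With a learning rate of 0.5, each update closes half of the remaining gap.
    value = update_value(value, target=1.0, learning_rate=0.5)
    print(value)  # prints 0.5, then 0.75, then 0.875
```

With `learning_rate=1.0` the estimate jumps straight to the latest target, and with `learning_rate=0.0` it never changes.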

It is also worth noting that the learning rate should not be equal to 0, since in that case the agent would learn nothing from the state.

Explore vs. Exploit

A purely greedy agent always takes the action that should lead to the greatest immediate reward. However, what if accepting a smaller reward now brings a larger reward later? In human terms, these are “long-term benefits”: buying shares instead of a car may bring far more benefit in the future. This dilemma is called “explore vs. exploit”.

An agent that always performs the action with the maximum known reward will most likely never find another solution that is much better suited to the task. At the same time, an agent that keeps trying different approaches will spend a lot of time searching for the most effective solution. That is why most learning algorithms combine exploitation and exploration. Over time, once the agent has found good options, the share of time spent exploring can be reduced so that the agent mostly exploits what it has learned. This lets it settle on good solutions much faster, although how fast depends heavily on the complexity of the environment.
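One common way to combine the two, shown here as a sketch rather than as the article's own method, is an epsilon-greedy rule: with probability `epsilon` the agent tries a random action, otherwise it takes the best-known one, and `epsilon` shrinks over time. The function names, the decay rate of 0.99, and the floor of 0.05 are assumptions.

```python
import random


def choose_action(action_values, epsilon, rng=random):
    """Return an action index: a random one with probability epsilon, else the best known."""
    if rng.random() < epsilon:
        return rng.randrange(len(action_values))  # explore
    # Exploit: pick the index of the highest estimated reward.
    return max(range(len(action_values)), key=lambda a: action_values[a])


def decay_epsilon(epsilon, rate=0.99, minimum=0.05):
    """Shrink the exploration rate, but never below a small floor."""
    return max(minimum, epsilon * rate)


estimates = [0.1, 0.9, 0.3]  # made-up reward estimates for three actions
epsilon = 1.0                # start by exploring all the time
for _ in range(500):
    action = choose_action(estimates, epsilon)
    epsilon = decay_epsilon(epsilon)

print(epsilon)  # prints 0.05, the floor, after enough decay steps
```

Keeping a small floor on `epsilon` means the agent never stops exploring entirely, which can help if the environment changes.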

Putting it all Together

For each task that needs to be solved, you need to specify the environment, and the agent needs to build state-reward pairs. After each action, the agent records the updated state and the reward it received before moving on.

After completing the task, the agent reviews its state history and updates its table of estimates using the chosen learning algorithm. It is worth running the code several times while changing the learning rate and other factors, to analyse the results and improve performance.
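The workflow in this section can be sketched end to end as follows. Everything specific here is an assumption made for illustration: the five-cell corridor, the reward of -1 per step and 0 at the goal, the agent's ability to look up the value of neighbouring cells, and all function and parameter names.

```python
import random

LENGTH = 5         # corridor cells 0..4; the goal is the last cell
ACTIONS = (1, -1)  # right, left (right wins ties when estimates are equal)


def move(state, action):
    # The walls clamp the position to the corridor.
    return min(max(state + action, 0), LENGTH - 1)


def reward_for(state):
    # Assumed reward scheme: -1 for every step, 0 on reaching the goal.
    return 0.0 if state == LENGTH - 1 else -1.0


def choose_action(values, state, epsilon, rng):
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)  # explore
    # Exploit: move towards the neighbouring cell with the highest estimate.
    return max(ACTIONS, key=lambda a: values[move(state, a)])


def train(episodes=200, learning_rate=0.1, epsilon=0.3, decay=0.99, seed=0):
    rng = random.Random(seed)
    values = [0.0] * LENGTH  # the table: one reward estimate per state
    for _ in range(episodes):
        state, history = 0, []
        for _ in range(100):  # step cap so an episode always ends
            state = move(state, choose_action(values, state, epsilon, rng))
            history.append((state, reward_for(state)))  # state-reward pair
            if state == LENGTH - 1:
                break
        # After the episode, walk the history backwards and nudge each
        # visited state's estimate towards the reward that followed it.
        target = 0.0
        for visited, reward in reversed(history):
            values[visited] += learning_rate * (target - values[visited])
            target += reward
        epsilon *= decay  # explore less as the table improves
    return values


print(train())
print(train(learning_rate=0.5))  # rerun with a different learning rate to compare
```

After enough episodes, cells nearer the goal should hold higher (less negative) estimates than cells further away, though the exact numbers depend on the seed and the parameters.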