Reinforcement Learning: Introduction
Reinforcement Learning is a scheme of training machine learning models in which a certain agent’s actions in an environment (typically in the form of moves in a game played by the agent) are adjusted over time. Adjustments are made by reinforcing those which lead to a good outcome (reward for winning), possibly suppressing those which lead to a bad outcome (punishment for losing).
A group of authors from DeepMind recently published a paper formulating the Reward-is-Enough hypothesis (Silver et al., 2021). The authors claim that the reward maximization mechanism may well be enough to explain the phenomenon of intelligence. This would mean the Reinforcement Learning framework, an embodiment of reward maximization, may be broad enough to encompass artificial general intelligence. With this in mind, let’s explore a simple and explicit example of Reinforcement Learning.
- Reinforcement Learning
- What (or who) is TicTacJoe?
- Why is TicTacJoe interesting?
- TicTacJoe’s state of mind
- Possibilities with reinforcement learning
What (or who) is TicTacJoe?
TicTacJoe is a Reinforcement Learning agent operating in the game of Tic-Tac-Toe (you can play around with it here). To play against TicTacJoe, click on the “Play a game” button. The probabilities of moves that TicTacJoe makes in a given round are displayed on the tiles before TicTacJoe picks one of them. Mind you, only nonsymmetric choices are considered. As you can see, when TicTacJoe makes his first move, all three choices are equally likely.
The interesting thing about TicTacJoe is that he can learn from playing the game. Click “Let TicTacJoe train” to see how he gradually becomes better at the game. He starts as a “Young Padawan” and eventually becomes a “Guru” – by then, almost always picking one of the optimal moves.
What’s TicTacJoe doing?
What actually happens under the hood? When TicTacJoe is in training, he plays multiple games against himself: 10,000 to become a Guru. Each time, we reward moves made by the agent which won a given game while discouraging the moves of the other agent (this is the reward mechanism in action!). By introducing a simple temperature-like mechanism, TicTacJoe explores all possibilities of moves. The mechanism prevents TicTacJoe from fixating on a given strategy too soon. You can find more details of the implementation below.
Play with TicTacJoe here!
A set of 3 graphs provided after launching the training shows TicTacJoe’s learning curve. They show how the three likelihoods of TicTacJoe’s first move evolve after each of the 10,000 games he plays to reach the top skill level.
Interested in Convolutional Neural Networks? Read our Introduction to Convolutional Neural Networks.
Why is TicTacJoe interesting?
TicTacJoe is a simple creature. He’s interesting because the inner workings of his brain have been coded explicitly in R, with no extra packages used. This makes TicTacJoe accessible for easier inspection. The code is available here. Read on to learn how it all works.
TicTacJoe’s state of mind
In this section, we dissect TicTacJoe’s brain to see how it functions.
Amplify your business with Appsilon’s custom Machine Learning solutions
We can show everything TicTacJoe knows about playing Tic-Tac-Toe in a graph. The graph represents the likelihood of TicTacJoe picking a given move when faced with a certain board configuration. A graph holding all such possibilities would have 9! = 362 880 nodes, with some pruning possible, since no further moves can be made after a game is won. To fit this information in the memory, we chose to reduce the graph in terms of symmetry. That’s why only 3 nonsymmetric options are available in the first move: corner, side, and center. After the reduction, there are only 765 nodes in our graph!
When initialized, the likelihood assigned to each edge is equally distributed among the edges with the same origin state on the board. But as the training progresses, moves that lead to losses are discouraged and less likely. Those which are beneficial to the agent are encouraged and have an increased likelihood.
There’s another trick that stabilizes training. A temperature-like mechanism is introduced so that the updated probabilities are processed with a softmax function dependent on a temperature parameter. This parameter gradually decreases from a high value to a low value. The high value encourages TicTacJoe to explore new moves, while the low value forces him to exploit his experience.
Possibilities with reinforcement learning
Interestingly, the approach presented above is possible in simple games like Tic-Tac-Toe, where (with some extra tricks, like symmetry reduction) all possible states of the board and their links can be stored in memory. However, this approach doesn’t scale to larger environments. In such cases, the agent needs to read the state of the environment and make a decision based on the outcome of this perception (possibly enriched with the history of such perceptions).
Let’s build something beautiful together
Appsilon provides innovative data analytics, machine learning, and managed services solutions for Fortune 500 companies, NGOs, and non-profit organizations. We deliver the world’s most advanced R Shiny applications, with a unique ability to rapidly develop and scale enterprise Shiny dashboards. Our proprietary machine learning frameworks allow us to deliver Computer Vision, NLP, and fraud detection prototypes in as little as one week.
Discover our growing list of open-source R Shiny packages at shiny.tools. If you find our open-source packages useful please consider dropping a star on your favorites at our Github. It helps let us know we’re on the right track. And if you have any comments or suggestions, swing by our feedback threads like the discussion at our new shiny.fluent package or submit a pull request. We value the R community’s input.