[This article was first published on R – Quality and Innovation, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This example is drawn from “A Painless Q-Learning Tutorial” at http://mnemstudio.org/path-finding-q-learning-tutorial.htm which explains how to manually calculate iterations using the updating equation for Q-Learning, based on the Bellman Equation (image from https://www.is.uni-freiburg.de/ressourcen/business-analytics/13_reinforcementlearning.pdf):

The example explores path-finding through a house:

The question to be answered here is: What’s the best way to get from Room 2 to Room 5 (outside)? Notice that by answering this question using reinforcement learning, we will also know how to find optimal routes from any room to outside. And if we run the iterative algorithm again for a new target state, we can find out the optimal route from any room to that new target state.

Since Q-Learning is model-free, we don’t need to know how likely it is that our agent will move between any room and any other room (the transition probabilities). If you had observed the behavior in this system over time, you might be able to find that information, but it many cases it just isn’t available. So the key for this problem is to construct a Rewards Matrix that explains the benefit (or penalty!) of attempting to go from one state (room) to another.

Assigning the rewards is somewhat arbitrary, but you should give a large positive value to your target state and negative values to states that are impossible or highly undesirable. Here’s the guideline we’ll use for this problem:

• -1 if “you can’t get there from here”
• 0 if the destination is not the target state
• 100 if the destination is the target state

We’ll start constructing our rewards matrix by listing the states we’ll come FROM down the rows, and the states we’ll go TO in the columns. First, let’s fill the diagonal with -1 rewards, because we don’t want our agent to stay in the same place (that is, move from Room 1 to Room 1, or from Room 2 to Room 2, and so forth).

Next, let’s move across the first row. Starting in Room 0, we only have one choice: go to Room 4. All other possibilities are blocked (-1):

Now let’s fill in the row labeled 1. From Room 1, you have two choices: go to Room 3 (which is not great but permissible, so give it a 0) or go to Room 5 (the target, worth 100)!

Continue moving row by row, determining if you can’t get there from here (-1), you can but it’s not the target (0), or it’s the target(100). You’ll end up with a final rewards matrix that looks like this:

Now create this rewards matrix in R:

```R <- matrix(c(-1, -1, -1, -1, 0, 1,
-1, -1, -1, 0, -1, 0,
-1, -1, -1, 0, -1, -1,
-1, 0, 0, -1, 0, -1,
0, -1, -1, 0, -1, 0,
-1, 100, -1, -1, 100, 100), nrow=6, ncol=6, byrow=TRUE)
```

And run the code. Notice that we’re calling the target state 6 instead of 5 because even though we have a room labeled with a zero, our matrix starts with a 1s so we have to adjust:

```source("https://raw.githubusercontent.com/NicoleRadziwill/R-Functions/master/qlearn.R")

results <- q.learn(R,10000,alpha=0.1,gamma=0.8,tgt.state=6)
> round(results)
[,1] [,2] [,3] [,4] [,5] [,6]
[1,]    0    0    0    0   64   80
[2,]    0    0    0   51    0   80
[3,]    0    0    0   51    0    0
[4,]    0   64   41    0   64    0
[5,]   64    0    0   51    0   80
[6,]    0   84    0    0   84  100
```

You can read this table of average value to obtain policies. A policy is a “path” through the states of the system:

• Start at Room 0 (first row, labeled 1): Choose Room 4 (80), then from Room 4 choose Room 5 (100)
• Start at Room 1: Choose Room 5 (100)
• Start at Room 2: Choose Room 3 (64), from Room 3 choose Room 1 or Room 4 (80); from 1 or 4 choose 5 (100)
• Start at Room 3: Choose Room 1 or Room 4 (80), then Room 5 (100)
• Start at Room 4: Choose Room 5 (100)
• Start at Room 5: Stay at Room 5 (100)

To answer the original problem, we would take route 2-3-1-5 or 2-3-4-5 to get out the quickest if we started in Room 2. This is easy to see with a simple map, but is much more complicated when the maps get bigger.

To leave a comment for the author, please follow the link and comment on their blog: R – Quality and Innovation.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

# Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts.(You will not see this message again.)

Click here to close (This popup will not appear again)