
Reinforcement Learning: Q-Learning with the Hopping Robot

[This article was first published on R – Quality and Innovation, and kindly contributed to R-bloggers.]

Overview: Reinforcement learning uses “reward” signals to determine how to navigate through a system in the most valuable way. (I’m particularly interested in the variant of reinforcement learning called “Q-Learning” because the goal is to create a “Quality Matrix” that can help you make the best sequence of decisions!) I found a toy robot navigation problem on the web that was solved using custom R code for reinforcement learning, and I wanted to reproduce the solution in different ways than the original author did. This post describes different ways that I solved the problem described at

The Problem: Our agent, the robot, is placed at random on a board of wood. There’s a hole at s1, a sticky patch at s4, and the robot is trying to make appropriate decisions to navigate to s7 (the target). The image comes from the blog post linked above.

To solve a problem like this, you can use MODEL-BASED approaches if you know how likely it is that the robot will move from one state to another (that is, the transition probabilities for each action) or MODEL-FREE approaches (you don’t know how likely it is that the robot will move from state to state, but you can figure out a reward structure).

Solving an RL problem involves finding the optimal value functions (e.g. the Q matrix in Attempt 1) or the optimal policy (the State-Action matrix in Attempt 2). Although there are many techniques for reinforcement learning, we will use Q-learning because we don’t know the transition probabilities for each action. (If we did, we’d model it as a Markov Decision Process and use the MDPtoolbox package instead.) Q-Learning relies on traversing the system in many ways to update a matrix of average expected rewards from each state transition. The equation that it uses is from

For this to work, all states have to be visited a sufficient number of times, and all state-action pairs have to be included in your experience sample. So keep this in mind when you’re trying to figure out how many iterations you need.
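For reference, here is the one-step Q-learning update written in the same terms as the qlearn function in Attempt 1 (learning rate alpha, discount rate gamma, rewards matrix R). The symbols s, s' and s'' are my own notation, not the original post's: s is the current state, s' is the state moved to, and s'' ranges over the states that are feasible from s'. In qlearn an "action" is simply the choice of next state, which is why Q is indexed by two states.

```latex
Q(s, s') \leftarrow Q(s, s') + \alpha \left[ R(s, s') + \gamma \max_{s''} Q(s', s'') - Q(s, s') \right]
```

With alpha = 0.1, each update moves Q(s, s') one tenth of the way toward the target R(s, s') + gamma * max Q(s', s''), which is why many visits to each state-action pair are needed before the values settle.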

Attempt 1: Quick Q-Learning with qlearn.R

Set up the rewards matrix as a square matrix, with the states S1 through S7 down the rows and the same states, in the same order, along the columns:

hopper.rewards <- c(-10, 0.01, 0.01, -1, -1, -1, -1,
         -10, -1, 0.1, -3, -1, -1, -1,
         -1, 0.01, -1, -3, 0.01, -1, -1,
         -1, -1, 0.01, -1, 0.01, 0.01, -1,
         -1, -1, -1, -3, -1, 0.01, 100,
         -1, -1, -1, -1, 0.01, -1, 100,
         -1, -1, -1, -1, -1, 0.01, 100)

HOP <- matrix(hopper.rewards, nrow=7, ncol=7, byrow=TRUE)
> HOP
     [,1]  [,2]  [,3] [,4]  [,5]  [,6] [,7]
[1,]  -10  0.01  0.01   -1 -1.00 -1.00   -1
[2,]  -10 -1.00  0.10   -3 -1.00 -1.00   -1
[3,]   -1  0.01 -1.00   -3  0.01 -1.00   -1
[4,]   -1 -1.00  0.01   -1  0.01  0.01   -1
[5,]   -1 -1.00 -1.00   -3 -1.00  0.01  100
[6,]   -1 -1.00 -1.00   -1  0.01 -1.00  100
[7,]   -1 -1.00 -1.00   -1 -1.00  0.01  100

Here’s how you read this: the rows represent where you’ve come FROM, and the columns represent where you’re going TO. Each element 1 through 7 corresponds directly to S1 through S7 in the cartoon above. Each cell contains a reward (or penalty, if the value is negative) if we arrive in that state.

The S1 state is bad for the robot… there’s a hole in that piece of wood, so we’d really like to keep it away from that state. Location [1,1] on the matrix tells us what reward (or penalty) we’ll receive if we start at S1 and stay at S1: -10 (that’s bad). Similarly, location [2,1] on the matrix tells us that if we start at S2 and move left to S1, that’s also bad and we should receive a penalty of -10. The S4 state is also undesirable: there’s a sticky patch there, so we’d like to keep the robot away from it. Location [3,4] on the matrix represents the action of going from S3 to S4 by moving right, which will put us on the sticky patch, so it carries a penalty of -3.
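To make the FROM/TO reading concrete, here is a small sketch that looks up individual rewards and lists the destinations that pass the R > -1 feasibility test used by qlearn in the next step. The helper names reward_for and feasible_from are hypothetical (they are not part of the original post); the matrix is the one defined above.

```r
# Rewards matrix from the post: rows = FROM state, columns = TO state.
hopper.rewards <- c(-10, 0.01, 0.01, -1, -1, -1, -1,
                    -10, -1, 0.1, -3, -1, -1, -1,
                    -1, 0.01, -1, -3, 0.01, -1, -1,
                    -1, -1, 0.01, -1, 0.01, 0.01, -1,
                    -1, -1, -1, -3, -1, 0.01, 100,
                    -1, -1, -1, -1, 0.01, -1, 100,
                    -1, -1, -1, -1, -1, 0.01, 100)
HOP <- matrix(hopper.rewards, nrow = 7, ncol = 7, byrow = TRUE)

# Hypothetical helper: reward for moving from one state to another.
reward_for <- function(R, from, to) R[from, to]

# Hypothetical helper: destinations that pass the same R > -1 test qlearn uses.
feasible_from <- function(R, from) which(R[from, ] > -1)

reward_for(HOP, 2, 1)   # S2 -> S1 (the hole): -10
reward_for(HOP, 3, 4)   # S3 -> S4 (the sticky patch): -3
feasible_from(HOP, 1)   # from S1: states 2 and 3
feasible_from(HOP, 3)   # from S3: states 2 and 5
```

Note that entries of -3 and -10 fail the R > -1 test just as -1 does, so under that test the robot can start in S1 or S4 but is never moved into them.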

Now load the qlearn function into your R session:

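qlearn <- function(R, N, alpha, gamma, tgt.state) {
# Adapted from
  # R: rewards matrix (rows = FROM, columns = TO); N: number of episodes
  Q <- matrix(rep(0,length(R)), nrow=nrow(R))
  for (i in 1:N) {
    cs <- sample(1:nrow(R), 1)           # Start each episode in a random state
    while (1) {
      next.states <- which(R[cs,] > -1)  # Get feasible actions for cur state
      if (length(next.states)==1)        # There may only be one possibility
        ns <- next.states
      else
        ns <- sample(next.states,1)      # Or you may have to pick from a few
      if (ns > nrow(R)) { ns <- cs }
      # Q-learning update for the move cs -> ns
      Q[cs,ns] <- Q[cs,ns] + alpha*(R[cs,ns] + gamma*max(Q[ns, which(R[ns,] > -1)]) - Q[cs,ns])
      if (ns == tgt.state) break         # Episode ends at the target state
      cs <- ns
    }
  }
  # Assumption: the return step is reconstructed. Scaling Q so that its
  # largest entry is 100 matches the output shown below.
  return(round(100*Q/max(Q)))
}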

Run qlearn with the HOP rewards matrix, a learning rate of 0.1, a discount rate of 0.8, and a target state of S7 (the location to the far right of the wooden board). I ran 10,000 episodes (in each one, the robot is dropped randomly onto the wooden board and has to get to S7):

r.hop <- qlearn(HOP,10000,alpha=0.1,gamma=0.8,tgt.state=7) 
> r.hop
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]    0   51   64    0    0    0    0
[2,]    0    0   64    0    0    0    0
[3,]    0   51    0    0   80    0    0
[4,]    0    0   64    0   80   80    0
[5,]    0    0    0    0    0   80  100
[6,]    0    0    0    0   80    0  100
[7,]    0    0    0    0    0   80  100

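The Q-matrix encodes the best-value move from each state (the “policy”). Here’s how you read it: each row is the state you are in, and the column holding that row’s largest value is the best state to move to next. For example, the largest value in row 1 is 64, in column 3, so from S1 the best move is to hop to S3. In row 4, columns 5 and 6 tie at 80, so stepping right to S5 and hopping to S6 are equally good.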

Alternatively, the policy can be expressed as the best action from each of the 7 states: HOP, RIGHT, HOP, RIGHT, HOP, RIGHT, (STAY PUT)
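The policy can also be pulled out of the Q-matrix mechanically by taking the best column in each row. The sketch below re-enters the matrix exactly as printed above so that it runs on its own; the labelling step (turning "how many states to the right" into LEFT, STAY PUT, RIGHT or HOP) is my own hypothetical addition, not part of the original post.

```r
# Q-matrix as printed above (rows = FROM state, columns = TO state).
r.hop <- matrix(c(0, 51, 64, 0,  0,  0,   0,
                  0,  0, 64, 0,  0,  0,   0,
                  0, 51,  0, 0, 80,  0,   0,
                  0,  0, 64, 0, 80, 80,   0,
                  0,  0,  0, 0,  0, 80, 100,
                  0,  0,  0, 0, 80,  0, 100,
                  0,  0,  0, 0,  0, 80, 100),
                nrow = 7, ncol = 7, byrow = TRUE)

# Best next state per row. which.max() returns the FIRST maximum, so the
# tie in row 4 (S5 vs S6) resolves to S5.
best.next <- apply(r.hop, 1, which.max)

# Hypothetical labelling step: name each move by how far it travels.
offset <- best.next - seq_len(nrow(r.hop))
move.names <- c("-1" = "LEFT", "0" = "STAY PUT", "1" = "RIGHT", "2" = "HOP")
policy <- unname(move.names[as.character(offset)])
policy
```

The values themselves fall off by the discount rate: 100, 80, 64, 51 is 100 times 0.8^0 through 0.8^3, rounded, so each extra move needed to reach S7 costs one factor of gamma.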

Attempt 2: Use ReinforcementLearning Package

I also used the ReinforcementLearning package by Nicolas Proellochs (6/19/2017) described in

First, I created an “environment” that describes 1) how the states will change when actions are taken, and 2) what rewards will be accrued when that happens. I assigned a reward of -1 to every action that is not special; the special cases are landing on S1, landing on S4, and landing on S7. To be perfectly consistent with Attempt 1, I could have used 0.01 instead of -1, but the results will be similar. The values you choose for rewards are somewhat arbitrary, but you do need a comparatively large positive reward at your target state and “negative rewards” for states you want to avoid or moves that are physically impossible.

my.env <- function(state,action) {
   # Default: an action with no matching rule leaves the robot where it is
   next_state <- state
   if (state == state("s1") && action == "right")  { next_state <- state("s2") }
   if (state == state("s1") && action == "hop")    { next_state <- state("s3") }

   if (state == state("s2") && action == "left")  {
      next_state <- state("s1"); reward <- -10 }
   if (state == state("s2") && action == "right") { next_state <- state("s3") }
   if (state == state("s2") && action == "hop")   {
      next_state <- state("s4"); reward <- -3 }

   if (state == state("s3") && action == "left")  { next_state <- state("s2") }
   if (state == state("s3") && action == "right") {
      next_state <- state("s4"); reward <- -3 }
   if (state == state("s3") && action == "hop")   { next_state <- state("s5") }

   if (state == state("s4") && action == "left")  { next_state <- state("s3") }
   if (state == state("s4") && action == "right") { next_state <- state("s5") }
   if (state == state("s4") && action == "hop")   { next_state <- state("s6") }

   if (state == state("s5") && action == "left")  {
      next_state <- state("s4"); reward <- -3 }
   if (state == state("s5") && action == "right") { next_state <- state("s6") }
   if (state == state("s5") && action == "hop")   {
      next_state <- state("s7"); reward <- 10 }

   if (state == state("s6") && action == "left")  { next_state <- state("s5") }
   if (state == state("s6") && action == "right") {
      next_state <- state("s7"); reward <- 10 }

   # Note: this final if/else assigns reward on every call, so the -10 and -3
   # penalties set above are overwritten with -1.
   if (next_state == state("s7") && state != state("s7")) {
      reward <- 10
   } else {
      reward <- -1
   }
   out <- list(NextState = next_state, Reward = reward)
   return(out)
}
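The closing if/else in the environment function assigns a reward on every call, so the -10 and -3 penalties set earlier in the function do not survive. If you want those penalties to take effect, one option is to decide the reward from the landing state alone. The sketch below is an alternative under stated assumptions, not the code that produced the results in this post: hop.env is a hypothetical name of mine, it compares plain strings instead of calling the package's state() helper, and it gives any action that does not move the robot the default reward of -1.

```r
# Hypothetical variant of the post's environment (base R only).
hop.env <- function(state, action) {
  states <- paste0("s", 1:7)
  step   <- c(left = -1, right = 1, hop = 2)   # how far each action moves

  from <- match(state, states)
  to   <- from + step[[action]]

  # s7 has no moves out of it, and moves that would leave the board do nothing.
  if (from == 7 || to < 1 || to > 7) to <- from

  # Default reward of -1, then the special landings described in the post.
  reward <- -1
  if (to != from) {
    if (to == 1) reward <- -10   # fell into the hole
    if (to == 4) reward <- -3    # landed on the sticky patch
    if (to == 7) reward <- 10    # reached the target
  }
  list(NextState = states[to], Reward = reward)
}

hop.env("s2", "left")    # moves into the hole at s1
hop.env("s5", "hop")     # reaches the target s7
hop.env("s6", "hop")     # would leave the board, so the robot stays at s6
```

Because it returns a list with NextState and Reward, a function of this shape should be usable as the env argument in place of my.env, although the Q values it leads to will differ from the ones printed in this post.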

Next, I installed and loaded up the ReinforcementLearning package and ran the RL simulation:

library(ReinforcementLearning)
states <- c("s1", "s2", "s3", "s4", "s5", "s6", "s7")
actions <- c("left","right","hop")
data <- sampleExperience(N=3000,env=my.env,states=states,actions=actions)
control <- list(alpha = 0.1, gamma = 0.8, epsilon = 0.1)
model <- ReinforcementLearning(data, s = "State", a = "Action", r = "Reward", 
      s_new = "NextState", control = control)
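Because Q-learning only learns about state-action pairs it has actually seen, it can be worth tabulating the sampled experience before trusting the fitted model. The sketch below assumes a data frame with State and Action columns, which are the column names the ReinforcementLearning call above reads from data. To stay runnable without the package, it builds a stand-in data frame by random sampling rather than calling sampleExperience; the names experience, visits and unvisited are my own.

```r
set.seed(42)  # arbitrary seed, only to make the stand-in data repeatable

states  <- c("s1", "s2", "s3", "s4", "s5", "s6", "s7")
actions <- c("left", "right", "hop")

# Stand-in for the sampled experience: random state-action pairs.
experience <- data.frame(
  State  = sample(states,  3000, replace = TRUE),
  Action = sample(actions, 3000, replace = TRUE),
  stringsAsFactors = FALSE
)

# Visit counts for every state-action pair (a 7 x 3 table).
visits <- table(factor(experience$State,  levels = states),
                factor(experience$Action, levels = actions))
visits

# Pairs that were never sampled; zero rows means every pair was seen.
unvisited <- which(visits == 0, arr.ind = TRUE)
nrow(unvisited)
```

With real output from sampleExperience you would run the same table() call on data$State and data$Action; if some pairs have very few visits, that is a sign you need a larger N.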

Now we can see the results:

> print(model)
State-Action function Q
         hop     right      left
s1  2.456741  1.022440  1.035193
s2  2.441032  2.452331  1.054154
s3  4.233166  2.469494  1.048073
s4  4.179853  4.221801  2.422842
s5  6.397159  4.175642  2.456108
s6  4.217752  6.410110  4.223972
s7 -4.602003 -4.593739 -4.591626

     s1      s2      s3      s4      s5      s6      s7
  "hop" "right"   "hop" "right"   "hop" "right"  "left" 

Reward (last iteration)
[1] 223

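The recommended policy is: HOP, RIGHT, HOP, RIGHT, HOP, RIGHT, (STAY PUT). The “left” reported for s7 amounts to staying put, because the environment defines no moves out of s7.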

If you tried this example and it didn’t produce the same response, don’t worry! Model-free reinforcement learning is done by simulation, and when you used the sampleExperience function, you generated a different set of state transitions to learn from. You may need more samples, or to tweak your rewards structure, or both.
