Restricted Boltzmann Machines in R


Restricted Boltzmann Machines (RBMs) are an unsupervised learning method (like principal component analysis). An RBM is a probabilistic, undirected graphical model. RBMs have become more popular in machine learning due to recent success in training them with contrastive divergence. They have proven useful in collaborative filtering, being one of the most successful methods in the Netflix challenge (paper). Furthermore, they have been instrumental in training deep learning models, which currently appear to be among the best models for image and digit recognition.

I do not want to go into too much detail, but I would like to give a high-level overview of RBMs. Edwin Chen has an introduction that is much better. The usual version of an RBM is a probabilistic model for binary vectors. An image can be represented as a binary vector if each pixel that is dark enough is coded as a 1 and the remaining pixels as 0's. In addition to the visible binary vector, it is assumed that there is a hidden binary vector, and each hidden unit is connected to each visible unit by a symmetric weight. The weights are represented by the matrix W, whose i,jth component is the weight between hidden unit i and visible unit j. It is important that there are no connections between hidden units or between visible units. The probability that visible unit j is 1 is the logistic function of the sum of the weights connecting it to the hidden units that are on. In math notation (where σ is the logistic, or sigmoid, function):

$$\Pr(v_j = 1 \mid h, W) = \sigma\left(\sum_i h_i W_{i,j}\right).$$

The weights are symmetric, so

$$\Pr(h_i = 1 \mid v, W) = \sigma\left(\sum_j v_j W_{i,j}\right).$$

As you can see, the model is defined by its conditional probabilities. The task is to find the weight matrix W that best explains the visible data for a given number of hidden units.
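
As a quick illustration (toy numbers of my own, not from the course data), both conditional probabilities are just the logistic function applied element-wise to a matrix product:

# Hypothetical toy RBM: 2 hidden units, 3 visible units (numbers made up for illustration)
W <- matrix(c( 0.5, -1.0, 0.2,
              -0.3,  0.8, 0.0), nrow=2, byrow=TRUE)
sigmoid <- function(x) 1/(1+exp(-x))
h <- c(1, 0)               # a hidden configuration
sigmoid(t(W) %*% h)        # Pr(v_j = 1 | h, W) for each visible unit j
v <- c(0, 1, 1)            # a visible configuration
sigmoid(W %*% v)           # Pr(h_i = 1 | v, W) for each hidden unit i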

I have been taking Geoff Hinton's Coursera course on neural networks, which I would recommend to anyone interested. One of the programming assignments was to fill in code to write an RBM in Matlab/Octave. I have since tried to find a version for R, but have not had any luck, so I decided to translate the code myself. Below is the code to calculate the weight matrix. It is fairly simple and I only use contrastive divergence with one Gibbs step (CD-1). The input data is p × n (visible units by data cases) rather than the usual n × p.

# rbm_w is a matrix of size <number of hidden units> by <number of visible units>
# visible_state is a matrix of size <number of visible units> by <number of data cases>
# hidden_state is a binary matrix of size <number of hidden units> by <number of data cases>
 
# Pr(h = 1 | v, W): logistic of W %*% v, computed for all data cases at once
visible_state_to_hidden_probabilities <- function(rbm_w, visible_state) {
  1/(1+exp(-rbm_w %*% visible_state))
}
 
# Pr(v = 1 | h, W): logistic of t(W) %*% h
hidden_state_to_visible_probabilities <- function(rbm_w, hidden_state) {
  1/(1+exp(-t(rbm_w) %*% hidden_state))
}
 
# Average goodness <h' W v> over the data cases
configuration_goodness <- function(rbm_w, visible_state, hidden_state) {
  out=0
  for (i in 1:dim(visible_state)[2]) {
    out=out+t(hidden_state[,i]) %*% rbm_w %*% visible_state[,i]
  }
  out/dim(visible_state)[2]
}
 
# Gradient of the goodness with respect to the weights: <h v'> averaged over cases
configuration_goodness_gradient <- function(visible_state, hidden_state) {
  hidden_state %*% t(visible_state)/dim(visible_state)[2]
}
 
# Draw a binary sample from a matrix of Bernoulli probabilities
sample_bernoulli <- function(mat) {
  dims=dim(mat)
  matrix(rbinom(prod(dims),size=1,prob=c(mat)),dims[1],dims[2])
}
 
# One step of contrastive divergence (CD-1)
cd1 <- function(rbm_w, visible_data) {
  visible_data = sample_bernoulli(visible_data)   # binarize the visible data
  H0=sample_bernoulli(visible_state_to_hidden_probabilities(rbm_w, visible_data))
  vh0=configuration_goodness_gradient(visible_data, H0)   # positive phase
  V1=sample_bernoulli(hidden_state_to_visible_probabilities(rbm_w, H0))   # reconstruction
  H1=visible_state_to_hidden_probabilities(rbm_w, V1)
  vh1=configuration_goodness_gradient(V1, H1)   # negative phase
  vh0-vh1
}
 
rbm <- function(num_hidden, training_data, learning_rate, n_iterations, mini_batch_size=100, momentum=0.9, quiet=FALSE) {
#   This trains a model that's defined by a single matrix of weights.
#   <num_hidden> is the number of hidden units
#   cd1 is a function that takes parameters <model> and <data> and returns the gradient (or approximate gradient, in the case of CD-1) of the goodness function that we're maximizing. The returned gradient is an array of the same shape as the provided <model> parameter.
#   This uses mini-batches, no weight decay, and no early stopping.
#   This returns the matrix of weights of the trained model.
  n=dim(training_data)[2]
  p=dim(training_data)[1]
  if (n %% mini_batch_size != 0) {
    stop("the number of training cases must be divisible by the mini_batch_size")
  }
  model = (matrix(runif(num_hidden*p),num_hidden,p) * 2 - 1) * 0.1
  momentum_speed = matrix(0,num_hidden,p)
 
  start_of_next_mini_batch = 1
  for (iteration_number in 1:n_iterations) {
    if (!quiet) {cat("Iter",iteration_number,"\n")}
    mini_batch = training_data[, start_of_next_mini_batch:(start_of_next_mini_batch + mini_batch_size - 1)]
    start_of_next_mini_batch = (start_of_next_mini_batch + mini_batch_size) %% n
    gradient = cd1(model, mini_batch)
    momentum_speed = momentum * momentum_speed + gradient
    model = model + momentum_speed * learning_rate
  }
  return(model)
}
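
The configuration_goodness function is not called inside the training loop above, but it can double as a rough progress check. Here is a small sketch of my own, using only the functions defined above (the name check_goodness is hypothetical, not from the original code):

# Hypothetical progress check: average goodness <h' W v> on a batch of data,
# with hidden states sampled from the current model (not part of the original code)
check_goodness <- function(rbm_w, visible_data) {
  v <- sample_bernoulli(visible_data)
  h <- sample_bernoulli(visible_state_to_hidden_probabilities(rbm_w, v))
  configuration_goodness(rbm_w, v, h)
}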


I loaded the handwritten digit data that was provided in the class. To train the RBM, use the syntax below.

weights=rbm(num_hidden=30, training_data=train, learning_rate=.09, n_iterations=5000,
            mini_batch_size=100, momentum=0.9)
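
One informal way to spot-check the trained weights (my own addition, not part of the original post) is to push a training image up to the hidden layer and back down. This assumes train is the matrix of binarized digit images and that each image is 16 × 16 (256 visible units), as the plotting code further below also assumes.

# Hypothetical sanity check: one up-down pass through the trained RBM
v0      <- sample_bernoulli(train[, 1, drop=FALSE])            # first training image, binarized
h_samp  <- sample_bernoulli(visible_state_to_hidden_probabilities(weights, v0))
v_recon <- hidden_state_to_visible_probabilities(weights, h_samp)
par(mfrow=c(1, 2))
image(matrix(v0, 16, 16), main="original")                     # assumes 16 x 16 images
image(matrix(v_recon, 16, 16), main="reconstruction")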


After training the weights, I can visualize them. Below is a plot of the strength of the weights going to each pixel. Each facet displays the weights going to/coming from one of the hidden units. I trained 30 units so that it would be easy to show them all in one plot. You can see that most of the hidden units correspond strongly to one digit or another. I think this means the model is capturing the joint structure of the pixels: pixels that make up a given digit tend to be darkened together often.

library(ggplot2)
library(reshape2)
# Reshape the weight matrix so each hidden unit's weights form a 16 x 16 image
mw <- melt(weights)
mw$Var3 <- floor((mw$Var2-1)/16) + 1   # pixel row
mw$Var2 <- (mw$Var2-1) %% 16 + 1       # pixel column
mw$Var3 <- 17 - mw$Var3                # flip rows so the digits are not upside down
ggplot(data=mw)+geom_tile(aes(Var2,Var3,fill=value))+facet_wrap(~Var1,nrow=5)+
  scale_fill_continuous(low='white',high='black')+coord_equal()+
  labs(x=NULL,y=NULL,title="Visualization of Weights")+
  theme(legend.position="none")

