(This article was first published on **Statistically Significant**, and kindly contributed to R-bloggers)

Restricted Boltzmann Machines (RBMs) are an unsupervised learning method (like principal components). An RBM is a probabilistic, undirected graphical model. They are becoming more popular in machine learning due to recent success in training them with contrastive divergence. They have proven useful in collaborative filtering, being one of the most successful methods in the Netflix challenge (paper). Furthermore, they have been instrumental in training deep learning models, which appear to be the best current models for image and digit recognition.

I do not want to go into too much detail, but I would like to give a high-level overview of RBMs. Edwin Chen has an introduction that is much better. The usual version of an RBM is a probabilistic model for binary vectors. An image can be represented as a binary vector: each pixel that is dark enough is coded as a 1, and the non-dark pixels are 0's. In addition to the visible binary vector, it is assumed that there is a hidden binary vector, and each hidden unit is connected to each visible unit by symmetric weights. The weights are collected in a matrix W with one row per hidden unit and one column per visible unit.
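As a minimal sketch of the binarization step (the 16x16 size and 0.5 threshold are arbitrary choices of mine, not from the original post):

```r
# Minimal sketch: binarize a toy 16x16 grayscale image (values in [0, 1])
# at an arbitrary threshold of 0.5, then flatten it into a binary vector.
img <- matrix(runif(16 * 16), 16, 16)    # stand-in for a grayscale image
v   <- as.numeric(as.vector(img) > 0.5)  # 1 = dark pixel, 0 = light pixel
```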

The weights are symmetric, so the same matrix W defines both conditional distributions: P(h_j = 1 | v) = 1 / (1 + exp(-&sum;_i W_ji v_i)) and P(v_i = 1 | h) = 1 / (1 + exp(-&sum;_j W_ji h_j)).

As you can see, the model is defined by its conditional probabilities. The task is to find the weight matrix W that makes the training data as probable as possible.
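In standard energy-based notation (my addition, not the original post's equations), the bias-free model that the code below trains can be written as:

```latex
% Bias-free RBM, matching the code: the "goodness" computed by
% configuration_goodness is the negative energy h^T W v.
P(v, h) \propto e^{h^\top W v}
```

Maximizing the likelihood of the training data pushes the weights toward configurations where observed images and their inferred hidden vectors have high goodness.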

I have been taking Geoff Hinton's Coursera course on neural networks, which I would recommend to anyone interested. One of the programming assignments was to fill in code to write an RBM in Matlab/Octave. I have since tried to find a version for R, but have not had any luck, so I decided to translate the code myself. Below is the code to calculate the weight matrix. It is fairly simple and I only use contrastive divergence with a single Gibbs step (CD-1). The input data is a matrix with one row per visible unit (pixel) and one column per training case.

```r
# rbm_w is a matrix of size <number of hidden units> by <number of visible units>
# visible_state is a matrix of size <number of visible units> by <number of data cases>
# hidden_state is a binary matrix of size <number of hidden units> by <number of data cases>

# P(h = 1 | v): sigmoid of the weighted input to each hidden unit
visible_state_to_hidden_probabilities <- function(rbm_w, visible_state) {
  1 / (1 + exp(-rbm_w %*% visible_state))
}

# P(v = 1 | h): sigmoid of the weighted input to each visible unit
hidden_state_to_visible_probabilities <- function(rbm_w, hidden_state) {
  1 / (1 + exp(-t(rbm_w) %*% hidden_state))
}

# average goodness (negative energy) h' W v over the data cases
configuration_goodness <- function(rbm_w, visible_state, hidden_state) {
  out <- 0
  for (i in 1:dim(visible_state)[2]) {
    out <- out + t(hidden_state[, i]) %*% rbm_w %*% visible_state[, i]
  }
  out / dim(visible_state)[2]
}

# gradient of the average goodness with respect to the weights
configuration_goodness_gradient <- function(visible_state, hidden_state) {
  hidden_state %*% t(visible_state) / dim(visible_state)[2]
}

# draw 0/1 samples from a matrix of Bernoulli probabilities
sample_bernoulli <- function(mat) {
  dims <- dim(mat)
  matrix(rbinom(prod(dims), size = 1, prob = c(mat)), dims[1], dims[2])
}

# one step of contrastive divergence (CD-1)
cd1 <- function(rbm_w, visible_data) {
  visible_data <- sample_bernoulli(visible_data)
  H0 <- sample_bernoulli(visible_state_to_hidden_probabilities(rbm_w, visible_data))
  vh0 <- configuration_goodness_gradient(visible_data, H0)
  V1 <- sample_bernoulli(hidden_state_to_visible_probabilities(rbm_w, H0))
  H1 <- visible_state_to_hidden_probabilities(rbm_w, V1)
  vh1 <- configuration_goodness_gradient(V1, H1)
  vh0 - vh1
}
```

```r
rbm <- function(num_hidden, training_data, learning_rate, n_iterations,
                mini_batch_size = 100, momentum = 0.9, quiet = FALSE) {
  # This trains a model that's defined by a single matrix of weights.
  # num_hidden is the number of hidden units.
  # cd1 is a function that takes parameters rbm_w and visible_data and returns
  # the gradient (or approximate gradient in the case of CD-1) of the function
  # that we're maximizing. Note the contrast with the loss function that we saw
  # in PA3, which we were minimizing. The returned gradient is an array of the
  # same shape as the provided parameter.
  # This uses mini-batches, no weight decay, and no early stopping.
  # This returns the matrix of weights of the trained model.
  n <- dim(training_data)[2]
  p <- dim(training_data)[1]
  if (n %% mini_batch_size != 0) {
    stop("the number of training cases must be divisible by the mini_batch_size")
  }
  model <- (matrix(runif(num_hidden * p), num_hidden, p) * 2 - 1) * 0.1
  momentum_speed <- matrix(0, num_hidden, p)
  start_of_next_mini_batch <- 1
  for (iteration_number in 1:n_iterations) {
    if (!quiet) cat("Iter", iteration_number, "\n")
    mini_batch <- training_data[, start_of_next_mini_batch:(start_of_next_mini_batch + mini_batch_size - 1)]
    start_of_next_mini_batch <- (start_of_next_mini_batch + mini_batch_size) %% n
    gradient <- cd1(model, mini_batch)
    momentum_speed <- momentum * momentum_speed + gradient
    model <- model + momentum_speed * learning_rate
  }
  return(model)
}
```

I loaded the handwritten digit data that was given in the class. To train the RBM, use the syntax below.

```r
weights <- rbm(num_hidden = 30, training_data = train, learning_rate = .09,
               n_iterations = 5000, mini_batch_size = 100, momentum = 0.9)
```
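Once trained, the hidden probabilities can serve as learned features for an image; with the functions above that is just `visible_state_to_hidden_probabilities(weights, image)`. Here is a self-contained sketch where random stand-ins replace the trained weights and a real binarized image:

```r
# Self-contained sketch: compute hidden-unit activations for one image.
# Random stand-ins replace the trained `weights` and a real binarized digit.
sigmoid <- function(x) 1 / (1 + exp(-x))
w <- matrix(rnorm(30 * 256, sd = 0.1), 30, 256)  # stand-in for trained weights
v <- matrix(rbinom(256, 1, 0.5), 256, 1)         # stand-in binarized image
h <- sigmoid(w %*% v)                            # 30 feature activations in (0, 1)
```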

After training the weights, I can visualize them. Below is a plot of the strength of the weights going to each pixel. Each facet displays the weights going to/coming from one of the hidden units. I trained 30 units so that it would be easy to show them all on one plot. You can see that most of the hidden units correspond strongly to one digit or another. I think this means the model is finding the joint structure of the pixels: the pixels that make up a given digit are often darkened together.

```r
library(ggplot2)
library(reshape2)

mw <- melt(weights)
mw$Var3 <- floor((mw$Var2 - 1) / 16) + 1
mw$Var2 <- (mw$Var2 - 1) %% 16 + 1
mw$Var3 <- 17 - mw$Var3
ggplot(data = mw) +
  geom_tile(aes(Var2, Var3, fill = value)) +
  facet_wrap(~ Var1, nrow = 5) +
  scale_fill_continuous(low = 'white', high = 'black') +
  coord_equal() +
  labs(x = NULL, y = NULL, title = "Visualization of Weights") +
  theme(legend.position = "none")
```
