# Analyzing rtweet data with kerasformula

**TensorFlow for R**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This is guest post contributed by Pete Mohanty, creator of the kerasformula package.

## Overview

The kerasformula package offers a high-level interface for the R interface to Keras. It’s main interface is the `kms`

function, a regression-style interface to `keras_model_sequential`

that uses formulas and sparse matrices.

The kerasformula package is available on CRAN, and can be installed with:

# install the kerasformula package install.packages("kerasformula") # or devtools::install_github("rdrr1990/kerasformula") library(kerasformula) # install the core keras library (if you haven't already done so) # see ?install_keras() for options e.g. install_keras(tensorflow = "gpu") install_keras()

## The kms() function

Many classic machine learning tutorials assume that data come in a relatively homogenous form (e.g., pixels for digit recognition or word counts or ranks) which can make coding somewhat cumbersome when data is contained in a heterogenous data frame. `kms()`

takes advantage of the flexibility of R formulas to smooth this process.

`kms`

builds dense neural nets and, after fitting them, returns a single object with predictions, measures of fit, and details about the function call. `kms`

accepts a number of parameters including the loss and activation functions found in `keras`

. `kms`

also accepts compiled `keras_model_sequential`

objects allowing for even further customization. This little demo shows how `kms`

can aid is model building and hyperparameter selection (e.g., batch size) starting with raw data gathered using `library(rtweet)`

.

Let’s look at #rstats tweets (excluding retweets) for a six-day period ending January 24, 2018 at 10:40. This happens to give us a nice reasonable number of observations to work with in terms of runtime (and the purpose of this document is to show syntax, not build particularly predictive models).

rstats <- search_tweets("#rstats", n = 10000, include_rts = FALSE) dim(rstats) [1] 2840 42

Suppose our goal is to predict how popular tweets will be based on how often the tweet was retweeted and favorited (which correlate strongly).

cor(rstats$favorite_count, rstats$retweet_count, method="spearman") [1] 0.7051952

Since few tweeets go viral, the data are quite skewed towards zero.

## Getting the most out of formulas

Let’s suppose we are interested in putting tweets into categories based on popularity but we’re not sure how finely-grained we want to make distinctions. Some of the data, like `rstats$mentions_screen_name`

comes in a list of varying lengths, so let’s write a helper function to count non-NA entries.

n <- function(x) { unlist(lapply(x, function(y){length(y) - is.na(y[1])})) }

Let’s start with a dense neural net, the default of `kms`

. We can use base R functions to help clean the data–in this case, `cut`

to discretize the outcome, `grepl`

to look for key words, and `weekdays`

and `format`

to capture different aspects of the time the tweet was posted.

breaks <- c(-1, 0, 1, 10, 100, 1000, 10000) popularity <- kms(cut(retweet_count + favorite_count, breaks) ~ screen_name + source + n(hashtags) + n(mentions_screen_name) + n(urls_url) + nchar(text) + grepl('photo', media_type) + weekdays(created_at) + format(created_at, '%H'), rstats) plot(popularity$history) + ggtitle(paste("#rstat popularity:", paste0(round(100*popularity$evaluations$acc, 1), "%"), "out-of-sample accuracy")) + theme_minimal() popularity$confusion

popularity$confusion (-1,0] (0,1] (1,10] (10,100] (100,1e+03] (1e+03,1e+04] (-1,0] 37 12 28 2 0 0 (0,1] 14 19 72 1 0 0 (1,10] 6 11 187 30 0 0 (10,100] 1 3 54 68 0 0 (100,1e+03] 0 0 4 10 0 0 (1e+03,1e+04] 0 0 0 1 0 0

The model only classifies about 55% of the out-of-sample data correctly and that predictive accuracy doesn’t improve after the first ten epochs. The confusion matrix suggests that model does best with tweets that are retweeted a handful of times but overpredicts the 1-10 level. The `history`

plot also suggests that out-of-sample accuracy is not very stable. We can easily change the breakpoints and number of epochs.

breaks <- c(-1, 0, 1, 25, 50, 75, 100, 500, 1000, 10000) popularity <- kms(cut(retweet_count + favorite_count, breaks) ~ n(hashtags) + n(mentions_screen_name) + n(urls_url) + nchar(text) + screen_name + source + grepl('photo', media_type) + weekdays(created_at) + format(created_at, '%H'), rstats, Nepochs = 10) plot(popularity$history) + ggtitle(paste("#rstat popularity (new breakpoints):", paste0(round(100*popularity$evaluations$acc, 1), "%"), "out-of-sample accuracy")) + theme_minimal()

That helped some (about 5% additional predictive accuracy). Suppose we want to add a little more data. Let’s first store the input formula.

pop_input <- "cut(retweet_count + favorite_count, breaks) ~ n(hashtags) + n(mentions_screen_name) + n(urls_url) + nchar(text) + screen_name + source + grepl('photo', media_type) + weekdays(created_at) + format(created_at, '%H')"

Here we use `paste0`

to add to the formula by looping over user IDs adding something like:

grepl("12233344455556", mentions_user_id) mentions <- unlist(rstats$mentions_user_id) mentions <- unique(mentions[which(table(mentions) > 5)]) # remove infrequent mentions mentions <- mentions[!is.na(mentions)] # drop NA for(i in mentions) pop_input <- paste0(pop_input, " + ", "grepl(", i, ", mentions_user_id)") popularity <- kms(pop_input, rstats)

That helped a touch but the predictive accuracy is still fairly unstable across epochs…

## Customizing layers with kms()

We could add more data, perhaps add individual words from the text or some other summary stat (`mean(text %in% LETTERS)`

to see if all caps explains popularity). But let’s alter the neural net.

The `input.formula`

is used to create a sparse model matrix. For example, `rstats$source`

(Twitter or Twitter-client application type) and `rstats$screen_name`

are character vectors that will be dummied out. How many columns does it have?

popularity$P [1] 1277

Say we wanted to reshape the layers to transition more gradually from the input shape to the output.

popularity <- kms(pop_input, rstats, layers = list(units = c(1024, 512, 256, 128, NA), activation = c("relu", "relu", "relu", "relu", "softmax"), dropout = c(0.5, 0.45, 0.4, 0.35, NA)))

`kms`

builds a `keras_sequential_model()`

, which is a stack of linear layers. The input shape is determined by the dimensionality of the model matrix (`popularity$P`

) but after that users are free to determine the number of layers and so on. The `kms`

argument `layers`

expects a list, the first entry of which is a vector `units`

with which to call `keras::layer_dense()`

. The first element the number of `units`

in the first layer, the second element for the second layer, and so on (`NA`

as the final element connotes to auto-detect the final number of units based on the observed number of outcomes). `activation`

is also passed to `layer_dense()`

and may take values such as `softmax`

, `relu`

, `elu`

, and `linear`

. (`kms`

also has a separate parameter to control the optimizer; by default `kms(... optimizer = 'rms_prop')`

.) The `dropout`

that follows each dense layer rate prevents overfitting (but of course isn’t applicable to the final layer).

## Choosing a Batch Size

By default, `kms`

uses batches of 32. Suppose we were happy with our model but didn’t have any particular intuition about what the size should be.

Nbatch <- c(16, 32, 64) Nruns <- 4 accuracy <- matrix(nrow = Nruns, ncol = length(Nbatch)) colnames(accuracy) <- paste0("Nbatch_", Nbatch) est <- list() for(i in 1:Nruns){ for(j in 1:length(Nbatch)){ est[[i]] <- kms(pop_input, rstats, Nepochs = 2, batch_size = Nbatch[j]) accuracy[i,j] <- est[[i]][["evaluations"]][["acc"]] } } colMeans(accuracy) Nbatch_16 Nbatch_32 Nbatch_64 0.5088407 0.3820850 0.5556952

For the sake of curtailing runtime, the number of epochs was set arbitrarily short but, from those results, 64 is the best batch size.

## Making predictions for new data

Thus far, we have been using the default settings for `kms`

which first splits data into 80% training and 20% testing. Of the 80% training, a certain portion is set aside for validation and that’s what produces the epoch-by-epoch graphs of loss and accuracy. The 20% is only used at the end to assess predictive accuracy. But suppose you wanted to make predictions on a new data set…

popularity <- kms(pop_input, rstats[1:1000,]) predictions <- predict(popularity, rstats[1001:2000,]) predictions$accuracy [1] 0.579 # predictions$confusion

Because the formula creates a dummy variable for each screen name and mention, any given set of tweets is all but guaranteed to have different columns. `predict.kms_fit`

is an `S3 method`

that takes the new data and constructs a (sparse) model matrix that preserves the original structure of the training matrix. `predict`

then returns the predictions along with a confusion matrix and accuracy score.

If your newdata has the same observed levels of y and columns of x_train (the model matrix), you can also use `keras::predict_classes`

on `object$model`

.

## Using a compiled Keras model

This section shows how to input a model compiled in the fashion typical to `library(keras)`

, which is useful for more advanced models. Here is an example for `lstm`

analogous to the imbd with Keras example.

k <- keras_model_sequential() k %>% layer_embedding(input_dim = popularity$P, output_dim = popularity$P) %>% layer_lstm(units = 512, dropout = 0.4, recurrent_dropout = 0.2) %>% layer_dense(units = 256, activation = "relu") %>% layer_dropout(0.3) %>% layer_dense(units = 8, # number of levels observed on y (outcome) activation = 'sigmoid') k %>% compile( loss = 'categorical_crossentropy', optimizer = 'rmsprop', metrics = c('accuracy') ) popularity_lstm <- kms(pop_input, rstats, k)

## Questions? Comments?

Drop me a line via the project’s Github repo. Special thanks to @dfalbel and @jjallaire for helpful suggestions!!

**leave a comment**for the author, please follow the link and comment on their blog:

**TensorFlow for R**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.