The most wonderful and most frustrating characteristic of the Internet is its excessive supply of content. As a result, many of today's commercial giants are not content providers, but content distributors. The success of companies such as Amazon, Netflix, YouTube and Spotify relies on their ability to effectively deliver relevant and novel content to users. However, with such a vast array of content at their fingertips, the search space becomes near impossible to navigate with traditional search methods. It is therefore essential for businesses to exploit the data at their disposal to find similarities between products and user behaviours, in order to make relevant recommendations to users.
The importance of this is further emphasised by phenomena such as the Long Tail, a term popularised by Chris Anderson's iconic 2004 blog post. This refers to the fact that a large percentage of online distributors' revenue comes from the sale of less popular items, for which they are able to find a market thanks to their recommendation engines. "If the Amazon statistics are any guide, the market for books that are not even sold in the average bookstore is larger than the market for those that are".
Another interesting example is Spotify, a company which invests heavily in recommendation, since one of their selling points is their ability to build perfectly curated playlists for individual users. A lesser-known ulterior motive of Spotify's recommendations is their need to reduce their licensing costs, which are currently growing at a faster rate than their revenue. By recommending relevant songs by emerging artists, for which Spotify pays lower licensing fees, the company can reduce their average cost per listen. Similarly, any business with a large product range might find a recommendation engine useful for identifying which products to push to certain customers.
Figure 1 – Anatomy of the Long Tail (https://wired.com/2004/10/tail/)
Now, I hear you cry, "what considerations does one have to make when building a recommender system?" Well, I'm glad you asked!
First and foremost, we are trying to solve a business need. Even if an algorithm perfectly predicts a user's movie rating, this might not translate into an improvement in a higher-level business metric. What are we optimising for? User retention? An increase in sales? How many items the average user has bookmarked or purchased? How many recommended items users have clicked on? These goals will vary across business contexts. Even with a well-defined business objective, these are observations we can only make after a model has been trained and deployed, and many successive iterations of A/B testing will be required to establish the usefulness of a model.
To add further complexity, it is possible that some users respond better to one type of model over another. Then the question arises: do we use varying algorithms for different user profiles, and how do we identify those profiles? This is where a weighted hybrid recommender system might come in. A more adventurous user might prefer more exploratory recommendations, whereas a conservative user may only respond to recommendations which closely relate to their browsing history. How do we balance customer satisfaction with the need to push new content on them? A content distributor may be satisfied with suiting the needs of their customer, whereas a content provider may also want to increase the sale of their less popular items, alongside increasing their customer retention.
Furthermore, we need to determine whether the operational costs of developing and maintaining an advanced recommender system are worth the potentially marginal improvements in content suggestions. Aside from the cost of hiring researchers and engineers, there can also be large costs associated with training an advanced recommendation engine in the cloud, such as those of Amazon or Spotify. As the size of the user base and item database increases, so will the operational costs. An algorithm which has to compare an item against the whole user database to make a recommendation (such as memory-based collaborative filtering) is not as scalable as one which uses item properties and metadata to identify similar items (such as content-based recommendation). However, it might also be that a more complex algorithm (such as matrix factorization, popularised by the Netflix Prize) would be able to extract better features from the data, with the caveat of requiring much more time and compute power to train.
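To make that scalability contrast a little more concrete, here is a minimal Python sketch with toy data (all ratings, users and features are made up for illustration; this is not any particular library's implementation): memory-based collaborative filtering must touch every other user per prediction, while a content-based approach only compares item feature vectors.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors (0 if either is all-zero)."""
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return 0.0 if na == 0 or nb == 0 else float(a @ b / (na * nb))

# Toy ratings matrix: rows = users, columns = items, 0 = unobserved.
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4]], dtype=float)

def user_cf_score(user, item):
    """Memory-based CF: weight every other user's rating of `item`
    by that user's similarity to `user`: O(#users) work per prediction."""
    sims = [(cosine(R[user], R[u]), R[u, item])
            for u in range(len(R)) if u != user and R[u, item] > 0]
    den = sum(abs(s) for s, _ in sims)
    return sum(s * r for s, r in sims) / den if den else 0.0

# Content-based: each item carries a metadata vector, so similar items
# can be found without touching the user base at all.
item_features = np.array([[1, 0],   # item 0: genre A
                          [1, 0],   # item 1: genre A
                          [0, 1],   # item 2: genre B
                          [0, 1]])  # item 3: genre B

def most_similar_item(item):
    sims = [cosine(item_features[item], item_features[j])
            for j in range(len(item_features))]
    sims[item] = -1.0  # exclude the item itself
    return int(np.argmax(sims))
```

Nothing here is production-grade, but it shows why the per-prediction cost of the first approach grows with the user base while the second depends only on the item catalogue.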
This highlights the importance of clearly defining business goals and evaluation metrics before launching into a venture such as this, since A/B testing might reveal that an expensive recommender system which offers marginally better recommendations might not have any better impact on the bottom line than a simple one.
Figure 2 – Data Sparsity (https://ebaytech.berlin/deep-learning-for-recommender-systems-48c786a20e1a)
The second major challenge we face is the sparsity of our dataset. The average user's activity only provides a limited amount of data about their likes and dislikes. The biggest mistake we can make is to assume that a user who has not clicked on or rated an item necessarily dislikes it. The more likely explanation is that the user has not yet discovered it. As a result, missing values need to be ignored, rather than included as dislikes or 0 ratings. However, this results in a very sparse dataset in which users have only interacted with a fraction of the available items. This leads to a few issues – can we guarantee that we have a full picture of this user? How do we make predictions for a new user, for whom we have no data available? This is also known as the "cold start" problem. Potential solutions include recommending the most popular items (the YouTube and Amazon home pages), user inductions which request information from a new user (as on Reddit or Quora), or extracting metadata from items to compare them, as Spotify does.
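The first of those fallbacks is simple enough to sketch in a few lines. Below is a minimal, illustrative Python example (toy interaction log, hypothetical user and item names) of recommending globally popular items to a user with no interaction history:

```python
from collections import Counter

# Toy interaction log of (user, item) pairs; all names are made up.
interactions = [("ann", "i1"), ("ann", "i2"), ("bob", "i1"),
                ("bob", "i3"), ("cat", "i1"), ("cat", "i2")]

def recommend(user, k=2):
    """Popularity fallback for the cold-start problem: a user with no
    (or little) history gets the k most-interacted-with items they
    have not already seen."""
    seen = {item for u, item in interactions if u == user}
    popularity = Counter(item for _, item in interactions)
    return [item for item, _ in popularity.most_common()
            if item not in seen][:k]
```

A brand-new user simply receives the overall top sellers, while a known user is never re-recommended something they have already interacted with.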
For this reason, implicit feedback is oftentimes preferred. This refers to the use of data such as number of clicks, shares and streaming time. The advantage of this over explicit feedback is that it allows businesses to collect more data on their users, who may otherwise be unwilling to give explicit ratings. It also removes any potential bias towards users who may be particularly expressive of their opinions but do not represent the majority.
However, implicit feedback brings its own set of problems. Whereas a 5-star rating has a predetermined scale, which allows us to adjust for any bias from users who are more critical or complimentary than average, implicit feedback is more difficult to deal with. How do we determine the relative value between a click, a like or a sale? In addition, how do we deal with data from a user who may have listened to their favourite song 99 times but also has a special place in their heart for that song they only listen to once a month?
Some algorithms simply ignore values such as play count and transform them into binary 1s and 0s, whereas others use them as a confidence metric for how much a user likes an item. This may be part of the reason why YouTube and Netflix have switched to a like/dislike system rather than 5-star ratings: likes see far higher usage than 5-star ratings, yet often convey just as much information to a recommender system.
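The two treatments of play counts mentioned above can be sketched as follows. This is an illustrative Python snippet with made-up numbers; the confidence formula 1 + alpha * count follows the common implicit-feedback formulation rather than any specific product's implementation:

```python
# Raw implicit feedback: how often a user played each song (toy numbers).
play_counts = {"song_a": 99, "song_b": 1, "song_c": 0}

# Treatment 1: binarize -- any interaction at all becomes a 1.
binary = {song: int(count > 0) for song, count in play_counts.items()}

# Treatment 2: keep the count as a confidence weight on the (binary)
# preference, via the common implicit-feedback form c = 1 + alpha * count
# (alpha is a tuning parameter; 0.5 here is an arbitrary choice).
alpha = 0.5
confidence = {song: 1.0 + alpha * count
              for song, count in play_counts.items()}
```

Under the first treatment the 99-play favourite and the once-a-month gem look identical; under the second, the model still treats both as "liked" but is far more confident about the favourite.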
In follow-up posts, I will explore the different types of recommender systems, followed by an implementation of these using recent technologies such as PyTorch.
Today is the first day of the new academic year at the University of Utah. This semester I am teaching MATH 3070: Applied Statistics I, the fourth time I’ve taught this course.
This is the first semester where I feel like I actually am fully, 100% prepared to teach this class. I’ve taught MATH 1070: Introduction to Statistical Inference many times and got comfortable with teaching what I call “Statistics If You Don’t Like Math”, which is a terminal math course. MATH 3070 is “Statistics If You Do Like Math” and covers way more material. I struggled with the pacing the first two times I taught the course, so I’m glad I think I finally have that pacing down.
I just finished the public web page for the class that includes all the material (aside from stuff students have to buy, like the textbooks) I will be using for the class. There are three parts of this page that I’m excited to share.
First, there’s the lecture notes. I wrote the bulk of these notes in the spring semester, using R Markdown and the tufte package for Tufte-style handouts. These notes are half-filled notes meant to accompany my lectures. In response to feedback, I no longer use chalk but give these handouts to students and fill them out on my laptop (which has a touch screen; I use a compatible pen) which has its desktop projected behind me so the students can follow along. This greatly improves the flow of the class; no stopping to write long definitions!
The notes are meant to accompany the textbook, Jay Devore’s Probability and Statistics for Engineering and the Sciences, but with my own thoughts and examples, along with accompanying R code. Students can not only see the mathematics but also how these procedures can be done in R on a computer. Since R programming is an important skill the students will need to develop in the class, this addition should improve the course overall.
The chapter notes are available in parts, but I recently used bookdown to combine all the notes into one omnibus document, available here.
As these notes were written to accompany a textbook they are not meant to stand alone, though the enterprising instructor could possibly use my notes (without using Devore’s good book) and fill them in for their class, treating them as a major part of class materials.
Next, there’s the lab lecture notes. As mentioned above, R programming is an important skill I hope to develop in my students, so the class comes with an R programming lab (not taught by me, though I have taught it before) that teaches students (presumed to be programming novices) about R and programming. I wrote lecture notes to accompany the R lab textbook, John Verzani’s Using R for Introductory Statistics. This was just in R Markdown before, but I now have a bookdown version that is publicly available and more easily used than the collection of HTML documents I had before.
These notes come in two versions. There’s the summer semester version, which is the original version. These notes were written for an eight-week intensive schedule, and thus are divided into eight lectures. These notes were also written when I both taught the lecture and the lab at the same time, thus giving me perfect coordination between the two sections. Then there’s the regular semester version. This version was written for a 14-week course. I divided the summer schedule lectures and also added new lectures not present before (on tidyverse packages and Bayesian statistics) to slow down the lab to keep pace with the lecture course (not taught by me at the time); thus, these lectures include strictly more material. Unlike the earlier lecture notes (which must be in PDF format since white space is a crucial part of the notes), these notes come in both online and PDF versions, for both good online access and to have something printable.
All the source materials for these notes are publicly available too in this archive, should you desire to modify them or at least see how they were made (but if you do modify them, please be sure to cite me).
I submitted the summer lab lecture notes to the bookdown contest. While the book may not seem innovative to those who are familiar with bookdown, I feel like its existence is a major innovation, and I’m proud of it.
Finally, there’s StatTrainer. This is a Shiny app that I originally wrote for MATH 1070, but I think is still useful for MATH 3070. It’s an app that generates random statistics problems for students covering confidence intervals and hypothesis testing. This is to help aid study, giving students infinite practice problems. This app can be started from the command line on *NIX systems (see the webpage for instructions). I’m proud of this app and have mused on making a package based around it, not only implementing that specific app but also providing a framework for modifying it.
Hopefully someone out there, from a student or autodidact to an instructor or package author, finds this material useful. I worked hard on it (I’m shocked at how many pages I’ve apparently written in notes), and I can’t wait to see how the semester plays out with it.
I have created a video course published by Packt Publishing entitled Applications of Statistical Learning with Python, the fourth volume in a four-volume set of video courses entitled, Taming Data with Python; Excelling as a Data Analyst. This course discusses how to use Python for data science, emphasizing application. It starts with introducing natural language processing and computer vision. It ends with two case studies; in one, I train a classifier to detect spam e-mail, while in the other, I train a computer vision system to detect emotions in faces. Viewers get a hands-on experience using Python for machine learning. If you are starting out using Python for data analysis or know someone who is, please consider buying my course or at least spreading the word about it. You can buy the course directly or purchase a subscription to Mapt and watch it there.
If you like my blog and would like to support it, spread the word (if not get a copy yourself)! Also, stay tuned for future courses I publish with Packt at the Video Courses section of my site.
This is part of a series of posts on the BooST (Boosting Smooth Trees). If you missed the post introducing the model, click here, and if you want to see the full article, click here. The BooST is a model that uses Smooth Trees as base learners, which makes it possible to approximate the derivative of the underlying model. In this post, we will show some examples, on generated data, of how the BooST approximates derivatives, and we will also discuss why the BooST may be a good choice when dealing with smooth functions compared to the usual discrete regression trees.
To install the BooST package in R you should run the following code:
library(devtools)
install_github("gabrielrvsc/BooST")
Note that this R implementation is suited only for small problems. If you want to use the BooST on larger instances with more speed, we recommend the Julia implementation until the C++ version is ready. The Julia package can be installed with:
Pkg.clone("https://github.com/gabrielrvsc/BooSTjl.jl")
and then loaded by running using BooSTjl in your Julia terminal. Both packages have documentation for all exported functions.
The first example is the one we briefly discussed in the previous post. We are going to generate data from:
y = cos(π(x₁ + x₂)) + ε, where x₁ ~ N(0, 1), x₂ is Bernoulli with p = 0.5, and ε is Gaussian noise with variance chosen so that the dgp attains a given R². The function used to generate the data is:
dgp = function(N, r2) {
  X = matrix(rnorm(N * 2, 0, 1), N, 2)
  X[, ncol(X)] = base::sample(c(0, 1), N, replace = TRUE)
  yaux = cos(pi * (rowSums(X)))
  vyaux = var(yaux)
  ve = vyaux * (1 - r2) / r2     # noise variance implied by the target R2
  e = rnorm(N, 0, sqrt(ve))
  y = yaux + e
  return(list(y = y, X = X))
}
We are going to generate 1000 observations with an R2 of 0.3 (a lot of noise). The code below generates the data and runs the BooST and the Boosting using the xgboost package. We estimated 300 trees in each model with a step of 0.2. The last lines in the code just organize the results in a data.frame and generate values for the real function we want to recover.
library(BooST)
library(tidyverse)
library(xgboost)
library(reshape2)

set.seed(1)
data = dgp(N = 1000, r2 = 0.3)
y = data$y
x = data$X

set.seed(1)
BooST_Model = BooST(x, y, v = 0.2, M = 300, display = TRUE)
xgboost_Model = xgboost(x, label = y, nrounds = 300,
                        params = list(eta = 0.2, max_depth = 3))

x1r = rep(seq(-4, 4, length.out = 1000), 2)
x2r = c(rep(0, 1000), rep(1, 1000))
yr = cos(pi * (x1r + x2r))
real_function = data.frame(x1 = x1r, x2 = as.factor(x2r), y = yr)
fitted = data.frame(x1 = x[, 1], x2 = as.factor(x[, 2]),
                    BooST = fitted(BooST_Model),
                    xgboost = predict(xgboost_Model, x), y = y)
Before going into the results, let's have a look at the data in the figure below. The two black lines are the real cosine function we used and the dots are the data we generated. At first glance it seems hard to recover the real function from this data.
ggplot() +
  geom_point(data = fitted, aes(x = x1, y = y, color = x2)) +
  geom_line(data = real_function, aes(x = x1, y = y, linetype = x2))
The next figure shows the result of regular boosting using the xgboost package. The points in blue and red are the fitted values and the points in gray are the data from the previous plot. All the real structure in the data comes from the cosine function represented by the two black lines. The main conclusion here is that we over-fitted the data.
ggplot() +
  geom_point(data = fitted, aes(x = x1, y = y), color = "gray") +
  geom_point(data = fitted, aes(x = x1, y = xgboost, color = x2)) +
  geom_line(data = real_function, aes(x = x1, y = y, linetype = x2))
The next plot shows what we obtained with the BooST. Again, red and blue points are fitted values and the data is in gray. The model fits the function very well with a few exceptions on extreme points where we have much less data.
ggplot() +
  geom_point(data = fitted, aes(x = x1, y = y), color = "gray") +
  geom_point(data = fitted, aes(x = x1, y = BooST, color = x2)) +
  geom_line(data = real_function, aes(x = x1, y = y, linetype = x2))
Next, let’s have a look at the derivatives. The code below estimates them and organizes the results for the plot.
BooST_derivative = estimate_derivatives(BooST_Model, x, 1)
derivative = data.frame(x1 = x[, 1], x2 = as.factor(x[, 2]),
                        derivative = BooST_derivative)
dr = -1 * sin(pi * (x1r + x2r)) * pi
real_function$derivative = dr
The results are in the next plot. As we can see, the model also estimates the derivatives very well. However, the performance deteriorates as we move towards the borders, where we have less data. This is a natural feature of many nonparametric models, which are more precise where the data is denser.
ggplot() +
  geom_point(data = derivative, aes(x = x1, y = derivative, color = x2)) +
  geom_line(data = real_function, aes(x = x1, y = derivative, linetype = x2))
Finally, let's generate new data from the same dgp and see how the BooST and boosting perform out of sample. The output of the following code is the BooST RMSE divided by the boosting RMSE.
set.seed(2)
dataout = dgp(N = 1000, r2 = 0.3)
yout = dataout$y
xout = dataout$X
p_BooST = predict(BooST_Model, xout)
p_xgboost = predict(xgboost_Model, xout)
sqrt(mean((p_BooST - yout)^2)) / sqrt(mean((p_xgboost - yout)^2))
## [1] 0.9165293
In the previous example we only had two variables. What happens if we add more? In the next example we are going to generate data from a dgp with 10 variables, where the function we are interested in comes from all second order interactions with all variables:
y = f(x) + ε, where f(x) = Σᵢ Σⱼ≥ᵢ xᵢxⱼ (the sum of all second-order interactions, including squares),
with all variables x₁, …, x₁₀ generated from a standard Normal distribution and ε Gaussian noise scaled to the desired R². The difficulty comes from the fact that we have too many interactions. Nevertheless, let's see how the BooST recovers derivatives in this setup. Since we now have 10 interacting variables, a plot like the one we made with the cosine is no longer possible. A good way to make this example more visual is to keep all variables fixed except one and look at the derivative as we move along that variable. However, this strategy needs a lot of data to work, because we may be calculating derivatives in parts of the space that were poorly mapped. First, the dgp function:
dgp2 = function(N, k, r2) {
  yaux = rep(0, N)
  x = matrix(rnorm(N * k), N, k)
  for (i in 1:k) {
    for (j in i:k) yaux = yaux + x[, i] * x[, j]
  }
  vyaux = var(yaux)
  ve = vyaux * (1 - r2) / r2     # noise variance implied by the target R2
  e = rnorm(N, 0, sqrt(ve))
  y = yaux + e
  return(list(y = y, x = x))
}
Now let's estimate the model. Given the complexity of this model, we adopted a more conservative strategy, with a step of 0.1 and smaller values for gamma (which controls the transition of the logistic function). These adjustments may require more trees to converge; therefore, we used M = 1000 trees.
set.seed(1)
data = dgp2(1000, 10, 0.7)
x = data$x
y = data$y
set.seed(1)
BooSTmv = BooST(x, y, v = 0.1, M = 1000, display = TRUE,
                gamma = seq(0.5, 1.5, 0.01))
The next step is to put the data in the form we need to calculate the derivative. We are going to look at the derivative of f(x) with respect to the first variable, keeping all other variables at their mean, which is 0. The solution in this case is 2x₁ for the derivative.
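As a quick check of that closed form, differentiating the sum of all second-order interactions from the dgp gives:

```latex
f(x) = \sum_{i=1}^{10} \sum_{j=i}^{10} x_i x_j
\qquad\Longrightarrow\qquad
\frac{\partial f}{\partial x_1} = 2 x_1 + \sum_{j=2}^{10} x_j ,
```

which reduces to 2x₁ once every variable other than x₁ is held at its mean of 0 (the squared term x₁² contributes 2x₁, and each interaction x₁xⱼ contributes xⱼ).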
xrep = x
xrep[, 2:ncol(x)] = 0
derivative_mv = estimate_derivatives(BooSTmv, xrep, 1)
df = data.frame(x1 = x[, 1], derivative = derivative_mv)
xr1 = seq(-4, 4, 0.01)
dfr = data.frame(x1 = xr1, derivative = 2 * xr1)
Finally, the results. The black line is the real derivative and the blue dots are our estimates. Given the complexity of the problem, the results are very good. The blue dots are always close to the black line, except for some extreme values where we have less data.
ggplot() +
  geom_point(data = df, aes(x = x1, y = derivative), color = "blue") +
  geom_line(data = dfr, aes(x = x1, y = derivative)) +
  xlim(-4, 4)
FOAAS upstream recently went to release 2.0.0, so here we are catching up, bringing you all the new accessors from FOAAS 2.0.0: bag(), equity(), fts(), ing(), particular(), ridiculous(), and shit(). We also added off_with(), which was missing previously. Documentation and tests were updated. The screenshot shows an example of the new functions.
As usual, CRANberries provides a diff to the previous CRAN release. Questions, comments, etc. should go to the GitHub issue tracker. More background information is on the project page as well as on the GitHub repo.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.
The rvest package allows for simple and convenient extraction of data from the web into R, which is often called “web scraping.” Web scraping is a basic and important skill that every data analyst should master. You’ll often see it as a job requirement.
In the following exercises, you will practice your scraping skills on the “Money” section of the CNN website. All of the main functions of the rvest package will be used. Answers to these exercises are available here.
Since websites are constantly changing, some of the solutions might grow to be outdated with time. If this is the case, you are welcome to inform the author and the relevant sections will be updated.
Exercise 1
Read the HTML content of the following URL into a variable called webpage:
https://money.cnn.com/data/us_markets/
At this point, it will also be useful to open this web page in your browser.
Exercise 2
Get the session details (status, type, size) of the above mentioned URL.
Exercise 3
Extract all of the sector names from the “Stock Sectors” table (bottom left of the web page).
Exercise 4
Extract all of the “3 Month % Change” values from the “Stock Sectors” table.
Exercise 5
Extract the table “What’s Moving” (top middle of the web page) into a data-frame.
Exercise 6
Re-construct all of the links from the first column of the “What’s Moving” table.
Hint: the base URL is “https://money.cnn.com”
Exercise 7
Extract the titles under the “Latest News” section (bottom middle of the web page).
Exercise 8
To understand the structure of the data in a web page, it is often useful to know what the underlying attributes are of the text you see.
Extract the attributes (and their values) of the HTML element that holds the timestamp underneath the “What’s Moving” table.
Exercise 9
Extract the values of the blue percentage-bars from the “Trending Tickers” table (bottom right of the web page).
Hint: in this case, the values are stored under the “class” attribute.
Exercise 10
Get the links of all of the “svg” images on the web page.
This work is still in progress. I think, however, that it can already resonate with some people in the community. The communication I am hoping for should lead to a better design and maybe to getting valuable tools faster.
The main goal is to extend base R's history mechanism (see ?history), which currently gives access to past commands run in R. What if, however, we could browse not only the commands but also the objects (artifacts)? Hence, the repository of artifacts.
It is implemented by a number of packages. The two most important are repository, which provides the basic logic of storing, processing and retrieving artifacts, and ui, which implements a basic, text-only user interface and hooks callbacks into R. The other packages are storage, defer and utilities.
Here are the basic rules of how the repository of artifacts works: the state of the R session after each command is examined, and all R objects and plots are recorded, together with information about their origin (parent objects). Thus, the complete graph of origin of each artifact can be retrieved from the repository: the complete sequence of R commands and their byproduct artifacts. Further explanation can be found in the current motivation and plan for future work, and examples of working with the repository are presented in this tutorial.
There are a number of questions I hope to explore with those interested.
I’m sure we all have our own words we use way too often.
Text analysis can also be used to discover patterns in writing, and for a writer, may be helpful in discovering when we depend too much on certain words and phrases. For today’s demonstration, I read in my (still in-progress) novel – a murder mystery called Killing Mr. Johnson – and did the same type of text analysis I’ve been demonstrating in recent posts.
To make things easier, I copied the document into a text file, and used the read_lines and tibble functions to prepare data for my analysis.
setwd("~/Dropbox/Writing/Killing Mr. Johnson")
library(tidyverse)
KMJ_text <- read_lines('KMJ_full.txt')
KMJ <- tibble(KMJ_text) %>%
mutate(linenumber = row_number())
I kept my line numbers, which I could use in some future analysis. For now, I’m going to tokenize my data, drop stop words, and examine my most frequently used words.
library(tidytext)
KMJ_words <- KMJ %>%
unnest_tokens(word, KMJ_text) %>%
anti_join(stop_words)
KMJ_words %>%
count(word, sort = TRUE) %>%
filter(n > 75) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n)) +
geom_col() + xlab(NULL) + coord_flip()
Fortunately, my top 5 words are the names of the 5 main characters, with the star character at number 1: Emily is named almost 600 times in the book. It’s a murder mystery, so I’m not too surprised that words like “body” and “death” are also common. But I know that, in my fiction writing, I often depend on a word type that draws a lot of disdain from authors I admire: adverbs. Not all adverbs, mind you, but specifically (pun intended) the “-ly adverbs.”
ly_words <- KMJ_words %>%
filter(str_detect(word, ".ly")) %>%
count(word, sort = TRUE)
head(ly_words)
## # A tibble: 6 x 2
## word n
## <chr> <int>
## 1 emily 599
## 2 finally 80
## 3 quickly 60
## 4 emily’s 53
## 5 suddenly 39
## 6 quietly 38
Since my main character is named Emily, she was accidentally picked up by my string detect function. A few other top words also pop up in the list that aren’t actually -ly adverbs. I’ll filter those out then take a look at what I have left.
filter_out <- c("emily", "emily's", "emily’s", "family", "reply", "holy")
ly_words <- ly_words %>%
filter(!word %in% filter_out)
ly_words %>%
filter(n > 10) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n)) +
geom_col() + xlab(NULL) + coord_flip()
I use “finally”, “quickly”, and “suddenly” far too often. “Quietly” is also up there. I think the reason so many writers hate on adverbs is because it can encourage lazy writing. You might write that someone said something quietly or softly, but is there a better word? Did they whisper? Mutter? Murmur? Hiss? Did someone “move quickly” or did they do something else – run, sprint, dash?
At the same time, sometimes adverbs are necessary. I mean, can I think of a complete sentence that only includes an adverb? Definitely. Still, it might become tedious if I keep depending on the same words multiple times, and when a fiction book (or really any kind of writing) is tedious, we often give up. These results give me some things to think about as I edit.
Still have some big plans on the horizon, including some new statistics videos, a redesigned blog, and more surprises later! Thanks for reading!
Version 2.5-0 of the R package ‘sandwich’ is available from CRAN now with enhanced object-oriented clustered covariances (for lm, glm, survreg, polr, hurdle, zeroinfl, betareg, …). The software and corresponding vignette have been improved considerably based on helpful and constructive reviewer feedback as well as various bug reports.
Most of the improvements and new features pertain to clustered covariances which had been introduced to the sandwich package last year in version 2.4-0. For this my PhD student Susanne Berger and myself (= Achim Zeileis) teamed up with Nathaniel Graham, the maintainer of the multiwayvcov package. With the new version 2.5-0 almost all features from multiwayvcov have been ported to sandwich, mostly implemented from scratch along with generalizations, extensions, speed-ups, etc.
The full list of changes can be seen in the NEWS file. The most important changes are:
The manuscript vignette("sandwich-CL", package = "sandwich") has been significantly improved based on very helpful and constructive reviewer feedback. See also below.
The cluster argument for the vcov*() functions can now be a formula, simplifying its usage (see below). NA handling has been added as well.
Clustered bootstrap covariances have been reimplemented and extended in vcovBS(). A dedicated method for lm objects is considerably faster now and also includes various wild bootstraps.
Convenient parallelization for bootstrap covariances is now available.
Bugs reported by James Pustejovsky and Brian Tsay, respectively, have been fixed.
Susanne Berger, Nathaniel Graham, Achim Zeileis: Various Versatile Variances: An Object-Oriented Implementation of Clustered Covariances in R
Clustered covariances or clustered standard errors are very widely used to account for correlated or clustered data, especially in economics, political sciences, or other social sciences. They are employed to adjust the inference following estimation of a standard least-squares regression or generalized linear model estimated by maximum likelihood. Although many publications just refer to “the” clustered standard errors, there is a surprisingly wide variety of clustered covariances, particularly due to different flavors of bias corrections. Furthermore, while the linear regression model is certainly the most important application case, the same strategies can be employed in more general models (e.g. for zero-inflated, censored, or limited responses).
In R, functions for covariances in clustered or panel models have been somewhat scattered or available only for certain modeling functions, notably the (generalized) linear regression model. In contrast, an object-oriented approach to “robust” covariance matrix estimation – applicable beyond lm() and glm() – is available in the sandwich package but has been limited to the case of cross-section or time series data. Now, this shortcoming has been corrected in sandwich (starting from version 2.4-0): Based on methods for two generic functions (estfun() and bread()), clustered and panel covariances are now provided in vcovCL(), vcovPL(), and vcovPC(). Moreover, clustered bootstrap covariances, based on update() for models on bootstrap samples of the data, are provided in vcovBS(). These are directly applicable to models from many packages, e.g., including MASS, pscl, countreg, betareg, among others. Some empirical illustrations are provided as well as an assessment of the methods' performance in a simulation study.
To show how easily the clustered covariances from sandwich can be applied in practice, two short illustrations from the manuscript/vignette are used. In addition to the sandwich package, the lmtest package is employed to easily obtain Wald tests of all coefficients:
library("sandwich")
library("lmtest")
options(digits = 4)
First, a Poisson model with clustered standard errors from Aghion et al. (2013, American Economic Review) is replicated. To investigate the effect of institutional ownership on innovation (as captured by citation-weighted patent counts) they employ a (pseudo-)Poisson model with industry/year fixed effects and standard errors clustered by company, see their Table I(3):
data("InstInnovation", package = "sandwich")
ii <- glm(cites ~ institutions + log(capital/employment) + log(sales) + industry + year,
data = InstInnovation, family = poisson)
coeftest(ii, vcov = vcovCL, cluster = ~ company)[2:4, ]
## Estimate Std. Error z value Pr(>|z|)
## institutions 0.009687 0.002406 4.026 5.682e-05
## log(capital/employment) 0.482884 0.135953 3.552 3.826e-04
## log(sales) 0.820318 0.041523 19.756 7.187e-87
Second, a simple linear regression model with double-clustered standard errors is replicated using the well-known Petersen data from Petersen (2009, Review of Financial Studies):
data("PetersenCL", package = "sandwich")
p <- lm(y ~ x, data = PetersenCL)
coeftest(p, vcov = vcovCL, cluster = ~ firm + year)
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0297 0.0651 0.46 0.65
## x 1.0348 0.0536 19.32 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In addition to the description of the methods and the software, the manuscript/vignette also contains a simulation study that investigates the properties of clustered covariances. In particular, this assesses how well the methods perform in models beyond linear regression but also compares different types of bias adjustments (HC0-HC3) and alternative estimation techniques (generalized estimating equations, mixed effects).
The detailed results are presented in the manuscript – here we just show the results from one of the simulation experiments: The empirical coverage of 95% Wald confidence intervals is depicted for a beta regression, zero-inflated Poisson, and zero-truncated Poisson model. With increasing correlation within the clusters, the conventional “standard” errors and “basic” robust sandwich standard errors become too small, thus leading to a drop in empirical coverage. However, both clustered HC0 standard errors (CL-0) and clustered bootstrap standard errors (BS) perform reasonably well, leading to empirical coverages close to the nominal 0.95.
Details: Data sets were simulated with 100 clusters of 5 observations each. The cluster correlation (on the x-axis) was generated with a Gaussian copula. The only regressor had a correlation of 0.25 with the clustering variable. Empirical coverages were computed from 10,000 replications.