Using Optmatch and RItools for Observational Studies

[This article was first published on Mark M. Fredrickson, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I am a contributor to the optmatch and the RItools packages for R. These two packages are separate, but complimentary. Both packages provide tools for adjusting observational data to exhibit “balance” on observed covariates. In a randomized control trial, treatment and control groups should have identical distributions over all covariates, observed and unobserved. Matching provides a method to create smaller groups in an observational study that are similarly balanced. Balance can be quantified so that alternative matches can be compared. When an acceptable match has been found, analysis can then proceed as if nature provided a blocked, randomized study.


Both optmatch and RItools use a canonical dataset consisting of nuclear plants. From help(nuclearplants):

The data relate to the construction of 32 light water reactor (LWR) plants constructed in the U.S.A in the late 1960’s and early 1970’s. The data was collected with the aim of predicting the cost of construction of further LWR plants. 6 of the power plants had partial turnkey guarantees and it is possible that, for these plants, some manufacturers’ subsidies may be hidden in the quoted capital costs.

With these data, we may wish to know if certain variables lead to higher or lower construction costs. One particular variable is pr, an indicator if a previous lightwater reactor at the same location was present. Such an installation might significantly increase or decrease costs. The rest of this document uses matching and balance testing to provide an answer to just that question.

I start by loading the packages and the data:

> library(optmatch)
> library(RItools)
> data(nuclearplants)

Before getting into the matching process, lets take a quick look at the balance of the data on all variables, except cost and pr, the LWR indicator. xBalance, among other tests, provides an omnibus balance test across any number of variables. This test compares the null hypothesis of “the data are balanced” against the alternative hypothesis of a lack of balance, where balance is what we would expect in a randomized trial with the same sample size. The test follows a chi-squared distribution, which xBalance will happily report:

> xBalance(pr ~ . - (cost + pr),
data = nuclearplants,
report = c("chisquare.test"))

---Overall Test---
chisquare df p.value
unstrat 12.4 9 0.192
Signif. codes: 0 ‘***’ 0.001 ‘** ’ 0.01 ‘* ’ 0.05 ‘. ’ 0.1 ‘ ’ 1

With a reported p-value of 0.19, the balance of this sample is not terrible (by conventional levels of hypothesis testing), but we might prefer something closer to 1. While there is no a priori p-value we should prefer, experience indicates that p-values in the neighborhood of .5 are achievable and mimic true randomized designs (though optimal balance levels are a subject of ongoing research).


A full discussion of matching procedures is beyond the scope of this document (see Rosenbaum (2010) for a more comprehensive discussion). In brief, matching attempts to group units with similar covariates, as if they had been blocked in a randomized experiment. The optimal match would be two units identical on every variable, observed and unobserved. In most datasets, no two units will be identical on all observed covariates. Instead, we can use a measure that summarizes all covariates and match based on the summary. The propensity score, the probability of receiving treatment given the observed covariates, has been a popular summary measure (for more on the theory, see Rosenbaum and Rubin (1983)). I’ll use a logistic regression to estimate the propensity scores of my observations, using a subset of the available variables:

> model

With a propensity model, optmatch provides several functions for computing matched sets of observations. The fullmatch function takes a treatment by control matrix containing distances between observations and returns a factor indicating the set membership, if any, of all observations. Computing the distance matrix is simple using the mdist function. This function takes a linear model, a function, or a formula to produce distances based on propensity models, aribtrary user functions, or Mahalanobis distances between observations. We’ll use the propensity model. See the help page for mdist for the other alternatives.

m1 <- fullmatch(mdist(model))

We can compare the first match with a second, in which a caliper is placed on the date variable. This will constrain the matching algorithm, disallowing matches on observations with widely differing date values, even if the over all propensity scores are similar. Calipers can lead to poorer matches on observed variables but provide a method by which researchers can include subject matter information in the matching process. For example, if the cost of construction decreased over time due to increased efficiency in construction practices.

> m2

Balance Testing

With two possible matches, do either produce adequate balance? As noted previously, the RItools package provides a method of quantifying balance in a matched set. The method (discussed in detail in Hansen and Bowers (2008)) compares treatment and control units within each block on a difference of means for each variable. Combining the these differences follows a chi-squared distribution. We can compare all the matches at the same time, along with the raw data (see the strata argument).

> (allbalance

Both matches provide good balance. With a value of 0.821 we might be tempted to prefer the unconstrained match; however, with a p-value of 0.345, the match with a caliper also provides reasonable assurances of balance. As either provides plausible balance, researchers might choose to concentrate on substantively important covariates. When xBalance reports “std.diffs” (as above), we can plot the result to get a visual picture of balance on each covariate.

> plot(allbalance)

Covariate balance


Since we now have data that approximates a randomized experiment, we can use the same techniques to analyze this data as any blocked randomized experiment. For example, one-way ANOVA using pr as the treatment factor and m1 as the blocking factor.

> anova(lm(nuclearplants$cost ~ nuclearplants$pr + m1))
Analysis of Variance Table

Response: nuclearplants$cost
Df Sum Sq Mean Sq F value Pr(>F)
nuclearplants$pr 1 9037 9037 0.3394 0.5654
m1 5 222410 44482 1.6704 0.1785
Residuals 25 665726 26629

Under conventional levels, we do not observe either the treatment or the blocking factor reach statistical significance. So we can conclude that existing lightwater reactors do not have an effect on construction costs that we can differentiate from chance.


In the analysis, I chose one of two plausible matches. It so happened that I selected the match with the larger p-value. Does this indicate that we should select the match with the highest p-value, as it most closely approximates a randomly allocated treatment? I would caution against that conclusion. Within the set of matches that are plausibly balanced, it is difficult to argue that one match is truly better than another. While in expectation, randomized treatments are perfectly balanced, in pratice, small deviations should be expected (with fewer deviations in larger experimental populations).

In short, don’t sweat the small stuff. Find a reasonable match and go with it. In fact, you may find that matches with lower p-values provide interesting substantive results. Here is an analysis of the second match, which included a caliper on the date of construction:

> anova(lm(nuclearplants$cost ~ nuclearplants$pr + m2))
Analysis of Variance Table

Response: nuclearplants$cost
Df Sum Sq Mean Sq F value Pr(>F)
nuclearplants$pr 1 4185 4185 0.2035 0.65654
m2 8 396199 49525 2.4080 0.05098 .
Residuals 21 431905 20567
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

This matching indicates a significant blocking effect, which suggests that limiting matches by date may have something to do with the resulting costs. If we had blindly pursued higher p-value matches, we might not have observed this interesting result.

To leave a comment for the author, please follow the link and comment on their blog: Mark M. Fredrickson. offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)