by B. W. Lewis
This note warns about potentially misleading results when using the use=pairwise.complete.obs
and related options in R’s cor
and cov
functions. Pitfalls are illustrated using a very simple pathological example followed by a brief list of alternative ways to deal with missing data and some references about them.
Known unknowns
R includes excellent facilities for handling missing values for all native data types. Perhaps counterintuitively, marking a value as missing conveys the information that the value is not known. Donald Rumsfeld might call it a “known unknown^{1}.” Upon encountering a missing value, we can deal with it by simply omitting it, imputing it somehow, or through several other possible approaches. R does a good job of making our choice of missing value approach explicit.
The cov
and cor
functions in the R programming language include several options for dealing with missing data. The use="pairwise.complete.obs"
option is particularly confusing, and can easily lead to faulty comparisons. This note explains and warns against its use.
Consider the following tiny example:
(x = matrix(c(2,1,0,1,2,1.5,2,0,1,2,NA,NA,0,1,2),5))
## [,1] [,2] [,3]
## [1,] 2 1.5 NA
## [2,] 1 2.0 NA
## [3,] 0 0.0 0
## [4,] 1 1.0 1
## [5,] 2 2.0 2
The functions V=cov(x)
computes the symmetric covariance matrix V
with entries defined by the pairwise covariance of columns of x
,
where i=j=1,2,3
in this example. The function cor(x)
similarly computes the symmetric correlation matrix with entries defined by pairwise correlation of the columns of x
. For example:
cov(x)
## [,1] [,2] [,3]
## [1,] 2.5 0.0 NA
## [2,] 0.0 0.7 NA
## [3,] NA NA NA
cor(x)
## [,1] [,2] [,3]
## [1,] 1 0 NA
## [2,] 0 1 NA
## [3,] NA NA 1
Due to missing values in the third column of x
we know that we don’t know the covariance between x[,3]
and anything else. Thanks to an arguably questionable^{2} choice in R’s cov2cor
function, R reports that the correlation of x[,3]
with itself is one, but we don’t know the correlation between x[,3]
and the other columns.
The use="complete"
option is one way to deal with missing values. It simply removes rows of the matrix x
with missing observations. Since the columns of the third through fifth rows of our example matrix are all identical we expect perfect correlation across the board, and indeed:
cor(x, use="complete")
## [,1] [,2] [,3]
## [1,] 1 1 1
## [2,] 1 1 1
## [3,] 1 1 1
Reasonable people might question this approach. Deleting two observations has a huge effect on the correlation between x[,1]
and x[,2]
in this example mostly because of the large change in x[,1]
. The result really says that we should collect more observations!
Unknown knowns
The use="pairwise.complete.obs"
is an even less reasonable way to deal with missing values. When specified, R computes correlations for each pair of columns using vectors formed by omitting rows with missing values on a pairwise basis. Thus each column vector may vary depending on it’s pairing, resulting in correlation values that are not even comparable. Consider our simple example again:
cor(x, use="pairwise.complete.obs")
## [,1] [,2] [,3]
## [1,] 1 0 1
## [2,] 0 1 1
## [3,] 1 1 1
By this bizarre measurement, the correlation of x[,1]
and x[,2]
is zero (as we saw above in the first example), and yet cor
claims that x[,3]
is perfectly correlated with both x[,1]
and x[,2]
. In other words, the result is nonsense. As Rumsfeld might say, we’ve converted known unknowns into unknown knowns.
What’s going on here is that the reported correlations are not comparable because they are computed against different vectors: all of x[,1]
and x[,2]
are compared to each other, but only parts of x[,1]
and x[,2]
are compared to x[,3]
.
The bad result is obvious for our small example. But the danger here is in large matrices with lots of missing values, where it may be impossible to use the pairwise
option in a meaningful way.
Recommendations
If you want to run correlations on lots of vectors with missing values, consider simply using the R default of use="everything"
and propagating missing values into the correlation matrix. This makes it clear what you don’t know.
If you really don’t want to do that, consider imputing the missing values. The simplest method replaces missing values in each column with the mean of the nonmissing values in the respective column:
m = mean(na.omit(x[,3]))
xi = x
xi[is.na(x)] = m
cor(xi)
## [,1] [,2] [,3]
## [1,] 1.0000000 0.0000000 0.4472136
## [2,] 0.0000000 1.0000000 0.8451543
## [3,] 0.4472136 0.8451543 1.0000000
This can be done really efficiently when “centering” a matrix by simply replacing missing values of the centered matrix with zero.
Sometimes it might make more sense to use a piecewise constant interpolant, referred to as “last observation carry forward” especially when dealing with time series and ordered data. In yet other cases a known default value (perhaps from a much larger population than the one under study) might be more appropriate.
Another basic approach bootstraps the missing values from the nonmissing ones:
i = is.na(x[,3])
N = sum(i)
b = replicate(500, {x[i,3] = sample(x[!i,3], size=N, replace=TRUE);cor(x[,1:2],x[,3])})
# Average imputed values of cor(x[,1],x[,3]) and cor(x[,2],x[,3])
apply(b,1,mean)
## [1] 0.3722048 0.6845172
# Standard deviation of imputed values of cor(x[,1],x[,3]) and cor(x[,2],x[,3])
apply(b,1,sd)
## [1] 0.3361684 0.2156523
If you have lots of observations, consider partitioning them with a basic clustering algorithm first and then imputing the missing values from their respective cluster cohorts. Or consider a matching method or a regressionbased imputation method. See the references below for many more details.
References and packages
There are of course many excellent R packages and references on missing data. I recommend consulting the following packages and references:
 http://www.stat.columbia.edu/~gelman/arm/missing.pdf
 http://cran.rproject.org/web/packages/mi/
 http://gking.harvard.edu/amelia
 http://cran.rproject.org/web/views/OfficialStatistics.html (Look for the “Imputation” section in this task view for a list of R packages.)

The
cov2cor
function used bycor
always puts ones along the diagonal of a correlation matrix; that choice is valid only if all unknowns may assume bounded and valid numeric values which is actually pretty reasonable. But in a rare example of inconsistency in Rcor(x[,3],x[,3])
returnsNA
. Yikes!)↩
Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...