# Generate correlation matrices with complex survey data in R

**R on Jacob Long**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The `survey`

package is one of R’s best tools for those working in the

social sciences. For many, it saves you from needing to use commercial

software for research that uses survey data. However, it lacks one

function that many academic researchers often need to report in

publications: correlations. The `svycor`

function in `jtools`

(more info) helps to

fill that gap.

An initial note, however, is necessary. The basic method behind this

feature comes from a response to a

question

about calculating correlations with the `survey`

package written by

Thomas Lumley, the `survey`

package author—he has not seen (to my knowledge) or endorsed this function. All that is good about this

function should be attributed to Dr. Lumley; all that is wrong with it

should be attributed to me (Jacob).

With that said, let’s look at an example. First, we need to get a

`survey.design`

object. This one is built into the `survey`

package.

library(survey) data(api) dstrat <- svydesign(id = ~1,strata = ~stype, weights = ~pw, data = apistrat, fpc=~fpc)

## Basic use

The necessary arguments are no different than when using `svyvar`

.

Specify, using an equation, which variables (and from which design) to

include. It doesn’t matter which side of the equation the variables are

on.

svycor(~api00 + api99, design = dstrat) ## api00 api99 ## api00 1.00 0.98 ## api99 0.98 1.00

You can specify with the `digits =`

argument how many digits past the

decimal point should be printed.

svycor(~api00 + api99, design = dstrat, digits = 4) ## api00 api99 ## api00 1.0000 0.9759 ## api99 0.9759 1.0000

Any other arguments that you would normally pass to `svyvar`

will be

used as well, though in some cases it may not affect the output.

## Statistical significance tests

One thing that `survey`

won’t do for you is give you *p* values for the

null hypothesis that *r* = 0. While at first blush finding the *p* value

might seem like a simple procedure, complex surveys will almost

always violate the important distributional assumptions that go along with

simple hypothesis tests of the correlation coefficient. There is not a

clear consensus on the appropriate way to conduct hypothesis tests in

this context, due in part to the fact that most analyses of complex

surveys occurs in the context of multiple regression rather than simple bivariate cases.

If `sig.stats = TRUE`

, then `svycor`

will use the `wtd.cor`

function

from the `weights`

package to conduct hypothesis tests. The *p* values

are derived from a bootstrap procedure in which the weights define

sampling probability. The `bootn =`

argument is given to `wtd.cor`

to

define the number of simulations to run. This can significantly increase

the running time for large samples and/or large numbers of bootstraps.

The `mean1`

argument tells `wtd.cor`

whether it should treat your sample size

as the number of observations in the survey design (the number of rows

in the data frame) or the sum of the weights. Usually, the former is

desired, so the default value of `mean1`

is `TRUE`

.

svycor(~api00 + api99, design = dstrat, digits = 4, sig.stats = TRUE, bootn = 2000, mean1 = TRUE) ## api00 api99 ## api00 1 0.9759* ## api99 0.9759* 1

When using `sig.stats = TRUE`

, the correlation parameter estimates come

from the bootstrap procedure rather than the simpler method based

on the survey-weighted covariance matrix when `sig.stats = FALSE`

.

By saving the output of the function, you can extract non-rounded

coefficients, *p* values, and standard errors.

c <- svycor(~api00 + api99, design = dstrat, digits = 4, sig.stats = TRUE, bootn = 2000, mean1 = TRUE) c$cors ## api00 api99 ## api00 1.0000000 0.9759047 ## api99 0.9759047 1.0000000 c$p.values ## api00 api99 ## api00 0 0 ## api99 0 0 c$std.err ## api00 api99 ## api00 0.000000000 0.003467371 ## api99 0.003467371 0.000000000

## Technical details

The heavy lifting behind the scenes is done by `svyvar`

, which from its

output you may not realize also calculates covariance.

svyvar(~api00 + api99, design = dstrat) ## variance SE ## api00 15191 1255.7 ## api99 16518 1318.4

But if you save the `svyvar`

object, you can see that there’s more than

meets the eye.

var <- svyvar(~api00 + api99, design = dstrat) var <- as.matrix(var) var ## api00 api99 ## api00 15190.59 15458.83 ## api99 15458.83 16518.24 ## attr(,"var") ## api00 api00 api99 api99 ## api00 1576883 1580654 1580654 1561998 ## api00 1580654 1630856 1630856 1657352 ## api99 1580654 1630856 1630856 1657352 ## api99 1561998 1657352 1657352 1738266 ## attr(,"statistic") ## [1] "variance"

Once we know that, it’s just a matter of using R’s `cov2cor`

function

and cleaning up the output.

cor <- cov2cor(var) cor ## api00 api99 ## api00 1.0000000 0.9759047 ## api99 0.9759047 1.0000000 ## attr(,"var") ## api00 api00 api99 api99 ## api00 1576883 1580654 1580654 1561998 ## api00 1580654 1630856 1630856 1657352 ## api99 1580654 1630856 1630856 1657352 ## api99 1561998 1657352 1657352 1738266 ## attr(,"statistic") ## [1] "variance"

Now to get rid of that covariance matrix…

cor <- cor[1:nrow(cor), 1:nrow(cor)] cor ## api00 api99 ## api00 1.0000000 0.9759047 ## api99 0.9759047 1.0000000

`svycor`

has its own print method, so you won’t see so many digits past

the decimal point. You can extract the un-rounded matrix, however.

out <- svycor(~api99 + api00, design = dstrat) out$cors ## api99 api00 ## api99 1.0000000 0.9759047 ## api00 0.9759047 1.0000000

## Suggestions welcome

Whether you see a problem with the method, found a bug, or have a suggestion for an enhancement, I’d like to hear it. Bug reports on Github are probably the best way to do that, but I’ll keep an eye out for comments here as well.

**leave a comment**for the author, please follow the link and comment on their blog:

**R on Jacob Long**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.