Copulas made easy

[This article was first published on dahtah » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Everyday, a poor soul tries to understand copulas by reading the corresponding Wikipedia page, and gives up in despair. The incomprehensible mess that one finds there gives the impression that copulas are about as accessible as tensor theory, which is a shame, because they are actually a very nice tool. The only prerequisite is knowing the inverse cumulative function trick.

That trick runs as follows: suppose you want to generate samples from some distribution with probability density f(x). All you need is a source of *uniform* random variables, because you can transform these random variables to have the distribution that you want (which is why you can survive on a desert island with nothing but the rand function). Here’s how: if u is a random variable with uniform distribution over [0,1], and if F(x) is the cumulative distribution function corresponding to the density f(x), then:
has the right probability density. This is easy to prove using the classical transformation formula for random variables.

The trick also works in the other direction: if you take x and run it through F(x) than you get back u. So it’s also a way of making uniform random variables out of non-uniform ones.

Now let’s say that what you want to generate two correlated random variables x and y, representing for example “personal wealth” and “cigar consumption”. One obvious way to generate correlated random variables is to use a multivariate Gaussian, but here you can’t assume that your variables have marginal Gaussian distributions – wealth is notably non-Gaussian (I don’t know about cigar consumption). Let’s say that you want them to have marginal distributions f(x) and g(y), but you still want to preserve some king of positive correlation.
Here’s a possible recipe: generate a,b from a correlated Gaussian distribution. Then transform them using the cumulative Gaussian distribution into u = \Phi(a), v = \Phi(b). Now u and v have marginal *uniform* distributions, but are still positively correlated.
Finally transform again to x = F^-1(u), y = G^-1(v) – you still have positive correlation, but the marginals you want. You’ve just used a Gaussian copula. The technical definition of a copula you’ll find on Wikipedia corresponds to the joint probability distribution you have over (u,v), i.e. at the step where you have uniform marginals.

Here’s some R code that illustrates this:

S <- matrix(c(1,.8,.8,1),2,2) #Correlation matrix
AB <- rmvnorm(mean=c(0,0),sig=S,n=1000) #Our gaussian variables
U <- pnorm(AB) #Now U is uniform - check using hist(U[,1]) or hist(U[,2])
x <- qgamma(U[,1],2) #x is gamma distributed
y <- qbeta(U[,2],1,2) #y is beta distributed
plot(x,y) #They correlate!

That sort of stuff is tremendously useful when you want to have a statistical model for joint outcomes (for example when you want to describe how the dependency between wealth and cigar consumption changes depends on the country being US or Cuba).

Another interesting aspect of copulas, more theoretical, is that this also gives you a way of studying dependency independently of what the marginals look like…

To leave a comment for the author, please follow the link and comment on their blog: dahtah » R. offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)