Calculating And Visualising Correlation Coefficients With Inspectdf

[This article was first published on Alastair Rushworth, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Calculating and visualising correlation coefficients with inspectdf (and why correlations matrices make life hard)

In a previous post, we explored categorical data using the inspectdf
In this post, we tackle a different exploratory problem of calculating
and visualising correlation coefficients. To install inspectdf from
CRAN, you’ll first need to run:

<span class="n">installed.packages</span><span class="p">(</span><span class="s2">"inspectdf"</span><span class="p">)</span><span class="w">

We’ll begin the tutorial by loading the inspectdf and dplyr
packages, the latter we’ll need for some dataframe manipulation.

<span class="n">library</span><span class="p">(</span><span class="n">inspectdf</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span><span class="w">

For this walk-through, we’ll explore the storms dataset which comes
from the dplyr package and has many numeric columns. The data includes
the positions and attributes of 198 tropical storms, measured every six
hours during the lifetime of a storm.

<span class="c1"># check out the storms dataset</span><span class="w">
</span><span class="o">?</span><span class="n">storms</span><span class="w">

What’s wrong with cor()?

Most R users will be familiar with the built-in stats function,
cor() which can be used to produce a matrix of correlation
coefficients of pairs of numeric variables. So why not just use this?
Here’s a short list of pain points that occur when using this function:

1. cor() requires numeric inputs only

Correlations are only defined for numeric pairs of variables, so perhaps
this shouldn’t be a surprise. But it means we can’t simply pass a
dataframe with mixed types to cor() and expect that it will be smart
enough to return correlations for just the numeric columns. Consequently
this fails:

<span class="n">cor</span><span class="p">(</span><span class="n">storms</span><span class="p">)</span><span class="w">
## Error in cor(storms): 'x' must be numeric
2. Correlation matrices are hard to read

It isn’t hard get what we want from cor() by first selecting the
numeric columns using a bit of dplyr:

<span class="n">cor</span><span class="p">(</span><span class="n">storms</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">select_if</span><span class="p">(</span><span class="n">is.numeric</span><span class="p">))</span><span class="w">
##                     year        month           day          hour
## year         1.000000000 -0.011488006  0.0183703369  0.0015741629
## month       -0.011488006  1.000000000 -0.1830702018 -0.0051201358
## day          0.018370337 -0.183070202  1.0000000000  0.0007164624
## hour         0.001574163 -0.005120136  0.0007164624  1.0000000000
## lat         -0.121252667 -0.065922836 -0.0508598742  0.0026823666
## long         0.060387523  0.048382680  0.0406477301 -0.0091876627
## wind         0.048966015  0.126682358 -0.0064971154  0.0018333102
## pressure    -0.072615741 -0.134238300 -0.0010113895  0.0016030589
## ts_diameter           NA           NA            NA            NA
## hu_diameter           NA           NA            NA            NA
##                      lat         long         wind     pressure
## year        -0.121252667  0.060387523  0.048966015 -0.072615741
## month       -0.065922836  0.048382680  0.126682358 -0.134238300
## day         -0.050859874  0.040647730 -0.006497115 -0.001011389
## hour         0.002682367 -0.009187663  0.001833310  0.001603059
## lat          1.000000000 -0.104014683  0.076141764 -0.103772744
## long        -0.104014683  1.000000000  0.004737422  0.058467333
## wind         0.076141764  0.004737422  1.000000000 -0.942249266
## pressure    -0.103772744  0.058467333 -0.942249266  1.000000000
## ts_diameter           NA           NA           NA           NA
## hu_diameter           NA           NA           NA           NA
##             ts_diameter hu_diameter
## year                 NA          NA
## month                NA          NA
## day                  NA          NA
## hour                 NA          NA
## lat                  NA          NA
## long                 NA          NA
## wind                 NA          NA
## pressure             NA          NA
## ts_diameter           1          NA
## hu_diameter          NA           1

The result is a matrix of pairwise correlations. There are several
problems with this:

  • Matrices are great for linear algebra but terrible for visual
    inspection. This particular matrix is wide and has been truncated
    and spread over multiple lines.
  • It isn’t easy to tell which variables are most or least correlated
    by eye-balling this matrix, it’s a jumble of numbers and the row and
    column indices aren’t easy to follow.
  • Nearly half of the output is totally unnecessary: correlation
    matrices are always symmetric, which means that you only need about
    half of what is printed.
  • It’s tricky to do any further analysis of the coefficients in this
    format – a dataframe would be handy!
3. cor() doesn’t produce confidence intervals

If possible, we should try to interpret point estimates in the context
of their sampling distribution, for example by considering a confidence

Confidence intervals aren’t available using cor(), although can be
generated using cor.test(). A big draw back here is that intervals and
perform hypothesis tests can only be performed one at a time – we may
want this for many (or all) correlation coefficients.

4. cor() and cor.test() don’t provide visualisation methods out of the box

Tables are all very well, but it’s much easier to use graphics to
visually interrogate correlations. There are many other packages that do
help with this, but in general they use a matrix or grid plot with
coloured cells to display correlations which are typically messy and
difficult to read.

Using inspect_cor() to calculate correlations

inspect_cor() attempts to address some of the issues above. To
calculate correlations for the storms data, simply run

<span class="n">storms</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">inspect_cor</span><span class="p">()</span><span class="w">
## # A tibble: 45 x 6
##    col_1       col_2         corr  p_value  lower  upper
##    <chr>       <chr>        <dbl>    <dbl>  <dbl>  <dbl>
##  1 pressure    wind        -0.942 0.       -0.945 -0.940
##  2 hu_diameter pressure    -0.842 0.       -0.853 -0.831
##  3 hu_diameter wind         0.774 0.        0.758  0.788
##  4 hu_diameter ts_diameter  0.684 0.        0.663  0.704
##  5 ts_diameter pressure    -0.683 0.       -0.703 -0.663
##  6 ts_diameter wind         0.640 0.        0.617  0.662
##  7 ts_diameter lat          0.301 1.25e-73  0.266  0.335
##  8 day         month       -0.183 3.59e-76 -0.205 -0.161
##  9 hu_diameter lat          0.164 1.59e-22  0.127  0.201
## 10 ts_diameter month        0.139 1.67e-16  0.102  0.176
## # … with 35 more rows

The result is tabular rather than a matrix. Together, the first two
columns contain the names of every unique pair of numeric columns, while
the corr column contains the correlation coefficients. For example,
the first row says that the correlation between pressure and wind is
about -0.942. The rows are arranged in descending order of the
absolute correlation – making it easy to see which pairs are most
strongly correlated.

The p_value column contains p-values associated with the null
hypothesis that the true correlation coefficient is 0. The lower and
upper columns contain the lower and upper reaches of a 95% confidence
interval. In this case, the confidence interval for the correlation
between pressure and wind is (-0.945, -0.940). The interval type
can be changing the alpha argument in inspect_cor(), for example 90%
confidence intervals can be generated using inspect_cor(storms, alpha
= 0.1)

Using show_plot() to visualise correlation coefficients

The dataframe of coefficients above is already a bit easier to handle
than cor()’s matrix output. We can go further and visualise these
graphically using show_plot():

<span class="n">storms</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">inspect_cor</span><span class="p">()</span><span class="w"> </span><span class="o">%>%</span><span class="w">
  </span><span class="n">show_plot</span><span class="p">()</span><span class="w">

Some key points

  • Each row in the plot corresponds to a unique pair of numeric
    columns, the correlation coefficient is show as a black vertical
  • The gray and pink bars around the coefficients are the confidence
  • The gray bars are confidence intervals that straddle 0 (also shown
    by the long vertical dashed line) indicating that the true
    coefficient is not significantly different to 0.

A side note that is not specific to inspect_cor() is that we should be
careful when interpreting the significance of individual coefficients
when there are many correlation coefficients overall. For example, if
alpha = 0.05 and all of the true coefficients are 0, we’d still expect
to see 1 in 20 significant coefficients just by chance.

Using inspect_cor() and show_plot() to visualise the correlation with a single feature

Another common exploratory step is to assess the linear association
between possible predictor variables and a target variable, often as a
precursor to regression analysis or building a predictive model.

As an example, suppose we’d like to see which features of a storm are
most strongly correlated with wind, the maximum sustained wind speed
of the storm. We don’t need to calculate all correlation coefficients
for this (for big data sets this is time consuming), only the ones that
involve the wind variable.

With inspect_cor() this is also straightforward, by simply adding the
argument with_col = "wind":

<span class="n">storms</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">inspect_cor</span><span class="p">(</span><span class="n">with_col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"wind"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
  </span><span class="n">show_plot</span><span class="p">()</span><span class="w">

The strongest association here is with pressure, the air pressure at
the storm’s center. I have very little meteorological experience but it
seems sensible that those should be strongly associated.

Comments? Suggestions? Issues?

Any feedback is welcome! Find me on twitter at
rushworth_a or write a github

To leave a comment for the author, please follow the link and comment on their blog: Alastair Rushworth. offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)