**Alastair Rushworth**, and kindly contributed to R-bloggers)

### Calculating and visualising correlation coefficients with `inspectdf`

(and why correlations matrices make life hard)

In a previous post, we explored categorical data using the `inspectdf`

package.

In this post, we tackle a different exploratory problem of calculating

and visualising correlation coefficients. To install `inspectdf`

from

CRAN, you’ll first need to run:

```
installed.packages("inspectdf")
```

We’ll begin the tutorial by loading the `inspectdf`

and `dplyr`

packages, the latter we’ll need for some dataframe manipulation.

```
library(inspectdf)
library(dplyr)
```

For this walk-through, we’ll explore the `storms`

dataset which comes

from the `dplyr`

package and has many numeric columns. The data includes

the positions and attributes of 198 tropical storms, measured every six

hours during the lifetime of a storm.

```
# check out the storms dataset
?storms
```

#### What’s wrong with `cor()`

?

Most R users will be familiar with the built-in `stats`

function,

`cor()`

which can be used to produce a matrix of correlation

coefficients of pairs of numeric variables. So why not just use this?

Here’s a short list of pain points that occur when using this function:

##### 1. `cor()`

requires numeric inputs only

Correlations are only defined for numeric pairs of variables, so perhaps

this shouldn’t be a surprise. But it means we can’t simply pass a

dataframe with mixed types to `cor()`

and expect that it will be smart

enough to return correlations for just the numeric columns. Consequently

this fails:

```
cor(storms)
```

```
## Error in cor(storms): 'x' must be numeric
```

##### 2. Correlation matrices are hard to read

It isn’t hard get what we want from `cor()`

by first selecting the

numeric columns using a bit of `dplyr`

:

```
cor(storms %>% select_if(is.numeric))
```

```
## year month day hour
## year 1.000000000 -0.011488006 0.0183703369 0.0015741629
## month -0.011488006 1.000000000 -0.1830702018 -0.0051201358
## day 0.018370337 -0.183070202 1.0000000000 0.0007164624
## hour 0.001574163 -0.005120136 0.0007164624 1.0000000000
## lat -0.121252667 -0.065922836 -0.0508598742 0.0026823666
## long 0.060387523 0.048382680 0.0406477301 -0.0091876627
## wind 0.048966015 0.126682358 -0.0064971154 0.0018333102
## pressure -0.072615741 -0.134238300 -0.0010113895 0.0016030589
## ts_diameter NA NA NA NA
## hu_diameter NA NA NA NA
## lat long wind pressure
## year -0.121252667 0.060387523 0.048966015 -0.072615741
## month -0.065922836 0.048382680 0.126682358 -0.134238300
## day -0.050859874 0.040647730 -0.006497115 -0.001011389
## hour 0.002682367 -0.009187663 0.001833310 0.001603059
## lat 1.000000000 -0.104014683 0.076141764 -0.103772744
## long -0.104014683 1.000000000 0.004737422 0.058467333
## wind 0.076141764 0.004737422 1.000000000 -0.942249266
## pressure -0.103772744 0.058467333 -0.942249266 1.000000000
## ts_diameter NA NA NA NA
## hu_diameter NA NA NA NA
## ts_diameter hu_diameter
## year NA NA
## month NA NA
## day NA NA
## hour NA NA
## lat NA NA
## long NA NA
## wind NA NA
## pressure NA NA
## ts_diameter 1 NA
## hu_diameter NA 1
```

The result is a matrix of pairwise correlations. There are several

problems with this:

- Matrices are great for linear algebra but
**terrible**for visual

inspection. This particular matrix is wide and has been truncated

and spread over multiple lines. - It isn’t easy to tell which variables are most or least correlated

by eye-balling this matrix, it’s a jumble of numbers and the row and

column indices aren’t easy to follow. - Nearly half of the output is totally unnecessary: correlation

matrices are always symmetric, which means that you only need about

half of what is printed. - It’s tricky to do any further analysis of the coefficients in this

format – a dataframe would be handy!

##### 3. `cor()`

doesn’t produce confidence intervals

If possible, we should try to interpret point estimates in the context

of their sampling distribution, for example by considering a confidence

interval.

Confidence intervals aren’t available using `cor()`

, although can be

generated using `cor.test()`

. A big draw back here is that intervals and

perform hypothesis tests can only be performed *one at a time* – we may

want this for many (or all) correlation coefficients.

##### 4. `cor()`

and `cor.test()`

don’t provide visualisation methods out of the box

Tables are all very well, but it’s much easier to use graphics to

visually interrogate correlations. There are many other packages that do

help with this, but in general they use a matrix or grid plot with

coloured cells to display correlations which are typically messy and

difficult to read.

#### Using `inspect_cor()`

to calculate correlations

`inspect_cor()`

attempts to address some of the issues above. To

calculate correlations for the `storms`

data, simply run

```
storms %>% inspect_cor()
```

```
## # A tibble: 45 x 6
## col_1 col_2 corr p_value lower upper
##
```
## 1 pressure wind -0.942 0. -0.945 -0.940
## 2 hu_diameter pressure -0.842 0. -0.853 -0.831
## 3 hu_diameter wind 0.774 0. 0.758 0.788
## 4 hu_diameter ts_diameter 0.684 0. 0.663 0.704
## 5 ts_diameter pressure -0.683 0. -0.703 -0.663
## 6 ts_diameter wind 0.640 0. 0.617 0.662
## 7 ts_diameter lat 0.301 1.25e-73 0.266 0.335
## 8 day month -0.183 3.59e-76 -0.205 -0.161
## 9 hu_diameter lat 0.164 1.59e-22 0.127 0.201
## 10 ts_diameter month 0.139 1.67e-16 0.102 0.176
## # … with 35 more rows

The result is tabular rather than a matrix. Together, the first two

columns contain the names of every unique pair of numeric columns, while

the `corr`

column contains the correlation coefficients. For example,

the first row says that the correlation between `pressure`

and `wind`

is

about -0.942. The rows are arranged in descending order of the

*absolute* correlation – making it easy to see which pairs are most

strongly correlated.

The `p_value`

column contains p-values associated with the null

hypothesis that the true correlation coefficient is 0. The `lower`

and

`upper`

columns contain the lower and upper reaches of a 95% confidence

interval. In this case, the confidence interval for the correlation

between `pressure`

and `wind`

is `(-0.945, -0.940)`

. The interval type

can be changing the `alpha`

argument in `inspect_cor()`

, for example 90%

confidence intervals can be generated using `inspect_cor(storms, alpha`

.

= 0.1)

##### Using `show_plot()`

to visualise correlation coefficients

The dataframe of coefficients above is already a bit easier to handle

than `cor()`

’s matrix output. We can go further and visualise these

graphically using `show_plot()`

:

```
storms %>%
inspect_cor() %>%
show_plot()
```

Some key points

- Each row in the plot corresponds to a unique pair of numeric

columns, the correlation coefficient is show as a black vertical

line. - The gray and pink bars around the coefficients are the confidence

intervals. - The gray bars are confidence intervals that straddle 0 (also shown

by the long vertical dashed line) indicating that the true

coefficient is not significantly different to 0.

A side note that is not specific to `inspect_cor()`

is that we should be

careful when interpreting the significance of individual coefficients

when there are many correlation coefficients overall. For example, if

`alpha = 0.05`

and all of the true coefficients are 0, we’d still expect

to see 1 in 20 significant coefficients *just by chance*.

##### Using `inspect_cor()`

and `show_plot()`

to visualise the correlation with a single feature

Another common exploratory step is to assess the linear association

between possible predictor variables and a target variable, often as a

precursor to regression analysis or building a predictive model.

As an example, suppose we’d like to see which features of a storm are

most strongly correlated with `wind`

, the maximum sustained wind speed

of the storm. We don’t need to calculate all correlation coefficients

for this (for big data sets this is time consuming), only the ones that

involve the `wind`

variable.

With `inspect_cor()`

this is also straightforward, by simply adding the

argument `with_col = "wind"`

:

```
storms %>%
inspect_cor(with_col = "wind") %>%
show_plot()
```

The strongest association here is with `pressure`

, the air pressure at

the storm’s center. I have very little meteorological experience but it

seems sensible that those should be strongly associated.

#### Comments? Suggestions? Issues?

Any feedback is welcome! Find me on twitter at

rushworth_a or write a github

issue.

**leave a comment**for the author, please follow the link and comment on their blog:

**Alastair Rushworth**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...