Visualizations for correlation matrices in R

October 23, 2018

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)


    When working with data, it is helpful to build a correlation matrix to describe the data and the associations between variables. In this article, you will learn how to visualize correlation matrices in R.

    Load packages into R

    First we need to load the packages into the R session. For descriptive statistics of the dataset we use the skimr package, and for visualization of the correlation matrix we use the corrplot package. We will work with the winddata dataset of wind measurements from the bReeze package:

    # Load packages into the R session
    library(bReeze)
    library(corrplot)
    library(skimr)
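
If one of these packages is not installed, library() stops with an error. A minimal sketch of a check to run beforehand is shown here. The helper name missing_packages is hypothetical and not part of any package, and the install.packages() call is left commented out on the assumption that the packages are available from your configured repository.

```r
# Hypothetical helper: return the names of packages that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("bReeze", "corrplot", "skimr")
to_install <- missing_packages(needed)

if (length(to_install) > 0) {
  message("Not installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install (assumes repository access)
}
```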
    

    Load dataset & data management

    Next it is time to load the dataset and do some data management:

    # Load the dataset into R
    data("winddata")
    colnames(winddata)
    # Data management: make columns 2 to 17 numeric for the correlation analysis
    winddata.cor <- as.data.frame(lapply(winddata[2:17], as.numeric))
    # Descriptive statistics of the dataset
    skim(winddata.cor)
    Skim summary statistics
     n obs: 36548 
     n variables: 16 
    
    -- Variable type:numeric -------------------------------------------------------
         variable missing complete     n   mean     sd p0   p25    p50    p75   p100
     dir1_40m_avg       0    36548 36548 174.44 113.45  0 47.85 200    247.34 360
     dir1_40m_std       0    36548 36548  14.81  11.08  0  8.33  12.43  17.07  80.42
     dir2_30m_avg       0    36548 36548 172.6  112.63  0 46.89 197    244.32 360
     dir2_30m_std       0    36548 36548  14.97  11.65  0  8.26  11.66  17.18  79.62
       v1_40m_avg       0    36548 36548   4.47   3.19  0  1.97   4.11   6.32  20.62
       v1_40m_max       0    36548 36548   6.75   4.24  0  3.79   6.44   9.1   30.35
       v1_40m_min       0    36548 36548   2.54   2.27  0  0.37   2.27   3.79  15.17
       v1_40m_std       0    36548 36548   0.82   0.47  0  0.52   0.78   1.1    4.26
       v2_30m_avg       0    36548 36548   4.26   3.09  0  1.85   3.91   6.01  19.98
       v2_30m_max       0    36548 36548   6.57   4.18  0  3.38   6.04   9.08  30.36
       v2_30m_min       0    36548 36548   2.35   2.15  0  0.34   1.86   3.38  14.78
       v2_30m_std       0    36548 36548   0.83   0.47  0  0.52   0.79   1.1    4.38
       v3_20m_avg       0    36548 36548   4.12   2.98  0  1.81   3.76   5.76  19.5
       v3_20m_max       0    36548 36548   6.45   4.09  0  3.41   6.06   8.72  29.21
       v3_20m_min       0    36548 36548   2.23   2.05  0  0.37   1.89   3.41  14.41
       v3_20m_std       0    36548 36548   0.83   0.46  0  0.52   0.79   1.11   4.04
    

    The dataset is clean, with no missing values in any of the 16 variables, so we can begin the visualization analysis.
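
The check for missing values matters because cor() returns NA for any pair that involves a missing observation unless told otherwise. The sketch below illustrates this with the built-in mtcars data as a stand-in, so it runs without bReeze. The object names mt, na_counts and mt_na are hypothetical and not part of the original analysis.

```r
# Built-in stand-in for winddata.cor (an assumption made for illustration only).
mt <- mtcars

# Count missing values per column and confirm every column is numeric.
na_counts   <- colSums(is.na(mt))
all_numeric <- all(vapply(mt, is.numeric, logical(1)))
print(na_counts)
print(all_numeric)

# Introduce one missing value: by default cor() propagates it as NA.
mt_na <- mt
mt_na$mpg[1] <- NA
print(cor(mt_na)["mpg", "wt"])

# Dropping incomplete observations pair by pair gives a usable coefficient.
print(cor(mt_na, use = "pairwise.complete.obs")["mpg", "wt"])
```

Because winddata.cor has no missing values, the plain cor() call used in this article is sufficient.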

    Visualizations for correlation matrix

    First let us make a correlation matrix table:

    # Create correlation matrix of data
    res <- cor(winddata.cor) # Corr matrix
    round(res, 2)
                 v1_40m_avg v1_40m_max v1_40m_min v1_40m_std v2_30m_avg v2_30m_max v2_30m_min v2_30m_std v3_20m_avg v3_20m_max
    v1_40m_avg         1.00       0.97       0.95       0.72       1.00       0.97       0.95       0.74       0.99       0.97
    v1_40m_max         0.97       1.00       0.89       0.83       0.97       0.99       0.89       0.84       0.96       0.98
    v1_40m_min         0.95       0.89       1.00       0.52       0.95       0.89       0.98       0.55       0.95       0.89
    v1_40m_std         0.72       0.83       0.52       1.00       0.71       0.83       0.53       0.98       0.70       0.82
    v2_30m_avg         1.00       0.97       0.95       0.71       1.00       0.97       0.95       0.74       0.99       0.97
    v2_30m_max         0.97       0.99       0.89       0.83       0.97       1.00       0.89       0.85       0.97       0.99
    v2_30m_min         0.95       0.89       0.98       0.53       0.95       0.89       1.00       0.55       0.95       0.89
    v2_30m_std         0.74       0.84       0.55       0.98       0.74       0.85       0.55       1.00       0.73       0.84
    v3_20m_avg         0.99       0.96       0.95       0.70       0.99       0.97       0.95       0.73       1.00       0.97
    v3_20m_max         0.97       0.98       0.89       0.82       0.97       0.99       0.89       0.84       0.97       1.00
    v3_20m_min         0.94       0.88       0.97       0.53       0.94       0.88       0.98       0.55       0.95       0.89
    v3_20m_std         0.75       0.85       0.57       0.95       0.75       0.85       0.57       0.97       0.75       0.86
    dir1_40m_avg      -0.10      -0.06      -0.11       0.03      -0.07      -0.04      -0.08       0.03      -0.07      -0.04
    dir1_40m_std      -0.33      -0.23      -0.37       0.04      -0.33      -0.23      -0.37       0.01      -0.32      -0.23
    dir2_30m_avg      -0.03       0.02      -0.06       0.10       0.00       0.03      -0.03       0.11       0.00       0.03
    dir2_30m_std      -0.37      -0.27      -0.41       0.00      -0.37      -0.27      -0.40      -0.03      -0.37      -0.28
                 v3_20m_min v3_20m_std dir1_40m_avg dir1_40m_std dir2_30m_avg dir2_30m_std
    v1_40m_avg         0.94       0.75        -0.10        -0.33        -0.03        -0.37
    v1_40m_max         0.88       0.85        -0.06        -0.23         0.02        -0.27
    v1_40m_min         0.97       0.57        -0.11        -0.37        -0.06        -0.41
    v1_40m_std         0.53       0.95         0.03         0.04         0.10         0.00
    v2_30m_avg         0.94       0.75        -0.07        -0.33         0.00        -0.37
    v2_30m_max         0.88       0.85        -0.04        -0.23         0.03        -0.27
    v2_30m_min         0.98       0.57        -0.08        -0.37        -0.03        -0.40
    v2_30m_std         0.55       0.97         0.03         0.01         0.11        -0.03
    v3_20m_avg         0.95       0.75        -0.07        -0.32         0.00        -0.37
    v3_20m_max         0.89       0.86        -0.04        -0.23         0.03        -0.28
    v3_20m_min         1.00       0.56        -0.06        -0.36        -0.01        -0.40
    v3_20m_std         0.56       1.00         0.01        -0.01         0.08        -0.06
    dir1_40m_avg      -0.06       0.01         1.00         0.06         0.84         0.13
    dir1_40m_std      -0.36      -0.01         0.06         1.00         0.07         0.86
    dir2_30m_avg      -0.01       0.08         0.84         0.07         1.00         0.12
    dir2_30m_std      -0.40      -0.06         0.13         0.86         0.12         1.00
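
When a printed matrix is this large, the strongest relationships can also be pulled out programmatically. The following is only a sketch and not part of the original analysis: it uses the built-in mtcars data as a stand-in so that it runs without bReeze, and the names res_toy, idx and pair_tab are hypothetical. With the data from this article, res would take the place of res_toy.

```r
# Built-in stand-in for the article's correlation matrix (assumption).
res_toy <- cor(mtcars)

# Keep each variable pair once: upper triangle only, diagonal excluded.
idx <- which(upper.tri(res_toy), arr.ind = TRUE)
pair_tab <- data.frame(
  var1 = rownames(res_toy)[idx[, "row"]],
  var2 = colnames(res_toy)[idx[, "col"]],
  r    = round(res_toy[idx], 2)
)

# Sort by absolute correlation, strongest first, and show the top rows.
pair_tab <- pair_tab[order(abs(pair_tab$r), decreasing = TRUE), ]
print(head(pair_tab, 5))
```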
    

    The full correlation table is quite hard to read: you end up with a lot of correlation coefficients that are hard to interpret at a glance. Let us turn them into a correlation matrix visualization with the corrplot package:

    corrplot(cor(winddata.cor), method = "circle")
    

    The above code draws the correlation matrix with each coefficient shown as a circle.

    It is also possible to draw the correlation matrix with other shapes:

    corrplot(cor(winddata.cor), method = "ellipse")
    corrplot(cor(winddata.cor), method = "square")
    corrplot(cor(winddata.cor), method = "color")
    

    The above code produces three correlation matrix visualizations: one with the ellipse shape, one with the square shape, and one with the color shape.
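
To compare the shapes directly, the plots can be drawn side by side on one page. The sketch below assumes that corrplot works with a base-graphics panel layout set through par(mfrow = ...), and it uses the built-in mtcars data as a stand-in so that it runs without bReeze. The names res_toy and shapes are hypothetical. It also computes the correlation matrix once and reuses it, rather than calling cor() for every plot.

```r
# Built-in stand-in for winddata.cor (assumption); compute the matrix once.
res_toy <- cor(mtcars)

shapes <- c("circle", "ellipse", "square", "color")

# Draw only if corrplot is installed, so the script does not fail otherwise.
if (requireNamespace("corrplot", quietly = TRUE)) {
  old_par <- par(mfrow = c(2, 2))  # 2 x 2 grid of panels
  for (s in shapes) {
    # 'title' labels each panel; 'mar' leaves room at the top for the title.
    corrplot::corrplot(res_toy, method = s, title = s, mar = c(0, 0, 2, 0))
  }
  par(old_par)                     # restore the previous layout
}
```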

    Lastly, because a correlation matrix is symmetric, you can show only its upper triangle:

    corrplot(cor(winddata.cor), method = "color", type = "upper")
    

    This draws only the upper triangle of the correlation matrix, using the color shape.
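
A few further corrplot arguments can make such a plot easier to read. The sketch below is a suggestion rather than part of the original analysis: it groups similar variables with order = "hclust", prints the coefficients inside the cells with addCoef.col, and hides the diagonal. It uses the built-in mtcars data as a stand-in so that it runs without bReeze, and the name res_toy is hypothetical.

```r
# Built-in stand-in for winddata.cor (assumption made for illustration).
res_toy <- cor(mtcars)

# Draw only if corrplot is installed, so the script does not fail otherwise.
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(
    res_toy,
    method      = "color",
    type        = "upper",
    order       = "hclust",  # reorder variables by hierarchical clustering
    addCoef.col = "black",   # print each coefficient inside its cell
    number.cex  = 0.7,       # shrink the printed coefficients
    tl.col      = "black",   # colour of the variable labels
    tl.srt      = 45,        # rotate the top labels by 45 degrees
    diag        = FALSE      # hide the diagonal of ones
  )
}
```

With 16 variables, as in winddata.cor, a smaller number.cex may be needed for the coefficients to fit inside the cells.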

    References

    1. Using bReeze in R – CRAN.R-project.org
    2. Using skimr in R – CRAN.R-project.org
    3. Using corrplot in R – CRAN.R-project.org
