Graphically analyzing variable interactions in R

[This article was first published on Modern Toolmaking, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I studied Ecology as an undergraduate, which meant I spent a lot of time gathering and analyzing field data. One of the basic tools we used to look for relationships in a large set of variables was correlation and scatterplot matrices. Each of these requires a single line of code in R:



The ‘pairs’ function in R contains a lot of additional options, which can be used to make very informative plots. These options can get a little cumbersome, but fortunately several package authors have written wrapper functions that automatically enable some extra magic. Two such packages are psych and PerformanceAnalytics. I happen to prefer the 1 liner from PerformanceAnalytics, but it’s a matter of personal taste:


This chart contains a LOT of information: On the diagonal are the univariate distributions, plotted as histograms and kernel density plots. On the right of the diagonal are the pair-wise correlations, with red stars signifying significance levels. As the correlations get bigger the font size of the coefficient gets bigger. On the left side of the diagonal is the scatter-plot matrix, with loess smoothers in red to help illustrate the underlying relationship. This is one of my favorite plots in R, because it combines a large amount of information into one command and one easy to follow plot. In fact, this plot contains more information than is revealed by the 1st two commands in this post!

Of course, you can use this command on data from other domains besides Ecology. PerformanceAnalytics is intended for the analysis of financial data, so lets put it through its paces. First we download some financial data (a stock index, a bond index, and a gold index) from yahoo finance using quantmod, and then combine the daily close series of those indexes into one dataframe. I’m not 100% happy with the legend in the plot, but I wanted to show how the correlations between these indexes have changed over the years. I also skipped red (color #2) in the plots and in the legend, because the loess smoother is also red.



Finally, I’d like to acknowledge Stephen Turner over on cross-validated for inspiring this post.


To leave a comment for the author, please follow the link and comment on their blog: Modern Toolmaking.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)