So, in my last post, I showed how to create two histograms from a certain data set and then how to plot the two variables to see if there is any relationship. Visually, it was easy to tell that there was a negative relationship between the weight of an automobile and the fuel economy of an automobile. But, is there a more objective way to understand the relationship? Is there a number we can assign to it?
Yes, it turns out there is. This number is called Pearson’s correlation coefficient or, in the vernacular, simply the “correlation.” Essentially, this number measures the strength and direction of the linear relationship between two variables. A correlation of 1 means the variables move together along a perfect straight line, a correlation of -1 means they move in perfectly opposite directions, and a correlation of 0 means there is no linear relationship between the two variables (they could still be related in some nonlinear way).
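For the curious, here is the standard textbook definition of Pearson’s coefficient, written for n paired observations. The symbols (x_i, y_i for the individual observations, and x-bar and y-bar for the sample means) are my own notation for this sketch, not something from the data set.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The numerator captures how the two variables vary together, and the denominator rescales that by how much each variable varies on its own, which is why the result always lands between -1 and 1.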
So, how do we retrieve the correlation between two variables in R? Let’s write some code…
motorcars <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/datasets/mtcars.csv", stringsAsFactors = FALSE)
cor(motorcars$wt, motorcars$mpg)
plot(motorcars$wt, motorcars$mpg)
First, we import the same data set we used last time. When we view the data set (using colnames() or head()), we see that the column names for the variables we are trying to measure are “wt” and “mpg.” Now, all we need to do is subset these two variables with the dollar sign and place them within the cor() function.
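When we run this code, we can see that the correlation is -0.87, which indicates a strong negative linear relationship: heavier cars tend to get fewer miles per gallon. Note that this does not mean the two variables move in opposite directions 87% of the time. If you want a percentage, square the correlation: (-0.87)^2 is about 0.75, so roughly 75% of the variation in mpg can be accounted for by a straight-line relationship with weight. So, that’s it. You’ve run a correlation in R. If you plot the two variables using the plot() function, you can see that this relationship is fairly clear visually.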
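If you want to check these numbers yourself, here is a minimal sketch. It uses R’s built-in mtcars data frame, which I’m assuming holds the same values as the CSV we loaded above, so that it runs without a network connection. It computes the correlation by hand from the covariance and standard deviations, squares it, and compares the result to the R-squared of a simple linear model.

```r
# Built-in mtcars is assumed to match the CSV used in this post.
wt  <- mtcars$wt
mpg <- mtcars$mpg

# Correlation from the built-in function
r <- cor(wt, mpg)

# The same thing by hand: covariance divided by the product
# of the two standard deviations
r_manual <- cov(wt, mpg) / (sd(wt) * sd(mpg))

# Squaring r gives the share of variance in mpg accounted for
# by a straight-line relationship with wt
r_squared <- r^2

# For a linear model with one predictor, R-squared equals r^2
fit <- lm(mpg ~ wt)
model_r_squared <- summary(fit)$r.squared

print(round(c(r = r, r_manual = r_manual,
              r_squared = r_squared,
              model_r_squared = model_r_squared), 4))
```

The two correlation values should agree, and so should the two R-squared values, which is a handy sanity check on the interpretation.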
But wait. Could there be other things that are related to the fuel economy of the vehicle, besides weight? What else is in the data set? Let’s have a look. When we run the head() function on motorcars, we get the first 6 rows of every column in the data set.
What if we want to see how all of these variables are related to one another? Well, we could run a correlation on every single combination we can think of, but that would be tedious. Is there a way we can view all the correlations with a single line of code? Yes, there is.
mc_data <- motorcars[, 2:length(motorcars)]
round(cor(mc_data), 2)
First, we create a separate data frame that includes only the numeric columns from motorcars, that is, everything to the right of the vehicle model name. We need to drop that first column because cor() only works on numeric data. Then, we simply run a correlation on the new data frame, which we’ve called “mc_data.” To clean things up a bit, I’ve nested the cor() function within the round() function to round the result to two decimal places. When we enter this code, we get the full correlation matrix, with one row and one column for each variable.
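The full matrix can be a lot to scan when you only care about one variable. Here is a small sketch that pulls out just the mpg column and ranks the other variables by the strength of their correlation with it. Again, I’m using the built-in mtcars data frame on the assumption that it has the same numeric columns as mc_data, and the names cor_matrix, mpg_cor, and mpg_ranked are just ones I made up for this example.

```r
# Built-in mtcars is assumed to hold the same numeric columns as mc_data.
cor_matrix <- cor(mtcars)

# Keep only the mpg column, then drop mpg's correlation with itself
mpg_cor <- cor_matrix[, "mpg"]
mpg_cor <- mpg_cor[names(mpg_cor) != "mpg"]

# Order by absolute strength so strong negative and strong positive
# correlations both float to the top
mpg_ranked <- mpg_cor[order(abs(mpg_cor), decreasing = TRUE)]

print(round(mpg_ranked, 2))
```

Sorting on the absolute value matters here, because a correlation of -0.8 is just as strong as one of 0.8.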
We can see that there are several other variables that are related to mpg, such as cyl, disp, and hp. Now, we can plot the variables that are most correlated with miles per gallon using this code (refer to previous post for explanation).
par(mfrow = c(2, 2))
plot(motorcars$wt, motorcars$mpg)
plot(motorcars$cyl, motorcars$mpg)
plot(motorcars$disp, motorcars$mpg)
plot(motorcars$hp, motorcars$mpg)
And, as a result, we get a two-by-two grid of scatter plots, one for each of these variables against mpg.
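If you’d rather not repeat the plot() call four times, here is one possible variation, again sketched against the built-in mtcars data frame (assumed to match our CSV). It loops over the variable names, labels the axes, puts each correlation in the plot title, and restores the original plotting layout when it’s done. The names vars and old_par are just placeholders I chose.

```r
# Built-in mtcars is assumed to match the CSV data used above.
vars <- c("wt", "cyl", "disp", "hp")

# par() returns the previous settings, so we can restore them later
old_par <- par(mfrow = c(2, 2))

for (v in vars) {
  # Correlation between this variable and mpg, shown in the title
  r <- cor(mtcars[[v]], mtcars$mpg)
  plot(mtcars[[v]], mtcars$mpg,
       xlab = v, ylab = "mpg",
       main = paste0(v, " vs mpg (r = ", round(r, 2), ")"))
}

# Put the plotting layout back the way it was
par(old_par)
```

Restoring the layout at the end means the next plot you draw won’t be squeezed into a quarter of the window.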