No matter how sophisticated you get with your statistical analysis, you’ll usually start off exploring your data the same way. If you’re looking at a single variable, you’ll want to create a histogram to look at the distribution. If you’re trying to compare two variables to see if there is a relationship between them, you’ll want to create a scatter plot. R makes both of these really easy. Let’s look at some code…
motorcars <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/datasets/mtcars.csv", stringsAsFactors = FALSE) par(mfrow = c(1,3)) hist(motorcars$wt, main = "Weight", xlab = "Weight (in 1000s)") hist(motorcars$mpg, main = "MPG", xlab = "Miles Per Gallon") plot(motorcars$wt, motorcars$mpg, main = "Weight Vs MPG", xlab = "Weight (in 1000s)", ylab = "Miles Per Gallon")
First, we pull some data from the web and read it into a data frame called “motorcars.” This data frame compares 32 cars across 11 different variables. You can get an overview of the data frame by calling str(), summary(), or head() on the data frame (i.e. str(motorcars)).
In this example, we are wanting to look specifically at the weight of each car and the miles per gallon clocked for each car. We want to see how the distributions of these variables are spread out, and we also want to see if there is any relationship between them. Does a heavier car actually have an effect on the car’s fuel economy, or is that just an urban legend?
After we pull the data, we use the par() function to describe how we want the plots to be displayed. The mfrow argument creates a matrix of rows and columns that serves as the layout for the plots. In this case, we’ll want three plots laid side-by-side (1 row and 3 columns).
Next, we create the histograms, subsetting columns we are looking for by wedging the $ between the data frame’s name and the column’s name. Technically, this is all we have to do to create the histogram. The other arguments I’ve included are optional arguments to give histograms names and x-axis labels.
Finally, we’ll create the scatter plot with the plot() function. The plot() function takes two arguments: the first is what we want plotted on the x-axis, and the second is what we want plotted on the y-axis. In this case, we’ve chosen to use the weight on the x-axis and the miles per gallon on the y-axis.
After you run this code, here are the plots that are generated:
So, what do you think?
What is the weight range of most motor vehicles? What MPG do most motor vehicles have? Is there any relationship between the two of them?