Scatterplot matrices are a great way to roughly determine if you have a linear correlation between multiple variables. This is particularly helpful in pinpointing specific variables that might have similar correlations to your genomic or proteomic data. If you already have data with multiple variables, load it up as described here.
If not, no worries, because R comes with various presaved datasets for practice (some are more interesting than others). To view these datasets, input the following.
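The command itself did not survive in this copy of the tutorial, so the line below is a best guess at what was meant: base R's `data()` function, called with no arguments, lists the datasets available in the loaded packages.

```r
# List the practice datasets that ship with R (and any loaded packages)
data()

# Assumption: restricting the list to the built-in "datasets" package
# is an optional extra, not something the tutorial asks for
available <- data(package = "datasets")
head(available$results[, c("Item", "Title")])
```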
For this tutorial, we will be looking at the datasets trees and ChickWeight. First, load or open these datasets.
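A minimal sketch of the loading step, assuming the standard `data()` call is what was intended here. In a default R session both datasets are usually available without this step, so treat it as a harmless safety net.

```r
# Load the two built-in datasets into the workspace
data(trees)
data(ChickWeight)
```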
To see the actual data contained by these datasets, just write the title of the dataset.
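Something like the following should do it. Typing the name prints every row, which is fine for trees; the `head()` call for ChickWeight is an addition here (not part of the original instructions) to keep the printout short.

```r
# Typing the dataset name prints the whole thing
trees

# ChickWeight is much longer, so peek at the first few rows instead
head(ChickWeight)
```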
- The trees dataset seems to contain three columns of measurements: Girth, Height and Volume.
- The ChickWeight dataset seems to involve little chicklets getting fed different diets and being weighed at various time points.
To find out more information about the datasets and to confirm our observations, put a question mark before the title of the dataset.
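For these two datasets, that would look like this:

```r
# Open the help page for each dataset
?trees
?ChickWeight
```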
Now, are you ready for the scatterplot?
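The plotting command is missing from this copy, so the call below is an assumption: base R's `pairs()` function is the usual way to draw a scatterplot matrix from a data frame.

```r
# Draw a scatterplot matrix of every column in trees against every other
data(trees)
pairs(trees)
```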
This is an example of a scatterplot matrix. The variables are written in a diagonal line from top left to bottom right. Then each variable is plotted against every other variable. For example, the middle square in the first column is an individual scatterplot of Girth and Height, with Girth as the X-axis and Height as the Y-axis. The same pair appears again in the middle of the top row, this time with the axes swapped. In essence, the boxes on the upper right hand side of the whole scatterplot are mirror images of the plots on the lower left hand side.
In this scatterplot, it is probably safe to say that there is a correlation between Girth and Volume (Go data! Confirming the obvious) because the points fall roughly along a line. There is probably less of a correlation between Height and Girth and between Height and Volume. More statistical analyses would be needed to confirm or deny this.
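As one rough first check (not a full statistical analysis, and not something the original tutorial runs), the base R `cor()` function gives the pairwise Pearson correlations. The variable name `trees_cor` is just a label picked for this sketch.

```r
# Pairwise Pearson correlation coefficients for the three measurements
data(trees)
trees_cor <- cor(trees)

# Round for easier reading; values closer to 1 mean a tighter linear trend
round(trees_cor, 2)
```

If the eyeball reading of the plot is right, the Girth and Volume entry should be the largest off-diagonal value.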
Now for ChickWeight.
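Assuming the same `pairs()` approach as for trees, the call would be the one below. Chick and Diet are factors, and `pairs()` plots them by their integer codes.

```r
# Scatterplot matrix of weight, Time, Chick and Diet
data(ChickWeight)
pairs(ChickWeight)
```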
This scatterplot matrix is unfortunately not as clean as the last one because Time, Chick and Diet only take discrete values. However, much can still be extracted from it about experimental design and possible outcomes (think about BS exercises you might have done for English or Art).
- Scatterplots related to Time are evenly distributed into columns or rows, suggesting that data was actually collected in a regimented fashion. (As in, data was collected at the times it should have been for all the Chick samples).
- There were about 50 chicks. The first 20 were on diet 1, and the next three groups of 10 were on diets 2, 3 and 4.
- Looking at Row 4, Column 1, there is a possibility that chicks on diet 3 gained more weight than chicks on diets 1, 2 or 4.
- Looking at Row 2, Column 1, it seems that chicks weighed about the same amount at the beginning of the experiment, but variation increased as time passed. In general, there is an increase in weight.
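The readings above are all eyeballed from the plot, so here is a rough sketch of how they could be checked with a few base R summaries. This is an add-on rather than part of the original walkthrough, and the names `cw`, `chicks` and `final` are made up for this example.

```r
data(ChickWeight)
cw <- as.data.frame(ChickWeight)  # plain data frame, drops the extra grouping classes

# How many chicks in total, and how many on each diet?
chicks <- unique(cw[, c("Chick", "Diet")])
nrow(chicks)
table(chicks$Diet)

# Mean weight per diet at the last recorded time point
final <- cw[cw$Time == max(cw$Time), ]
tapply(final$weight, final$Diet, mean)

# Spread of weight at each time point (does variation grow over time?)
tapply(cw$weight, cw$Time, sd)
```

Means at a single time point are only a hint, of course. A proper comparison of diets would need a model that accounts for repeated measurements on the same chick.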
There you have it!
- Scatterplot matrices are good for determining rough linear correlations of metadata that contain continuous variables.
- Scatterplot matrices are not so good for looking at discrete variables.