R Tutorial Series: Scatterplots

November 12, 2009

(This article was first published on R Tutorial Series, and kindly contributed to R-bloggers)

A scatterplot is a useful way to visualize the relationship between two variables. Similar to correlations, scatterplots are often used to make initial diagnoses before any statistical analyses are conducted. This tutorial will explore the ways in which R can be used to create scatterplots.

Tutorial Files

Before we start, you may want to download the sample data (.csv) used in this tutorial. Be sure to right-click and save the file to your R working directory. This dataset contains pretest and posttest scores for 66 subjects on a series of reading comprehension tests (Moore & McCabe, 1989). Note that all code samples in this tutorial assume that the data have already been read into an R variable (named datavar in the examples) and that the variable has been attached.
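For readers who have not yet loaded the data, the setup assumed above might look roughly like the following sketch. The file name of the download is not given here, so the code writes a small temporary CSV with made-up placeholder scores as a stand-in; in practice you would pass the path of the downloaded file to read.csv() instead. The column names PRE1 and POST1 and the variable name datavar follow the examples in this tutorial.

```r
# Stand-in for the downloaded file: the values below are made-up
# placeholders, not the tutorial's data
csv_path <- tempfile(fileext = ".csv")
writeLines(c("PRE1,POST1", "6,5", "9,9", "5,3"), csv_path)

# Read the CSV into a data frame
datavar <- read.csv(csv_path)

# Attach the data frame so its columns can be referenced by name alone
attach(datavar)

# The columns are now visible without the datavar$ prefix
head(PRE1)
```

Once the data frame is attached, a call such as plot(PRE1, POST1) can find both columns directly.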

Plotting Two Variables

The simplest way to create a scatterplot is to directly graph two variables using the default settings. In R, this can be accomplished with the plot(XVAR, YVAR) function, where XVAR is the variable to plot along the x-axis and YVAR is the variable to plot along the y-axis. Suppose that we want to get a picture of the relationship between pretest 1 (PRE1) and posttest 1 (POST1). The following example demonstrates how to use the plot(XVAR, YVAR) function to visualize this relationship.

  #create a scatterplot of Y on X using plot(XVAR, YVAR)
  #what does the relationship between pretest 1 and posttest 1 look like?
  plot(PRE1, POST1)

The output of the preceding function is pictured below.
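The call above works because the data have been attached. If you prefer not to attach, the same plot can be requested in at least two other ways, sketched below. The scores are made-up placeholders rather than the tutorial's data, and datavar is simply the variable name used elsewhere in this tutorial.

```r
# Placeholder scores for illustration only; not the tutorial's data
datavar <- data.frame(PRE1 = c(6, 9, 5, 8), POST1 = c(5, 9, 3, 8))

# Option 1: refer to the columns explicitly with $ instead of attaching
plot(datavar$PRE1, datavar$POST1)

# Option 2: formula interface, read as "POST1 as a function of PRE1"
# (note that the y variable comes first in a formula)
plot(POST1 ~ PRE1, data = datavar)
```

Both calls put PRE1 on the x-axis and POST1 on the y-axis; they differ only in how the columns are located.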

Plotting All Variables

When beginning to analyze a dataset, researchers often want to get a complete picture of all relationships, rather than just a single one. Conveniently, the plot() function can also be run on an entire set of data. The format for this operation is plot(DATAVAR), where DATAVAR is the name of the R variable containing the data. Suppose now that our interest is in visualizing all of the scatterplots at once, in order to diagnose the various relationships present in our data. The following example demonstrates how to use the plot(DATAVAR) function.

  #create scatterplots of all variables using plot(DATAVAR)
  #what do all of the relationships in the data look like?
  plot(datavar)

The output of the preceding function is pictured below.

Note that the image above has been resized to fit on this page. In the R graphics window (the Quartz window on a Mac), the scatterplots can be made much larger for easier viewing.
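For a data frame with more than two columns, plot() draws a scatterplot matrix, which is also what the pairs() function produces. The sketch below uses pairs() directly and then narrows the matrix to a subset of columns, which can help when the full matrix is too crowded to read. The scores are made-up placeholders, and the PRE2 column name is an assumption added for illustration.

```r
# Placeholder scores for illustration only; not the tutorial's data
# (PRE2 is an assumed column name)
datavar <- data.frame(
  PRE1  = c(6, 9, 5, 8, 7),
  POST1 = c(5, 9, 3, 8, 6),
  PRE2  = c(4, 6, 3, 5, 5)
)

# Scatterplot matrix of every pair of columns
pairs(datavar)

# Restrict the matrix to the columns of interest
pairs(datavar[, c("PRE1", "POST1")])
```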

Custom Plotting

Additional plot() Arguments

Up to this point, we have been using the default values for all of our scatterplots' elements. However, R also allows for the customization of scatterplots. In addition to x and y axis variables, the plot() function also accepts the following arguments ("The Default Scatterplot Function", n.d.).

  • main: the title for the plot (displayed at the top)
  • sub: the subtitle for the plot (displayed at the bottom)
  • xlim: the x-axis scale; uses the format c(min, max); automatically determined by default
  • ylim: the y-axis scale; uses the format c(min, max); automatically determined by default
  • xlab: the x-axis title
  • ylab: the y-axis title
  • Even more arguments are accepted by the plot() function. Take a look at the referenced page if you wish to explore further options.

Now let's recreate the original plot depicting the relationship between pretest 1 and posttest 1 with more detailed and meaningful parameters.

  #create a detailed scatterplot of Y on X incorporating the optional arguments of the plot() function
  #set axis scales for x and y to range between 0 and 20
  #set main title and subtitle
  #set x and y axis labels
  plot(PRE1, POST1, xlim = c(0, 20), ylim = c(0, 20),
       main = "Posttest 1 on Pretest 1", sub = "A Scattered Tale",
       xlab = "Pretest 1 Score", ylab = "Posttest 1 Score")

The output of the preceding function is pictured below.
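The limits of 0 and 20 in the example above were typed in by hand. As an alternative sketch, the c(min, max) pair can be computed from the data with range(), so that both axes share one scale without hard-coding the numbers. The scores and the variable names pre, post, and lims are placeholders for illustration only.

```r
# Placeholder scores for illustration only; not the tutorial's data
pre  <- c(6, 9, 5, 8, 7)
post <- c(5, 9, 3, 8, 6)

# Smallest and largest value across both variables, as c(min, max)
lims <- range(c(pre, post))

# Use the same limits on both axes so the scales are directly comparable
plot(pre, post, xlim = lims, ylim = lims,
     xlab = "Pretest 1 Score", ylab = "Posttest 1 Score")
```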

Advanced Plotting

There are numerous graphical arguments available to functions in R. This tutorial addresses just a few of the common aesthetic options ("Set or Query Graphical Parameters", n.d.).

  • col: determines the colors used for points and lines; accepts character strings of color names (e.g. "red", "green", etc.)
  • pch: the type of point to use (e.g. circle, square, triangle, etc.); accepts values 0-25 for symbols and 32-255 for characters
  • cex: the amount to scale the size of points; accepts a numeric value; default is 1
  • lty: defines the line type; accepts various character strings (e.g. "solid", "dashed", "dotted", etc.)
  • lwd: defines the line width; accepts a positive number; default is 1

Many more graphical parameters are available. Take a look at the referenced page if you wish to explore further options.
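To get a feel for the pch values listed above, it can help to draw every symbol once. The following is a minimal sketch of such a reference chart; the variable name codes and the layout choices are illustrative, not part of the original tutorial.

```r
# Symbol codes 0 through 25
codes <- 0:25

# Draw each symbol at x = its own code, all on one horizontal line;
# bg supplies the fill color for symbols 21 through 25
plot(codes, rep(1, length(codes)), pch = codes, cex = 2,
     col = "black", bg = "orange", ylim = c(0.5, 1.5),
     xlab = "pch value", ylab = "", yaxt = "n",
     main = "Point symbols 0-25")

# Print each code beneath its symbol
text(codes, 0.8, labels = codes, cex = 0.7)
```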

Now let's recreate the plot of posttest 1 on pretest 1 yet again, but this time with the inclusion of customized aesthetic parameters.

  #create a scatterplot of Y on X incorporating the custom aesthetic parameters of the plot() function
  #set point colors to dark green, red, and orange
  #set point markers to circle, square, and diamond
  #set point size to three times the default
  #set line type to solid and line width to three times the default
  #(in a points-only plot, lwd thickens the symbol outlines and lty has no visible effect)
  plot(PRE1, POST1, xlim = c(0, 20), ylim = c(0, 20),
       main = "Posttest 1 on Pretest 1", sub = "A Scattered Tale",
       xlab = "Pretest 1 Score", ylab = "Posttest 1 Score",
       col = c("dark green", "red", "orange"), pch = c(21, 22, 23),
       cex = 3, lty = "solid", lwd = 3)

The output of the preceding function is pictured below.

Note that the c() function is used for a number of the parameters in the plot() function above. This allows one to define multiple values as a "vector" that can be fed into a single argument. For example, if one wanted to use only a single color, then col = "red" would be acceptable. However, to use multiple colors, all items must be placed into a vector such as col = c("red", "green", "blue"). Without using a vector for multiple colors, as in col = "red", "green", "blue", an error would occur because the colors would be treated as separate arguments rather than a single entity.
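One consequence of passing a vector is worth seeing concretely: when the vector is shorter than the number of points, R recycles its values in order across the points. The sketch below shows that assignment explicitly with rep_len() before plotting. The scores and the names pre, post, point_cols, and recycled are placeholders for illustration only.

```r
# Placeholder scores for illustration only; not the tutorial's data
pre  <- c(6, 9, 5, 8, 7, 4, 10)
post <- c(5, 9, 3, 8, 6, 4, 11)

# Three colors for seven points
point_cols <- c("darkgreen", "red", "orange")

# Spell out the recycling that happens with a short col vector:
# the colors repeat in order until every point has one
recycled <- rep_len(point_cols, length(pre))
print(recycled)

# Passing the short vector gives the same coloring as the recycled one
plot(pre, post, col = point_cols, pch = 16)
```

Because the colors follow the row order of the data, they only carry meaning if that order does; otherwise they are purely decorative.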

Complete Plot Examples

To see a complete example of how scatterplots can be created in R, please download the plot examples (.txt) file.

Even More Visualizations

R has much more sophisticated graphic capabilities than have been demonstrated in this tutorial. In fact, opportunities exist to make very complex and unique visuals. To see examples of the kinds of charts that can be generated with R, I recommend that you visit the R Graph Gallery (François, 2006).

References

François, R. (2006). R graph gallery: Enhance your data visualization with R. Retrieved November 11, 2009 from http://addictedtor.free.fr/graphiques

Moore, D., and McCabe, G. (1989). Introduction to the practice of statistics [Data File]. Retrieved October 27, 2009 from http://lib.stat.cmu.edu/DASL/Datafiles/ReadingTestScores.html

Set or Query Graphical Parameters. (n.d.). Retrieved November 11, 2009 from http://sekhon.berkeley.edu/graphics/html/par.html

The Default Scatterplot Function. (n.d.). Retrieved November 11, 2009 from http://sekhon.berkeley.edu/graphics/html/plotdefault.html
