What’s the Scatter?
A scatter plot displays the values of 2 variables for a set of data, and it is a very useful way to visualize data during exploratory data analysis, especially (though not exclusively) when you are interested in the relationship between a predictor variable and a target variable. Sometimes, such data come with categorical labels that have important meanings, and the visualization of the relationship can be enhanced when these labels are attached to the data.
It is common practice to use a legend to label data that belong to a group, as I illustrated in a previous post on bar charts and pie charts. However, what if every datum has a unique label, and there are many data in the scatter plot? A legend would add unnecessary clutter in such situations. Instead, it would be useful to write the label of each datum near its point in the scatter plot. I will show how to do this in R, illustrating the code with a built-in data set called LifeCycleSavings.
The LifeCycleSavings Data Set
A data set containing such labels is LifeCycleSavings, a built-in data set in R. Each row contains economic or demographic data for a particular country. In this case, the country is a unique categorical label for each datum. I will plot aggregate personal savings (sr) as a function of real per-capita disposable income (dpi), and I will label each datum with its associated country. Note that I am not saying anything about a predictive relationship in this context; I am simply trying to explore the data in these 2 dimensions, and I may eventually find clustering to be useful for further analysis, as I alluded to earlier in the introduction.
Here are the first 9 data, just to give you a sense of what this data set looks like.
> LifeCycleSavings[1:9,]
             sr pop15 pop75     dpi ddpi
Australia 11.43 29.35  2.87 2329.68 2.87
Austria   12.07 23.32  4.41 1507.99 3.93
Belgium   13.17 23.80  4.43 2108.47 3.82
Bolivia    5.75 41.89  1.67  189.13 0.22
Brazil    12.88 42.19  0.83  728.47 4.56
Canada     8.79 31.72  2.85 2982.88 2.43
Chile      0.60 39.74  1.34  662.86 2.67
China     11.90 44.75  0.67  289.52 6.51
Colombia   4.98 46.64  1.06  276.65 3.08
(It actually isn’t nicely aligned in the output; I manually aligned it for you to make it easier to see each column.)
The plot() and text() Functions
First, let’s use the plot() function to plot the points.
##### Labelling Points in a Scatter Plot
##### By Eric Cai - The Chemical Statistician
plot(sr ~ dpi,
     xlim = c(0, 3500),
     xlab = 'Real Per-Capita Disposable Income',
     ylab = 'Aggregate Personal Savings',
     main = 'Intercountry Life-Cycle Savings Data',
     data = LifeCycleSavings[1:9, ])
Then, let’s use the text() function to write each country’s name next to its point.
with(LifeCycleSavings[1:9, ],
     text(sr ~ dpi, labels = row.names(LifeCycleSavings[1:9, ]), pos = 4))
The value for the “labels” option looks complicated, but it’s just a vector of strings: the row names of the first 9 rows of the LifeCycleSavings data frame, which I extracted using row.names(), a very useful function!
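To see what that vector contains, a quick check like the following may help. The variable name country_labels is my own choice for illustration and is not part of the plotting script.

```r
# Extract the row names (country names) of the first 9 rows.
# "country_labels" is a hypothetical name used only in this sketch.
country_labels <- row.names(LifeCycleSavings[1:9, ])

# row.names() returns a character vector with one entry per row
print(country_labels)
print(is.character(country_labels))
print(length(country_labels))
```

Because the labels are taken from the same 9 rows that are plotted, each label lines up with its own point.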
The “pos” option specifies the position of the text relative to the point. I have chosen “4” because I want the text to be to the right of the point. The possible values are:
1 = below
2 = left
3 = above
4 = right
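A small sketch can make these four values easier to remember. It draws one point and writes a label at each position around it; the labels, axis limits and title are assumptions made only for this illustration.

```r
# Hypothetical labels describing each value of pos
pos_labels <- c('1 = below', '2 = left', '3 = above', '4 = right')

# Draw a single point in the middle of an otherwise empty plot
plot(0, 0, pch = 19, xlim = c(-1, 1), ylim = c(-1, 1),
     xlab = '', ylab = '', main = 'Values of pos in text()')

# Write one label at each of the four positions around that point
for (p in 1:4) {
  text(0, 0, labels = pos_labels[p], pos = p)
}
```

If some labels overlap in a real plot, trying a different value of pos for those points is one simple thing to experiment with.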
Exporting the Image as a PNG File
Finally, let’s sandwich the two lines of plotting functions with png() and dev.off() to export the image as a PNG file into my chosen directory. Here is the entire script.
png('Insert Your Directory Path Here/savings.png')
plot(sr ~ dpi,
     xlim = c(0, 3500),
     xlab = 'Real Per-Capita Disposable Income',
     ylab = 'Aggregate Personal Savings',
     main = 'Intercountry Life-Cycle Savings Data',
     data = LifeCycleSavings[1:9, ])
with(LifeCycleSavings[1:9, ],
     text(sr ~ dpi, labels = row.names(LifeCycleSavings[1:9, ]), pos = 4))
dev.off()
Here is the plot.
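If you would like to try the export without filling in a directory path first, here is a minimal sketch that writes to a temporary file and then checks that the file was created. The names first_nine and png_file, the use of tempfile(), and the width and height values are all assumptions for this illustration; they are not settings from the script above. It also assumes that your installation of R supports the png() device.

```r
# Keep the first 9 rows in their own data frame (hypothetical name)
first_nine <- LifeCycleSavings[1:9, ]

# Write to a temporary file so that no directory path has to be filled in
png_file <- tempfile(fileext = '.png')

# Open the PNG device; the dimensions in pixels are assumed values
png(png_file, width = 800, height = 600)
plot(sr ~ dpi, xlim = c(0, 3500), data = first_nine)
with(first_nine, text(dpi, sr, labels = row.names(first_nine), pos = 4))
dev.off()   # close the device so that the file is actually written

# Confirm that the file now exists and is not empty
print(file.exists(png_file))
print(file.info(png_file)$size > 0)
```

Forgetting dev.off() is an easy mistake to make; until the device is closed, the PNG file may be empty or incomplete.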
Why Not attach()?
I could have used the attach() function to add this data set to the search path in R, so that any variable in this data set can be called by simply entering its name. (Of course, it’s good practice to undo this with the detach() function once you are done with the data set.) This would have made the plotting code simpler. However, as Nick Horton on R Bloggers points out, this is not a recommended practice.
The alternative script is this:
attach(LifeCycleSavings[1:9,])
png('Insert Your Directory Path Here/savings.png')
plot(dpi, sr,
     xlim = c(0, 3500),
     xlab = 'Real Per-Capita Disposable Income',
     ylab = 'Aggregate Personal Savings',
     main = 'Intercountry Life-Cycle Savings Data')
text(dpi, sr, labels = row.names(LifeCycleSavings[1:9,]), pos = 4)
dev.off()
detach(LifeCycleSavings[1:9,])
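One practical difference between the two approaches can be checked directly: with() evaluates an expression using the columns of a data frame without adding anything to the search path, so there is nothing to detach afterwards. The sketch below is only my own small illustration of that point, not a summary of the argument in the post mentioned above; the variable names are hypothetical.

```r
# Keep the first 9 rows in their own data frame (hypothetical name)
first_nine <- LifeCycleSavings[1:9, ]

# Record the search path before and after using with()
search_before <- search()
mean_sr <- with(first_nine, mean(sr))   # sr is found inside first_nine
search_after <- search()

# with() leaves the search path untouched
print(identical(search_before, search_after))
print(mean_sr)
```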