Using R for Introductory Statistics 3.3

August 11, 2010

(This article was first published on Digithead's Lab Notebook, and kindly contributed to R-bloggers)

…continuing our way through John Verzani’s Using R for Introductory Statistics. Previous installments: chapt1&2, chapt3.1, chapt3.2

Relationships in numeric data

If two data series have a natural pairing (x1,y1),…,(xn,yn), then we can ask, “What (if any) is the relationship between the two variables?” Scatterplots and correlation are first-line ways of assessing a bivariate data set.

Pearson’s correlation

Pearson’s correlation coefficient is calculated by dividing the covariance of the two variables by the product of their standard deviations. It ranges from 1 for perfectly correlated variables to -1 for perfectly anticorrelated variables; 0 means uncorrelated. Independent variables have a correlation coefficient close to 0, but the converse is not true, because the coefficient detects only linear dependence between two variables. [see Wikipedia entry on correlation]
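As a quick check of that definition, here is a minimal sketch on made-up data (the vectors x, y, and z are illustrative and not taken from the marathon data). It computes the coefficient by hand, compares it with cor(), and then builds a pair of variables that are completely dependent but not linearly related.

```r
# Made-up data for illustration only
x <- c(-3, -2, -1, 0, 1, 2, 3)
y <- 2 * x + c(0.5, -0.3, 0.1, 0, -0.2, 0.4, -0.5)

# Covariance divided by the product of the standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))
all.equal(r_manual, cor(x, y))   # agrees with the built-in

# z is fully determined by x, but the relationship is not linear
z <- x^2
cor(x, z)
```

Because x is symmetric around zero, the correlation between x and z comes out as zero even though z is a function of x.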

Question 3.19 concerns a sampling of 1000 New York Marathon runners. We’re asked whether we expect a correlation between age and finishing time.

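library(UsingR)    # provides the nym.2002 data set
attach(nym.2002)   # makes age, time, and gender available by name
cor(age, time)
[1] 0.1898672
cor(age, time, method="spearman")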
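
Spearman’s method is Pearson’s correlation applied to the ranks of the data rather than the raw values, so it picks up any monotonic relationship, not just a linear one. A small sketch with made-up vectors (a and b are illustrative names):

```r
a <- c(1, 2, 3, 4, 5, 6)
b <- exp(a)   # increases with a, but far from linearly

cor(a, b)                        # Pearson on the raw values
cor(a, b, method = "spearman")   # Spearman: the ranks agree perfectly

# Spearman is Pearson computed on the ranks
all.equal(cor(rank(a), rank(b)), cor(a, b, method = "spearman"))
```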

We discover a low correlation – good news for us wheezing old geezers. A scatterplot might show something. And we have the gender of each runner, so let’s use that, too.

First, let’s set ourselves up for a two panel plot.
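
One common way to do this is with par(). The sketch below assumes the two panels are stacked vertically, and it keeps the previous settings in a variable, here called op (an illustrative name), so they can be put back afterwards.

```r
# Assumption: two rows, one column. par() returns the old settings.
op <- par(mfrow = c(2, 1))
```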


Next let’s set up colors – pink for ladies, blue for guys – and throw in some transparency because a lot of data points are on top of each other.

blue = rgb(0,0,255,64, maxColorValue=255)
pink = rgb(255,192,203,128, maxColorValue=255)

color <- rep(blue, length(gender))
color[gender=='Female'] <- pink

In the first panel, draw the scatter plot.

plot(time, age, col=color, pch=19, main="NY Marathon", ylim=c(18,80), xlab="")

And in the second panel, break it down by gender. It’s a well-kept secret that outcol and outpch can be used to set the color and shape of the outliers in a boxplot.

boxplot(time ~ gender, horizontal=TRUE, col=c(pink, blue), outpch=19, outcol=c(pink, blue), xlab="finishing time (minutes)")

Now return our settings to normal for good measure.
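
If the old settings were saved when the layout was changed, passing them back to par() restores them. Otherwise, resetting mfrow to a single panel has the same effect on the layout. A sketch of the second option, with the first shown as a comment (the name op is hypothetical):

```r
# Option 1 (assumes the old settings were saved in a variable such as `op`):
#   par(op)
# Option 2: reset the layout explicitly to a single panel
par(mfrow = c(1, 1))
```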


Sure enough, there doesn’t seem to be much correlation between age and finishing time. Gender has an effect, although I’m sure the elite female runners would have little trouble dusting my slow booty off the trail.

It looks like we have fewer data points for women. Let’s check that. We can use table to count the number of times each level of a factor occurs, or in other words, count the number of males and females.

table(gender)
Female   Male 
   292    708

I’m still a little skeptical of our previous result – the low correlation between age and finishing time. Let’s look at the data binned by decade.

bins <- cut(age, include.lowest=TRUE, breaks=c(20,30,40,50,60,70,100), right=FALSE, labels=c('20s','30s','40s','50s','60s','70+'))
# the reversed ylim flips the axis, so faster times appear at the top
boxplot(time ~ bins, col=colorRampPalette(c('green','yellow','brown'))(6), ylim=c(570,140))
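
To see how cut() assigns the bins, here is a small sketch with a handful of made-up ages (not taken from the data set). With right=FALSE each interval includes its lower bound and excludes its upper one, so a 30-year-old lands in '30s'. Ages below the first break become NA, which means any runners under 20 would drop out of the binned boxplot.

```r
ages <- c(19, 20, 29, 30, 49, 50, 75)   # made-up values
decade <- cut(ages,
              breaks = c(20, 30, 40, 50, 60, 70, 100),
              right = FALSE, include.lowest = TRUE,
              labels = c('20s', '30s', '40s', '50s', '60s', '70+'))

data.frame(ages, decade)          # each age next to its bin
table(decade, useNA = "ifany")    # counts per bin, including NA
```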

It looks like you’re not washed up as a runner until your 50s. Things go downhill from there, but the trend doesn’t look very linear, so we shouldn’t be too surprised by our low r.

Coarser bins, splitting runners into young and old with 50 as the cut-off, reveal that there’s really no correlation in the younger group. In the older group, we start to see some. I suppose you could play with the numbers to find an optimum cut-off that maximizes the difference in correlation, though I’m not sure what the point of that would be.

y <- nym.2002[age<50,]
cor(y$age, y$time)
[1] -0.01148919
cor(y$age,y$time, method='spearman')
[1] -0.01512368
o <- nym.2002[age>=50,]
cor(o$age, o$time)
[1] 0.3813543
cor(o$age, o$time, method='spearman')
[1] 0.1980635

I ran a marathon once in my life. I think I was 30 and my time was a pokey 270 or so. My knees hurt for days afterwards, so I’m not sure I’d try it again. I do want to do a half, though. Gotta get back in shape for that…

More on Using R for Introductory Statistics
