Using R for Introductory Statistics 3.3

Posted on August 11, 2010 by Christopher Bare in R bloggers

[This article was first published on Digithead's Lab Notebook, and kindly contributed to R-bloggers].

…continuing our way through John Verzani’s Using R for Introductory Statistics. Previous installments: chapt1&2, chapt3.1, chapt3.2

Relationships in numeric data

If two data series have a natural pairing (x₁,y₁),…,(xₙ,yₙ), then we can ask, “What (if any) is the relationship between the two variables?” Scatterplots and correlation are first-line ways of assessing a bivariate data set.

Pearson’s correlation

Pearson’s correlation coefficient is calculated by dividing the covariance of the two variables by the product of their standard deviations. It ranges from 1 for perfectly correlated variables to -1 for perfectly anticorrelated variables; 0 means uncorrelated. Independent variables have a correlation coefficient of 0 (and a sample estimate close to 0), but the converse is not true, because the correlation coefficient detects only linear dependence between two variables. [see Wikipedia entry on correlation]
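To make the definition concrete, here is a minimal sketch in base R. The data are simulated (an assumption for illustration, not the marathon data), and the variable names are made up. It computes r by hand from the covariance and standard deviations, then shows a case of perfect but nonlinear dependence where r comes out essentially zero.

```r
# Simulated data for illustration only; not the marathon data set.
set.seed(1)
x <- rnorm(200)
y <- 0.5 * x + rnorm(200)

# Pearson's r by hand: covariance over the product of standard deviations.
r_manual  <- cov(x, y) / (sd(x) * sd(y))
r_builtin <- cor(x, y)
all.equal(r_manual, r_builtin)

# Perfect but nonlinear dependence: v is fully determined by u,
# yet a parabola on a symmetric range has no linear trend.
u <- seq(-1, 1, length.out = 101)
v <- u^2
r_parabola <- cor(u, v)
r_parabola   # zero up to floating-point error
```

The second half is the "converse is not true" point: a correlation near zero rules out a linear relationship, not a relationship.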

Question 3.19 concerns a sampling of 1000 New York Marathon runners. We’re asked whether we expect a correlation between age and finishing time.

library(UsingR)   # provides the nym.2002 data set
attach(nym.2002)
cor(age, time)
[1] 0.1898672
cor(age, time, method="spearman")
[1] 0.1119944
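The method="spearman" call above asks for Spearman's rank correlation, which is Pearson's correlation computed on the ranks of the values rather than the values themselves. A small sketch on simulated data (an assumption, chosen to be monotone but far from linear) shows the equivalence:

```r
# Simulated, strongly nonlinear but roughly monotone data (illustration only).
set.seed(2)
x <- rexp(100)
y <- x^3 + rnorm(100, sd = 0.1)

rho_builtin <- cor(x, y, method = "spearman")
rho_manual  <- cor(rank(x), rank(y))   # Pearson's r applied to the ranks
all.equal(rho_builtin, rho_manual)
```

Because it only looks at ordering, Spearman's coefficient is less sensitive to outliers and to curvature than Pearson's.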

We discover a low correlation – good news for us wheezing old geezers. A scatterplot might show something. And we have the gender of each runner, so let’s use that, too.

First, let’s set ourselves up for a two-panel plot.

par(mfrow=c(2,1))
par(mar=c(2,4,4,2)+0.1)

Next let’s set up colors – pink for ladies, blue for guys – and throw in some transparency because a lot of data points are on top of each other.

blue <- rgb(0,0,255,64, maxColorValue=255)
pink <- rgb(255,192,203,128, maxColorValue=255)

color <- rep(blue, length(gender))
color[gender=='Female'] <- pink
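The same per-point color vector can be built in one step with ifelse. This is only an alternative sketch; the tiny gender factor below is a hypothetical stand-in for the attached column so that the snippet runs on its own.

```r
# Hypothetical stand-in for the attached gender column.
gender <- factor(c("Male", "Female", "Male", "Female", "Male"))

# The fourth argument to rgb() is alpha (opacity) on the same 0-255 scale.
blue <- rgb(0, 0, 255, 64, maxColorValue = 255)
pink <- rgb(255, 192, 203, 128, maxColorValue = 255)

# One vectorized step: pink where Female, blue otherwise.
color_vec <- ifelse(gender == "Female", pink, blue)

# The two-step version from the text, for comparison.
color_rep <- rep(blue, length(gender))
color_rep[gender == "Female"] <- pink
identical(color_vec, color_rep)
```

Either way the result is a character vector of hex colors, one per runner, which plot() accepts directly as col.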

In the first panel, draw the scatterplot.

plot(time, age, col=color, pch=19, main="NY Marathon", ylim=c(18,80), xlab="")

And in the second panel, break it down by gender. It's a well-kept secret that outcol and outpch can be used to set the color and shape of the outliers in a boxplot.

par(mar=c(5,4,1,2)+0.1)
boxplot(time ~ gender, horizontal=TRUE, col=c(pink, blue), outpch=19, outcol=c(pink, blue), xlab="finishing time (minutes)")

Now return our settings to normal for good measure.

par(mar=c(5,4,4,2)+0.1)
par(mfrow=c(1,1))
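Resetting by hand works when you know the defaults. A common alternative, sketched here, is to save what par returns when you change settings and hand that back later. The pdf(NULL) line is only there so the sketch runs without opening a window; in an interactive session you would leave it out.

```r
pdf(NULL)   # off-screen device with no output file, for this sketch only

# par() returns the previous values of the settings it changes.
op <- par(mfrow = c(2, 1), mar = c(2, 4, 4, 2) + 0.1)
# ... draw the two panels here ...
par(op)     # restore whatever was in effect before

restored <- par("mfrow")
restored
invisible(dev.off())
```

This restores the settings that were actually in effect, which is safer if they were not the defaults to begin with.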

Sure enough, there doesn't seem to be much correlation between age and finishing time. Gender has an effect, although I'm sure the elite female runners would have little trouble dusting my slow booty off the trail.

It looks like we have fewer data points for women. Let's check that. We can use table to count the number of times each level of a factor occurs, or in other words, count the number of males and females.

table(gender)
gender
Female   Male 
   292    708
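Counts are often easier to read as proportions. As a small sketch, the factor below is rebuilt from the counts printed above (so it stands in for the real column), and prop.table converts the table to shares.

```r
# Rebuild a factor with the counts reported above: 292 women, 708 men.
gender <- factor(rep(c("Female", "Male"), times = c(292, 708)))

counts <- table(gender)
shares <- prop.table(counts)   # each count divided by the total
round(100 * shares, 1)         # percentages
```

So women make up a bit under a third of this sample.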

I'm still a little skeptical of our previous result - the low correlation between age and finishing time. Let's look at the data binned by decade.

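# right=FALSE makes the intervals closed on the left: [20,30), [30,40), ...
bins <- cut(age, include.lowest=TRUE, breaks=c(20,30,40,50,60,70,100), right=FALSE, labels=c('20s','30s','40s','50s','60s','70+'))
# ylim runs from high to low, which flips the axis so faster times plot higher
boxplot(time ~ bins, col=colorRampPalette(c('green','yellow','brown'))(6), ylim=c(570,140))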
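The boundary behavior of cut is easy to get wrong, so here is a sketch with a handful of made-up ages (an assumption, picked to sit on the edges) run through the same breaks and labels:

```r
# Made-up ages chosen to sit on and around the bin edges.
ages <- c(18, 20, 29, 30, 69, 70, 100)

bins <- cut(ages, breaks = c(20, 30, 40, 50, 60, 70, 100),
            right = FALSE, include.lowest = TRUE,
            labels = c("20s", "30s", "40s", "50s", "60s", "70+"))

data.frame(age = ages, bin = bins)
```

With right = FALSE, an age of exactly 30 lands in the 30s, and include.lowest = TRUE closes the last interval so that 100 still counts as 70+. Anything below the first break, such as 18 here, becomes NA and silently drops out of the boxplot. Running sum(is.na(bins)) on the real data would show whether any runners are lost that way.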

It looks like you're not washed up as a runner until your 50s. Things go downhill from there, but the trend doesn't look very linear, so we shouldn't be too surprised by our low r.

Coarser bins, old and young with 50 as our cut-off, reveal that there's really no correlation in the younger group. In the older group, we're starting to see some correlation. I suppose you could play with the numbers to find an optimum cut-off that maximizes the difference in correlation. Not sure what the point of that would be.

y <- nym.2002[age<50,]
cor(y$age,y$time)
[1] -0.01148919
cor(y$age,y$time, method='spearman')
[1] -0.01512368


o <- nym.2002[age>=50,]
cor(o$age, o$time)
[1] 0.3813543
cor(o$age, o$time, method='spearman')
[1] 0.1980635
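To see how a flat-then-rising pattern plays out in per-group correlations, here is a sketch on simulated data. Everything in it is invented for illustration (the baseline of 240 minutes, the slope, the noise level), so the numbers will not match the marathon results. It also shows split and sapply as a way to compute the correlation for each group in one pass.

```r
# Simulated runners: finishing time is flat before 50 and rises after.
# All parameters here are assumptions made up for the illustration.
set.seed(42)
n    <- 1000
age  <- runif(n, 20, 80)
time <- 240 + ifelse(age >= 50, 3 * (age - 50), 0) + rnorm(n, sd = 20)

# Label each runner, split the data by label, and correlate within groups.
group <- ifelse(age < 50, "under 50", "50 and over")
r_by_group <- sapply(split(data.frame(age, time), group),
                     function(d) cor(d$age, d$time))
r_by_group
```

The younger group shows essentially no correlation while the older group shows a clear one, the same qualitative picture as the split above.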

I ran a marathon once in my life. I think I was 30 and my time was a pokey 270 minutes or so. My knees hurt for days afterwards, so I'm not sure I'd try it again. I do want to do a half, though. Gotta get back in shape for that...

More on Using R for Introductory Statistics
