Analysis of Winter Olympic Medal Data Using R

February 18, 2010
By

(This article was first published on Jeromy Anglim's Blog: Psychology and Statistics, and kindly contributed to R-bloggers)

The Winter Olympics are on. The Guardian's DataBlog has graciously compiled a database on Winter Olympic Medals. Thus, I thought I'd run a few quick analyses on the data in R. In this post I was hoping to show how one could quickly churn out some basic analyses (and answer some interesting questions) using R.

First, a disclaimer: I ran these analyses in about 45 minutes. Thus, I make no claims of perfect accuracy or in the source data provided by the Guardian. The data also does not include 2010 medals.

Below you will see:
  1. The R Console input and output
  2. The plots
  3. The source code on its own


CONSOLE INPUT AND OUTPUT
> # tips on reading a Google Spreadsheet:
> # http://blog.revolution-computing.com/2009/09/how-to-use-a-google-spreadsheet-as-data-in-r.html
> # Data taken from:"https://spreadsheets.google.com/ccc?key=0AgdO92JOXxAOdDVlaUpkNlB2WERtV3l1ZVFYbzllQWc"
> # http://www.guardian.co.uk/news/datablog/2010/feb/11/winter-olympics-medals-by-country
>
> googleLink <- "http://spreadsheets.google.com/pub?key=tsddww6vOYePkhPSxRpDeYw&single=true&gid=1&output=csv"
> medals <- read.csv(googleLink, stringsAsFactors = FALSE)
> savePlot <- TRUE # optional variable used to save or not save plots in code
>
> # remove rows that do not contain data
> medals$Year <- as.numeric(medals$Year)
Warning message:
NAs introduced by coercion
> medals <- medals[!is.na(medals$Year), ]
>
>
> # Quick look at data
> head(medals)
Year City Sport Discipline NOC Event Event.gender
1 1924 Chamonix Skating Figure skating AUT individual M
2 1924 Chamonix Skating Figure skating AUT individual W
3 1924 Chamonix Skating Figure skating AUT pairs X
4 1924 Chamonix Bobsleigh Bobsleigh BEL four-man M
5 1924 Chamonix Ice Hockey Ice Hockey CAN ice hockey M
6 1924 Chamonix Biathlon Biathlon FIN military patrol M
Medal
1 Silver
2 Gold
3 Gold
4 Bronze
5 Gold
6 Silver
> sapply(medals, function(x) cbind(sort(table(x), decreasing = TRUE)))
$Year
[,1]
2006 252
2002 234
1998 205
1994 183
1992 171
1988 138
1984 117
1980 115
1976 111
1968 106
1972 105
1964 103
1960 81
1956 72
1948 68
1952 67
1936 51
1924 49
1932 42
1928 41

$City
[,1]
Turin 252
Salt Lake City 234
Innsbruck 214
Nagano 205
Lillehammer 183
Albertville 171
Lake Placid 157
Calgary 138
Sarajevo 117
St. Moritz 109
Grenoble 106
Sapporo 105
Squaw Valley 81
Cortina d'Ampezzo 72
Oslo 67
Garmisch-Partenkirchen 51
Chamonix 49

$Sport
[,1]
Skiing 1060
Skating 758
Biathlon 162
Bobsleigh 133
Luge 108
Ice Hockey 69
Curling 21

$Discipline
[,1]
Speed skating 455
Cross Country S 399
Alpine Skiing 367
Figure skating 207
Biathlon 162
Bobsleigh 115
Ski Jumping 114
Luge 108
Short Track S. 96
Nordic Combined 84
Ice Hockey 69
Freestyle Ski. 54
Snowboard 42
Curling 21
Skeleton 18

$NOC
[,1]
NOR 280
USA 216
URS 194
AUT 185
GER 158
FIN 151
CAN 119
SUI 118
SWE 118
GDR 110
ITA 101
FRA 83
NED 78
RUS 76
FRG 41
CHN 33
JPN 32
KOR 31
TCH 25
EUN 23
GBR 21
EUA 19
CZE 10
LIE 9
POL 8
CRO 7
AUS 6
BLR 6
BUL 6
EST 6
HUN 6
BEL 5
KAZ 5
UKR 5
SLO 4
YUG 4
ESP 2
LUX 2
PRK 2
DEN 1
LAT 1
NZL 1
ROU 1
SVK 1
UZB 1

$Event
[,1]
individual 195
500m 133
1500m 111
downhill 97
slalom 96
1000m 94
giant slalom 90
5000m 78
singles 72
ice hockey 69
10km 60
50km 60
K90 individual (70m) 60
pairs 60
10000m 57
two-man 57
four-man 55
4x10km relay 51
15km 48
20km 45
4x7.5km relay 42
alpine combined 42
3000m 39
30km mass start 39
doubles 36
K120 individual (90m) 36
super-G 36
5km 30
moguls 30
4x5km relay 27
ice dancing 27
aerials 24
curling 21
10km pursuit 18
18km 18
Half-pipe 18
K120 team (90m) 18
Team 18
15km mass start 15
3000m relay 15
30km 15
3x5km relay 15
5000m relay 15
7.5km 15
Giant parallel slalom 12
Combined 10km + 15km pursuit 9
Combined 5km + 10km pursuit 9
12.5km pursuit 6
Alpine combined 6
giant-slalom 6
Snowboard Cross 6
Sprint 1,5km 6
sprint 1.5km 6
Team pursuit 6
Team sprint 6
12,5km mass start 3
3x7.5km relay 3
4x6km relay 3
5km pursuit 3
combined (4 events) 3
Combined 15 + 15km mass start 3
Combined 7.5 + 7.5km mass start 3
five-man 3
Individual 3
Individual sprint 3
military patrol 3
sprint 3

$Event.gender
[,1]
M 1386
W 802
X 123

$Medal
[,1]
Gold 774
Silver 773
Bronze 764

>
>
> # How many medals have been awarded in each Olympics?
> medalsByYear <- aggregate(medals$Year, list(Year = medals$Year), length)
> if (savePlot == TRUE) png("fig1.png")
> plot(x ~ Year, medalsByYear, ylim = c(0,max(x)),
+ ylab = "Total Medals Awarded", bty="l",
+ main = "Total Medals Awarded in Winter Olympics by Year")
> if (savePlot == TRUE) dev.off()
windows
2
>
> # How has the amount of medals awarded to males and females changed over the years?
> # Get data.
> medalsByYearByGender <- aggregate(medals$Year,
+ list(Year = medals$Year, Event.gender = medals$Event.gender), length)
> medalsByYearByGender <- medalsByYearByGender[medalsByYearByGender$Event.gender != "X", ]
>
> # Plot results.
> if (savePlot == TRUE) png("fig2.png")
> plot(x ~ Year, medalsByYearByGender[medalsByYearByGender$Event.gender == "M", ],
+ ylim = c(0,max(x)), pch = "m", col = "blue",
+ ylab = "Total Medals Awarded", bty="l",
+ main = "Total Medals Awarded in Winter Olympicsn by Gender and by Year")
> points(medalsByYearByGender[medalsByYearByGender$Event.gender == "W", "Year"],
+ medalsByYearByGender[medalsByYearByGender$Event.gender == "W", "x"],
+ col = "red", pch = "f")
> if (savePlot == TRUE) dev.off()
windows
2
>
> # Table of proportion female
> propFemalePerYear <- medalsByYearByGender[medalsByYearByGender$Event.gender == "W", "x"] / (
+ medalsByYearByGender[medalsByYearByGender$Event.gender == "W", "x"] +
+ medalsByYearByGender[medalsByYearByGender$Event.gender == "M", "x"])
> propFemalePerYear <- round(propFemalePerYear, 2)
> cbind(Year = medalsByYearByGender[medalsByYearByGender$Event.gender == "W", "Year"],
+ PropFemale = propFemalePerYear)
Year PropFemale
[1,] 1924 0.07
[2,] 1928 0.08
[3,] 1932 0.08
[4,] 1936 0.12
[5,] 1948 0.18
[6,] 1952 0.23
[7,] 1956 0.26
[8,] 1960 0.38
[9,] 1964 0.37
[10,] 1968 0.37
[11,] 1972 0.36
[12,] 1976 0.35
[13,] 1980 0.34
[14,] 1984 0.36
[15,] 1988 0.37
[16,] 1992 0.43
[17,] 1994 0.43
[18,] 1998 0.44
[19,] 2002 0.45
[20,] 2006 0.46
>
>
> # Which countries have won the most medals?
> sort(table(medals$NOC), dec = TRUE)

NOR USA URS AUT GER FIN CAN SUI SWE GDR ITA FRA NED RUS FRG CHN JPN KOR TCH EUN
280 216 194 185 158 151 119 118 118 110 101 83 78 76 41 33 32 31 25 23
GBR EUA CZE LIE POL CRO AUS BLR BUL EST HUN BEL KAZ UKR SLO YUG ESP LUX PRK DEN
21 19 10 9 8 7 6 6 6 6 6 5 5 5 4 4 2 2 2 1
LAT NZL ROU SVK UZB
1 1 1 1 1
>
>
> # Of the countries that have won more than 50 medals,
> # which have the highest percentage of gold medals?
> NOC50Plus <- names(table(medals$NOC)[table(medals$NOC) > 50])
> medalsSubset <- medals[medals$NOC %in% NOC50Plus, ]
> medalsByMedalByNOC <- prop.table(table(medalsSubset$NOC, medalsSubset$Medal), margin = 1)
> medalsByMedalByNOC <- medalsByMedalByNOC[order(medalsByMedalByNOC[, "Gold"],
+ decreasing = TRUE), c("Gold", "Silver", "Bronze")]
> round(medalsByMedalByNOC, 2)

Gold Silver Bronze
RUS 0.43 0.32 0.25
URS 0.40 0.29 0.30
GER 0.37 0.37 0.26
SWE 0.36 0.26 0.37
USA 0.36 0.37 0.27
ITA 0.36 0.31 0.34
GDR 0.35 0.33 0.32
NOR 0.35 0.35 0.30
SUI 0.32 0.31 0.36
NED 0.32 0.38 0.29
CAN 0.32 0.32 0.36
FRA 0.30 0.29 0.41
AUT 0.28 0.35 0.38
FIN 0.27 0.38 0.34
>
>
> # How many different countries have won medals by year?
> listOfYears <- unique(medals$Year)
> names(listOfYears) <- unique(medals$Year)
> totalNocByYear <- sapply(listOfYears, function(X)
+ length(table(medals[medals$Year == X, "NOC"])))
>
> # Table
> totalNocByYear
1924 1928 1932 1936 1948 1952 1956 1960 1964 1968 1972 1976 1980 1984 1988 1992
10 12 10 11 13 14 13 14 14 15 17 16 19 17 17 20
1994 1998 2002 2006
22 24 24 26
>
> # Plot
> if (savePlot == TRUE) png("fig3.png")
> plot(x= names(totalNocByYear), totalNocByYear,
+ ylim = c(0, max(totalNocByYear)),
+ xlab = "Year",
+ ylab = "Total Number of Countries",
+ bty = "l",
+ main = "Total Number of Countriesn Winning Medals By Year")
> if (savePlot == TRUE) dev.off()
windows
2
>
> # Which Countries have won a medal at every Olympics?
> propYearsOnePlusMedals <- apply(table(medals$NOC, medals$Year) > 0, 1, mean)
>
> #Answer
> names(propYearsOnePlusMedals[propYearsOnePlusMedals == 1.0])
[1] "AUT" "CAN" "FIN" "NOR" "SWE" "USA"
>
> # Table Sorted by Proportion of Olympics with a Medal
> cbind(sort(propYearsOnePlusMedals, decreasing = TRUE))
[,1]
AUT 1.00
CAN 1.00
FIN 1.00
NOR 1.00
SWE 1.00
USA 1.00
FRA 0.95
SUI 0.95
ITA 0.80
GBR 0.65
NED 0.65
TCH 0.55
JPN 0.50
GER 0.45
URS 0.45
FRG 0.35
GDR 0.30
HUN 0.30
CHN 0.25
KOR 0.25
POL 0.25
AUS 0.20
BEL 0.20
BLR 0.20
BUL 0.20
LIE 0.20
RUS 0.20
CZE 0.15
EUA 0.15
UKR 0.15
CRO 0.10
ESP 0.10
EST 0.10
KAZ 0.10
PRK 0.10
SLO 0.10
YUG 0.10
DEN 0.05
EUN 0.05
LAT 0.05
LUX 0.05
NZL 0.05
ROU 0.05
SVK 0.05
UZB 0.05


THE PLOTS






THE R SOURCE CODE
# tips on reading a Google Spreadsheet:
# http://blog.revolution-computing.com/2009/09/how-to-use-a-google-spreadsheet-as-data-in-r.html
# Data taken from:"https://spreadsheets.google.com/ccc?key=0AgdO92JOXxAOdDVlaUpkNlB2WERtV3l1ZVFYbzllQWc"
# http://www.guardian.co.uk/news/datablog/2010/feb/11/winter-olympics-medals-by-country

googleLink <- "http://spreadsheets.google.com/pub?key=tsddww6vOYePkhPSxRpDeYw&single=true&gid=1&output=csv"
medals <- read.csv(googleLink, stringsAsFactors = FALSE)
savePlot <- TRUE # optional variable used to save or not save plots in code

# remove rows that do not contain data
medals$Year <- as.numeric(medals$Year)
medals <- medals[!is.na(medals$Year), ]


# Quick look at data
head(medals)
sapply(medals, function(x) cbind(sort(table(x), decreasing = TRUE)))


# How many medals have been awarded in each Olympics?
medalsByYear <- aggregate(medals$Year, list(Year = medals$Year), length)
if (savePlot == TRUE) png("fig1.png")
plot(x ~ Year, medalsByYear, ylim = c(0,max(x)),
ylab = "Total Medals Awarded", bty="l",
main = "Total Medals Awarded in Winter Olympics by Year")
if (savePlot == TRUE) dev.off()

# How has the amount of medals awarded to males and females changed over the years?
# Get data.
medalsByYearByGender <- aggregate(medals$Year,
list(Year = medals$Year, Event.gender = medals$Event.gender), length)
medalsByYearByGender <- medalsByYearByGender[medalsByYearByGender$Event.gender != "X", ]

# Plot results.
if (savePlot == TRUE) png("fig2.png")
plot(x ~ Year, medalsByYearByGender[medalsByYearByGender$Event.gender == "M", ],
ylim = c(0,max(x)), pch = "m", col = "blue",
ylab = "Total Medals Awarded", bty="l",
main = "Total Medals Awarded in Winter Olympicsn by Gender and by Year")
points(medalsByYearByGender[medalsByYearByGender$Event.gender == "W", "Year"],
medalsByYearByGender[medalsByYearByGender$Event.gender == "W", "x"],
col = "red", pch = "f")
if (savePlot == TRUE) dev.off()

# Table of proportion female
propFemalePerYear <- medalsByYearByGender[medalsByYearByGender$Event.gender == "W", "x"] / (
medalsByYearByGender[medalsByYearByGender$Event.gender == "W", "x"] +
medalsByYearByGender[medalsByYearByGender$Event.gender == "M", "x"])
propFemalePerYear <- round(propFemalePerYear, 2)
cbind(Year = medalsByYearByGender[medalsByYearByGender$Event.gender == "W", "Year"],
PropFemale = propFemalePerYear)


# Which countries have won the most medals?
sort(table(medals$NOC), dec = TRUE)


# Of the countries that have won more than 50 medals,
# which have the highest percentage of gold medals?
NOC50Plus <- names(table(medals$NOC)[table(medals$NOC) > 50])
medalsSubset <- medals[medals$NOC %in% NOC50Plus, ]
medalsByMedalByNOC <- prop.table(table(medalsSubset$NOC, medalsSubset$Medal), margin = 1)
medalsByMedalByNOC <- medalsByMedalByNOC[order(medalsByMedalByNOC[, "Gold"],
decreasing = TRUE), c("Gold", "Silver", "Bronze")]
round(medalsByMedalByNOC, 2)


# How many different countries have won medals by year?
listOfYears <- unique(medals$Year)
names(listOfYears) <- unique(medals$Year)
totalNocByYear <- sapply(listOfYears, function(X)
length(table(medals[medals$Year == X, "NOC"])))

# Table
totalNocByYear

# Plot
if (savePlot == TRUE) png("fig3.png")
plot(x= names(totalNocByYear), totalNocByYear,
ylim = c(0, max(totalNocByYear)),
xlab = "Year",
ylab = "Total Number of Countries",
bty = "l",
main = "Total Number of Countriesn Winning Medals By Year")
if (savePlot == TRUE) dev.off()

# Which Countries have won a medal at every Olympics?
propYearsOnePlusMedals <- apply(table(medals$NOC, medals$Year) > 0, 1, mean)

#Answer
names(propYearsOnePlusMedals[propYearsOnePlusMedals == 1.0])

# Table Sorted by Proportion of Olympics with a Medal
cbind(sort(propYearsOnePlusMedals, decreasing = TRUE))

To leave a comment for the author, please follow the link and comment on his blog: Jeromy Anglim's Blog: Psychology and Statistics.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Tags: ,

Comments are closed.