24 Days of R: Day 10

[This article was first published on PirateGrunt » R, and kindly contributed to R-bloggers.]

How often is someone nominated for an Academy Award? Who has been nominated most often? Is there a difference between leading and supporting roles? Important questions. To answer them, I'm using a list of Academy Award nominees and winners from aggdata.com, which offers a few free data sets. We'll open the file, do some basic cleanup and then have a look at the results for Michael Caine. Note that the data only run through 2010.

dfAwards = read.csv("./Data/academy_awards.csv", stringsAsFactors = FALSE)
dfAwards = dfAwards[, 1:5]
dfAwards$Year = as.numeric(substr(dfAwards$Year, 1, 4))
colnames(dfAwards) = gsub(".", "", colnames(dfAwards), fixed = TRUE)
dfAwards$Won = dfAwards$Won == "YES"

dfCaine = subset(dfAwards, Nominee == "Michael Caine")
row.names(dfCaine) = NULL

FirstNominated = min(dfCaine$Year)
FirstWon = min(dfCaine$Year[dfCaine$Won == TRUE])
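The two minimums above are enough to work out how long a nominee waited for a first win. Here is a minimal sketch of that arithmetic on a made-up nominee; the data frame `dfOne` and its values are hypothetical and are not taken from the aggdata.com file.

```r
# Hypothetical nomination history for one person (illustrative values only).
dfOne = data.frame(
  Year = c(1970, 1975, 1983, 1990),
  Won  = c(FALSE, FALSE, TRUE, TRUE)
)

NumNominations = nrow(dfOne)          # one row per nomination
NumWins = sum(dfOne$Won)              # TRUE counts as 1
FirstNominated = min(dfOne$Year)
FirstWon = min(dfOne$Year[dfOne$Won])
YearsToFirstWin = FirstWon - FirstNominated
```

One caveat: for someone who has never won, `min()` is called on an empty vector, which returns `Inf` with a warning, so that case probably deserves a check before the subtraction.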

Michael Caine has been nominated 6 times and has won twice. It took 20 years from his first nomination for him to win his first award. That's a long time. My guess is that actors receive multiple nominations more often than actresses do, and that their nominations are spread over a longer period of time. I'll split the data into actor and actress categories to test this.

dfAwards$Gender = "Other"
dfAwards$Gender[grepl("Actor", dfAwards$Category)] = "Actor"
dfAwards$Gender[grepl("Actress", dfAwards$Category)] = "Actress"
dfActors = subset(dfAwards, Gender != "Other")
row.names(dfActors) = NULL
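The pattern matching above works because the string "Actress" does not contain "Actor", so the two `grepl` calls never overwrite each other. The sketch below shows this on a few hypothetical category labels; the exact wording of the labels in the real file is an assumption. It also adds a `Role` vector as one possible way to approach the leading versus supporting question, assuming the category text contains the word "Supporting".

```r
# Hypothetical category labels; the real file may word these differently.
Category = c("Actor in a Leading Role", "Actress in a Supporting Role",
             "Best Picture", "Actor in a Supporting Role")

# Same approach as above: default to "Other", then overwrite on a match.
Gender = rep("Other", length(Category))
Gender[grepl("Actor", Category)] = "Actor"
Gender[grepl("Actress", Category)] = "Actress"

# Assumption: supporting categories contain the word "Supporting".
Role = ifelse(grepl("Supporting", Category), "Supporting", "Leading")
Role[Gender == "Other"] = NA   # role is not meaningful for non-acting awards
```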

library(plyr)
plyActor = ddply(dfActors, .(Nominee, Gender), summarize, FirstNominated = min(Year), 
    NumberNominated = length(Year), LastNominated = max(Year))

plyActor$Span = plyActor$LastNominated - plyActor$FirstNominated
row.names(plyActor) = NULL
meanActor = mean(plyActor$Span[plyActor$Gender == "Actor"])
meanActress = mean(plyActor$Span[plyActor$Gender == "Actress"])

We see that the mean length of time between first and last nomination is fairly comparable. Men have a slightly longer span, but only just. A box plot of the span looks like this:

library(ggplot2)
ggplot(plyActor, aes(factor(Gender), Span)) + geom_boxplot()

[Figure: box plot of Span by Gender]
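For readers without plyr, the per-nominee span and the group means can also be sketched in base R with `aggregate` and `tapply`. This is not the code used for the results above, only an equivalent under stated assumptions; `dfToy` and its values are made up for illustration.

```r
# Hypothetical nominations: one row per nomination (illustrative values only).
dfToy = data.frame(
  Nominee = c("A", "A", "A", "B", "C", "C"),
  Gender  = c("Actor", "Actor", "Actor", "Actress", "Actress", "Actress"),
  Year    = c(1950, 1960, 1975, 1980, 1990, 2000),
  stringsAsFactors = FALSE
)

# Span = last nomination year minus first, per nominee.
byNominee = aggregate(Year ~ Nominee + Gender, data = dfToy,
                      FUN = function(y) max(y) - min(y))
names(byNominee)[3] = "Span"

# Mean and median span within each gender.
meanSpan = tapply(byNominee$Span, byNominee$Gender, mean)
medianSpan = tapply(byNominee$Span, byNominee$Gender, median)
```

Note that anyone nominated only once has a span of zero, which pulls the mean down. Comparing the median alongside the mean, as in the last line, is one way to see how much those single nominations matter.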

We'll do the same for number of nominations. It's a similar window into the potential longevity of someone's career, or the degree to which someone commands attention.

actorNominees = mean(plyActor$NumberNominated[plyActor$Gender == "Actor"])
actressNominees = mean(plyActor$NumberNominated[plyActor$Gender == "Actress"])
ggplot(plyActor, aes(factor(Gender), NumberNominated)) + geom_boxplot()

[Figure: box plot of NumberNominated by Gender]

Out of curiosity, just who are those individuals whose career spans exceed 40 years? And which people have been nominated 10 or more times?

plyActor[plyActor$Span > 40, ]

##               Nominee  Gender FirstNominated NumberNominated LastNominated
## 321       Henry Fonda   Actor           1940               2          1981
## 455    Julie Christie Actress           1965               4          2007
## 466 Katharine Hepburn Actress           1932              12          1981
## 655       Paul Newman   Actor           1958               9          2002
## 671     Peter O'Toole   Actor           1962               8          2006
##     Span
## 321   41
## 455   42
## 466   49
## 655   44
## 671   44

plyActor[plyActor$NumberNominated >= 10, ]

##               Nominee  Gender FirstNominated NumberNominated LastNominated
## 77        Bette Davis Actress           1934              11          1962
## 345    Jack Nicholson   Actor           1969              12          2002
## 466 Katharine Hepburn Actress           1932              12          1981
## 594      Meryl Streep Actress           1978              16          2009
##     Span
## 77    28
## 345   33
## 466   49
## 594   31

OK, I could see that. Katharine Hepburn, Paul Newman, Julie Christie, Bette Davis. A superficial look suggests that neither gender suffers from an age bias. Mind, I'd love to have more data to explore this further. In the meantime, I think I'm going to go watch “On Golden Pond”. I saw it when it first came out and it was clearly one hell of a movie for older performers.

Tomorrow: Unsure what will be covered. I'm going to a PostgreSQL meetup, so possibly that.

sessionInfo()

## R version 3.0.2 (2013-09-25)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] knitr_1.4.1      RWordPress_0.2-3 ggplot2_0.9.3.1  plyr_1.8        
## 
## loaded via a namespace (and not attached):
##  [1] colorspace_1.2-3   dichromat_2.0-0    digest_0.6.3      
##  [4] evaluate_0.4.7     formatR_0.9        grid_3.0.2        
##  [7] gtable_0.1.2       labeling_0.2       MASS_7.3-29       
## [10] munsell_0.4.2      proto_0.3-10       RColorBrewer_1.0-5
## [13] RCurl_1.95-4.1     reshape2_1.2.2     scales_0.2.3      
## [16] stringr_0.6.2      tools_3.0.2        XML_3.98-1.1      
## [19] XMLRPC_0.3-0

