24 Days of R: Day 10

December 10, 2013

(This article was first published on PirateGrunt » R, and kindly contributed to R-bloggers)

How often is someone nominated for an Academy Award? Who has been nominated most often? Is there a difference between leading and supporting roles? Important questions. To answer them, I'm making use of a list of Academy Award nominees and winners. I obtained the data from aggdata.com, which offers a few free data sets. We'll open the file, do some basic cleanup and then have a look at the results for Michael Caine. Note that these results only run through 2010.

dfAwards = read.csv("./Data/academy_awards.csv", stringsAsFactors = FALSE)
dfAwards = dfAwards[, 1:5]
dfAwards$Year = as.numeric(substr(dfAwards$Year, 1, 4))
colnames(dfAwards) = gsub(".", "", colnames(dfAwards), fixed = TRUE)
dfAwards$Won = dfAwards$Won == "YES"

dfCaine = subset(dfAwards, Nominee == "Michael Caine")
row.names(dfCaine) = NULL

FirstNominated = min(dfCaine$Year)
FirstWon = min(dfCaine$Year[dfCaine$Won == TRUE])

Michael Caine has been nominated 6 times and has won 2 times. It took 20 years for him to win his first award. That's a long time. My guess is that actors are more likely than actresses to receive multiple nominations, and to receive them over a longer period of time. I'll split the data into actor and actress categories to test this.
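
The figures quoted above can be derived from a per-nominee subset along these lines. This is only a sketch: `dfToy` and its rows are hypothetical stand-ins for `dfCaine`, not values from the aggdata file, and the variable names are mine.

```r
# Hypothetical stand-in for dfCaine: one row per nomination, Won is logical.
dfToy = data.frame(
  Nominee = "Nominee A",
  Year = c(1970, 1975, 1990, 1995),
  Won = c(FALSE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

numNominations = nrow(dfToy)   # one row per nomination
numWins = sum(dfToy$Won)       # TRUE counts as 1, FALSE as 0

firstNominated = min(dfToy$Year)
# min() of an empty vector returns Inf with a warning,
# so this line is only meaningful when numWins > 0.
firstWon = min(dfToy$Year[dfToy$Won])

yearsToFirstWin = firstWon - firstNominated
```

For a nominee who never won, `numWins` is 0 and `firstWon` should be skipped or set to `NA`.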

dfAwards$Gender = "Other"
dfAwards$Gender[grepl("Actor", dfAwards$Category)] = "Actor"
dfAwards$Gender[grepl("Actress", dfAwards$Category)] = "Actress"
dfActors = subset(dfAwards, Gender != "Other")
row.names(dfActors) = NULL
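
To see what the `grepl` labelling does, here is a small sketch on a standalone vector. The category strings are hypothetical; the wording in the real file may differ.

```r
# Hypothetical category labels, assumed for illustration only.
category = c("Actor -- Leading Role", "Actress -- Supporting Role", "Directing")

# Start everything as "Other", then overwrite the rows that match.
gender = rep("Other", length(category))
gender[grepl("Actor", category, fixed = TRUE)] = "Actor"
gender[grepl("Actress", category, fixed = TRUE)] = "Actress"

print(gender)
```

Note that "Actress" does not contain the substring "Actor", so the two assignments don't interfere with each other. The match is case-sensitive, so a category spelled "ACTOR" would stay in "Other".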

library(plyr)
plyActor = ddply(dfActors, .(Nominee, Gender), summarize, FirstNominated = min(Year), 
    NumberNominated = length(Year), LastNominated = max(Year))

plyActor$Span = plyActor$LastNominated - plyActor$FirstNominated
row.names(plyActor) = NULL
meanActor = mean(plyActor$Span[plyActor$Gender == "Actor"])
meanActress = mean(plyActor$Span[plyActor$Gender == "Actress"])
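
As an aside, the two group means can be had in one call with `tapply`, and the median is worth a look as well, since every nominee with a single nomination has a span of 0 by construction. A minimal sketch on hypothetical data (`dfToy` is not the real `plyActor`):

```r
# Hypothetical stand-in for plyActor with just the two columns we need.
dfToy = data.frame(
  Gender = c("Actor", "Actor", "Actress", "Actress"),
  Span = c(0, 10, 4, 8),
  stringsAsFactors = FALSE
)

# tapply() splits Span by Gender and applies the function to each group,
# returning a named vector with one entry per group.
meanSpan = tapply(dfToy$Span, dfToy$Gender, mean)
medianSpan = tapply(dfToy$Span, dfToy$Gender, median)

print(meanSpan)
print(medianSpan)
```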

We see that the mean length of time between first and last nomination is fairly comparable. Men have a slightly longer span, but only just. A box plot of the span looks like this:

library(ggplot2)
ggplot(plyActor, aes(factor(Gender), Span)) + geom_boxplot()

[Figure: box plot of Span by Gender]

We'll do the same for number of nominations. It's a similar window into the potential longevity of someone's career, or the degree to which someone commands attention.

actorNominees = mean(plyActor$NumberNominated[plyActor$Gender == "Actor"])
actressNominees = mean(plyActor$NumberNominated[plyActor$Gender == "Actress"])
ggplot(plyActor, aes(factor(Gender), NumberNominated)) + geom_boxplot()

[Figure: box plot of NumberNominated by Gender]

Out of curiosity, just who are those individuals whose career spans are greater than 40 years? And which people have been nominated 10 or more times?

plyActor[plyActor$Span > 40, ]
##               Nominee  Gender FirstNominated NumberNominated LastNominated
## 321       Henry Fonda   Actor           1940               2          1981
## 455    Julie Christie Actress           1965               4          2007
## 466 Katharine Hepburn Actress           1932              12          1981
## 655       Paul Newman   Actor           1958               9          2002
## 671     Peter O'Toole   Actor           1962               8          2006
##     Span
## 321   41
## 455   42
## 466   49
## 655   44
## 671   44
plyActor[plyActor$NumberNominated >= 10, ]
##               Nominee  Gender FirstNominated NumberNominated LastNominated
## 77        Bette Davis Actress           1934              11          1962
## 345    Jack Nicholson   Actor           1969              12          2002
## 466 Katharine Hepburn Actress           1932              12          1981
## 594      Meryl Streep Actress           1978              16          2009
##     Span
## 77    28
## 345   33
## 466   49
## 594   31
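
Instead of picking a cutoff, one could also rank nominees and look at the top few. A rough sketch with `order` on hypothetical data (the names and spans below are invented, not taken from the results above):

```r
# Hypothetical stand-in for plyActor.
dfToy = data.frame(
  Nominee = c("Nominee A", "Nominee B", "Nominee C"),
  Span = c(12, 45, 30),
  stringsAsFactors = FALSE
)

# order() returns the row indices that sort Span from longest to shortest.
dfRanked = dfToy[order(dfToy$Span, decreasing = TRUE), ]

# Keep the two longest spans.
topTwo = head(dfRanked, 2)
print(topTwo)
```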

OK, I could see that. Katharine Hepburn, Paul Newman, Julie Christie, Bette Davis. A superficial look suggests that age bias may not differ much by gender. Mind, I'd love to have more data to explore this further. In the meantime, I think I'm going to go watch "On Golden Pond". I saw it when it first came out and it was clearly one hell of a movie for older performers.

Tomorrow: Unsure what will be covered. I'm going to a PostgreSQL meetup, so possibly that.

sessionInfo()
## R version 3.0.2 (2013-09-25)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] knitr_1.4.1      RWordPress_0.2-3 ggplot2_0.9.3.1  plyr_1.8        
## 
## loaded via a namespace (and not attached):
##  [1] colorspace_1.2-3   dichromat_2.0-0    digest_0.6.3      
##  [4] evaluate_0.4.7     formatR_0.9        grid_3.0.2        
##  [7] gtable_0.1.2       labeling_0.2       MASS_7.3-29       
## [10] munsell_0.4.2      proto_0.3-10       RColorBrewer_1.0-5
## [13] RCurl_1.95-4.1     reshape2_1.2.2     scales_0.2.3      
## [16] stringr_0.6.2      tools_3.0.2        XML_3.98-1.1      
## [19] XMLRPC_0.3-0
