24 Days of R: Day 4

December 4, 2013

(This article was first published on PirateGrunt » R, and kindly contributed to R-bloggers)

So my first attempt to sort out the career of Michael Caine via parsing of HTML data was a wash. I'm going to try this again, using Wikipedia. They've got a nice, easy list of his films in an HTML table. Reading an HTML table into R is incredibly easy: the XML package's readHTMLTable function returns every table on a page as a list of data frames.


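library(XML)
URL = "http://en.wikipedia.org/wiki/Michael_Caine_filmography"
# readHTMLTable returns a list with one data frame per table on the page
dfCaine = readHTMLTable(URL, stringsAsFactors = FALSE)
dfCaine = dfCaine[[1]]  # the filmography is the first table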

Man, that was easy. If I can catch a few more hours, I'll spend more time with IMDb's file. Until then, though, this is just so much less hassle. So, once again, let's try to learn about Michael Caine.
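Before plotting, it can be worth checking what readHTMLTable actually handed back, since the first table on a page is not guaranteed to be the one you want. What follows is a minimal sketch on a made-up two-table HTML snippet. The snippet and the names html, doc, tables, isFilmography and dfFilms are illustrative assumptions, not taken from the Wikipedia page. It parses from a string, so it runs without a network connection.

```r
library(XML)

# Hypothetical stand-in for a page that holds more than one table
html = "<html><body>
<table><tr><th>Key</th><th>Value</th></tr>
<tr><td>Some key</td><td>Some value</td></tr></table>
<table><tr><th>Year</th><th>Title</th><th>Notes</th></tr>
<tr><td>1966</td><td>Film A</td><td>Nominated for Some Award</td></tr>
<tr><td>1969</td><td>Film B</td><td>No note</td></tr></table>
</body></html>"

# Parse the string rather than a URL, then pull out every table
doc = htmlParse(html, asText = TRUE)
tables = readHTMLTable(doc, header = TRUE, stringsAsFactors = FALSE)

# How many tables came back, and what shape is each one?
length(tables)
sapply(tables, dim)

# Pick the table by its column names rather than by position
isFilmography = sapply(tables, function(tbl) "Year" %in% names(tbl))
dfFilms = tables[[which(isFilmography)[1]]]
```

Selecting by column name is a bit more typing than dfCaine[[1]], but it keeps working if the page gains a table ahead of the one you care about.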

library(ggplot2)
print(ggplot(dfCaine, aes(Year)) + geom_bar())  # one bar per year, height = number of films

Figure: bar chart of films per year (chunk plotYears)

And we get a much different picture than what we saw with my poor attempt to munge the IMDb data. With this, there doesn't appear to have been a late career resurgence, at least as far as the number of films goes. Let's have a quick look at prestige, though. We'll not bother to distinguish between a nomination and receipt of an award. For now, we'll just zero in on the word “award”.

# Flag films whose Notes mention "award" (nomination or win), ignoring case
dfCaine$award = grepl("award", dfCaine$Notes, ignore.case = TRUE)
qplot(Year, data = dfCaine, geom = "bar", fill = award)

Figure: bar chart of films per year, filled by award flag (chunk awards)
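The flagging step can be tried on its own with a toy data frame. This is only a sketch: dfToy and its contents are made up for illustration and are not rows from the Wikipedia table. It shows how a case-insensitive match treats empty and missing notes.

```r
# Toy data standing in for the Year and Notes columns (made up for illustration)
dfToy = data.frame(
  Year = c("1966", "1969", "1983", "1986", "1998"),
  Notes = c("Nominated for Some Award", "", "Won Another AWARD", NA, "Uncredited"),
  stringsAsFactors = FALSE
)

# grepl returns one TRUE/FALSE per row, so no separate index step is needed;
# empty and missing notes both come back FALSE
dfToy$award = grepl("award", dfToy$Notes, ignore.case = TRUE)

# Cross-tabulate year against the flag
table(dfToy$Year, dfToy$award)
```

Note that a plain substring match will also pick up notes that mention an award in passing, which is the price of not distinguishing nominations from wins.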

So, it doesn't appear as though critics have taken special note of him in his later years. I would have hypothesized that actors are often judged against a body of work, which would mean that there is a greater likelihood that they will be recognized as they get older. That doesn't appear to be the case here.

Finally, let's compress this into decades. The display of years works fine here in RStudio, but looks fairly dreadful on the web. Sir Michael has been around long enough that we can bin his career into ten-year intervals with little loss of information.

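# Year was read in as text; convert it before doing arithmetic
dfCaine$Year = as.numeric(dfCaine$Year)
# Decades are counted from 1900, so 60 means the 1960s and 100 means the 2000s
dfCaine$Decade = trunc((dfCaine$Year - 1900)/10) * 10
qplot(Decade, data = dfCaine, geom = "bar", fill = award)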

Figure: bar chart of films per decade, filled by award flag (chunk decades)
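To see what the binning does, here is a small sketch on a handful of example years (the vector years and the names decadeFrom1900 and decadeCalendar are just for illustration). It compares the formula used above with an alternative that keeps the century, which may read more naturally on an axis.

```r
years = c(1966, 1983, 1999, 2008, 2012)

# Formula used above: decades counted from 1900
decadeFrom1900 = trunc((years - 1900) / 10) * 10

# Alternative: keep the century so the labels read as calendar decades
decadeCalendar = floor(years / 10) * 10

data.frame(years, decadeFrom1900, decadeCalendar)
# decadeFrom1900: 60 80 90 100 110
# decadeCalendar: 1960 1980 1990 2000 2010
```

Either version gives the same bars. The only difference is whether the 2000s show up as 100 or as 2000.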

When the data are aggregated in this way, the 1980s look like a bit of a high point (Educating Rita and Hannah and Her Sisters), with a more abrupt drop-off in the 1990s. The 2010s are just getting started. Let's hope it's a good decade for Michael Caine.
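Since decades differ in how many films they contain, one way to put a number on "high point" is the share of films flagged in each decade rather than the raw count. The sketch below uses a made-up data frame (dfToy and its values are assumptions for illustration, not the real filmography) and relies on the fact that the mean of a logical vector is the proportion of TRUE values.

```r
# Toy data (made up): one row per film with its decade and award flag
dfToy = data.frame(
  Decade = c(60, 60, 70, 80, 80, 80, 90, 90),
  award  = c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE)
)

# Number of films per decade
films = tapply(dfToy$award, dfToy$Decade, length)

# Share of flagged films per decade: mean of a logical = proportion TRUE
share = tapply(dfToy$award, dfToy$Decade, mean)

data.frame(
  Decade = names(films),
  films  = as.vector(films),
  share  = round(as.vector(share), 2)
)
```

Running the same two tapply calls on dfCaine$award and dfCaine$Decade would give the per-decade shares for the real table.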

BTW, Kansas City has a society of film critics. I've never been to Kansas City and don't want to cast any aspersions, but it's hardly a hotbed of cultural activity. Do they really need a society of critics?

Tomorrow: more census data

## R version 3.0.2 (2013-09-25)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## other attached packages:
## [1] knitr_1.4.1      RWordPress_0.2-3 ggplot2_0.9.3.1  XML_3.98-1.1    
## loaded via a namespace (and not attached):
##  [1] colorspace_1.2-3   dichromat_2.0-0    digest_0.6.3      
##  [4] evaluate_0.4.7     formatR_0.9        grid_3.0.2        
##  [7] gtable_0.1.2       labeling_0.2       markdown_0.6.3    
## [10] MASS_7.3-29        munsell_0.4.2      plyr_1.8          
## [13] proto_0.3-10       RColorBrewer_1.0-5 RCurl_1.95-4.1    
## [16] reshape2_1.2.2     scales_0.2.3       stringr_0.6.2     
## [19] tools_3.0.2        XMLRPC_0.3-0
