24 Days of R: Day 4
So my first attempt to sort out the career of Michael Caine via parsing of HTML data was a wash. I'm going to try this again, using Wikipedia. They've got a nice, easy list of his films in an HTML table. Reading an HTML table into R is incredibly easy. The XML library has a function, readHTMLTable, to sort that out. It returns a list of every table on the page, so we keep the first one.
library(XML)
URL = "http://en.wikipedia.org/wiki/Michael_Caine_filmography"
dfCaine = readHTMLTable(URL, stringsAsFactors = FALSE)
dfCaine = dfCaine[[1]]
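Since readHTMLTable hands back a list, it's worth a quick look at what came back before grabbing an element by position. Here's a minimal sketch of that check. The HTML string is a made-up two-table page that I've inlined so the example doesn't depend on Wikipedia's current layout; the real page may well have more tables, in a different order.

```r
library(XML)

# Hypothetical two-table page, inlined so the sketch needs no network access
html <- paste0(
  "<html><body>",
  "<table><tr><th>Year</th><th>Title</th></tr>",
  "<tr><td>1964</td><td>Zulu</td></tr>",
  "<tr><td>1966</td><td>Alfie</td></tr></table>",
  "<table><tr><th>Key</th><th>Value</th></tr>",
  "<tr><td>a</td><td>1</td></tr>",
  "<tr><td>b</td><td>2</td></tr></table>",
  "</body></html>"
)

# Parse the string as HTML, then pull out every table as a data frame
doc <- htmlParse(html, asText = TRUE)
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)

length(tables)        # how many tables were found
lapply(tables, dim)   # rows and columns of each one
films <- tables[[1]]  # pick by position once you know which one you want
```

If the table you want isn't the first, the dimensions are usually enough to spot it.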
Man, that was easy. If I can catch a few more hours, I'll spend more time with imdb's file. Until then, though, this is just so much less hassle. So, once again, let's try to learn about Michael Caine.
library(ggplot2)
plt = ggplot(dfCaine, aes(Year)) + geom_bar()
plt
And we get a much different picture than what we saw with my poor attempt to munge the imdb data. With this, there doesn't appear to have been a late career resurgence, at least as far as the number of films. Let's have a quick look at prestige, though. We'll not bother to distinguish between a nomination and receipt of an award. For now, we'll just zero in on the word “award”.
award = grep("award", tolower(dfCaine$Notes))
dfCaine$award = FALSE
dfCaine$award[award] = TRUE
qplot(Year, data = dfCaine, geom = "bar", fill = award)
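The same flag can be built in one step with grepl, which returns a logical vector directly and treats missing notes as non-matches. This is just a sketch on a made-up data frame; the films data frame and its Notes values are placeholders, not rows from the Wikipedia table.

```r
# Hypothetical toy rows standing in for the scraped table
films <- data.frame(
  Year  = c("1966", "1983", "1986", "1988"),
  Notes = c("Nominated for an Academy Award", NA,
            "Won a Golden Globe Award", ""),
  stringsAsFactors = FALSE
)

# TRUE wherever the note mentions "award", regardless of case;
# NA and empty notes come back as FALSE
films$award <- grepl("award", films$Notes, ignore.case = TRUE)

table(films$award)
```

As in the post, a nomination and a win both count, because all we're matching is the word itself.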
So, it doesn't appear as though critics have taken special note of him in his later years. I would have hypothesized that actors are often judged against a body of work, which would mean that there is a greater likelihood that they will be recognized as they get older. That doesn't appear to be the case here.
Finally, let's compress this into decades. The display of years works fine here in RStudio, but looks fairly dreadful on the web. Sir Michael has been around long enough that we can bin his career into ten-year intervals with little loss of information.
dfCaine$Year = as.numeric(dfCaine$Year)
dfCaine$Decade = trunc((dfCaine$Year - 1900)/10) * 10
qplot(Decade, data = dfCaine, geom = "bar", fill = award)
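Subtracting 1900 labels the bins 60, 70 and so on, which means the 2000s and 2010s show up as 100 and 110. A variant that keeps the calendar year in the label might look like the sketch below. The years vector is made up, and the "TBA" entry is my assumption about the sort of non-numeric value a scraped Year column could contain.

```r
# Hypothetical year strings, including one that is not a number
years <- c("1966", "1983", "1986", "1999", "2010", "TBA")

# Non-numeric entries become NA; the coercion warning is silenced on purpose
year_num <- suppressWarnings(as.numeric(years))

# Round down to the start of the decade: 1983 -> 1980, 2010 -> 2010
decade <- floor(year_num / 10) * 10

table(decade, useNA = "ifany")
```

The bins are the same as before; only the labels change.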
When the data are aggregated in this way, the 1980s look like a bit of a high point (Educating Rita and Hannah and Her Sisters), with a more abrupt drop-off in the 1990s. The 2010s are just getting started. Let's hope it's a good decade for Michael Caine.
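The stacked bars show counts, so a busy decade will look decorated even if the hit rate is ordinary. One way to check the hypothesis about late-career recognition is to look at the share of films per decade that carry an award note. A rough sketch, on made-up values rather than the real table:

```r
# Hypothetical decade and award flags, one entry per film
decade <- c(1960, 1960, 1980, 1980, 1980, 1990)
award  <- c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE)

# The mean of a logical vector is the proportion of TRUE values,
# so this gives the share of flagged films in each decade
share <- tapply(award, decade, mean)

round(share, 2)
```

With only a handful of films in some decades, the shares will bounce around, so I wouldn't read too much into small differences.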
BTW, Kansas City has a society of film critics. I've never been to Kansas City and don't want to cast any aspersions, but it's hardly a hotbed of cultural activity. Do they really need a society of critics?
Tomorrow: more census data
sessionInfo()
## R version 3.0.2 (2013-09-25)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] knitr_1.4.1 RWordPress_0.2-3 ggplot2_0.9.3.1 XML_3.98-1.1
##
## loaded via a namespace (and not attached):
## [1] colorspace_1.2-3 dichromat_2.0-0 digest_0.6.3
## [4] evaluate_0.4.7 formatR_0.9 grid_3.0.2
## [7] gtable_0.1.2 labeling_0.2 markdown_0.6.3
## [10] MASS_7.3-29 munsell_0.4.2 plyr_1.8
## [13] proto_0.3-10 RColorBrewer_1.0-5 RCurl_1.95-4.1
## [16] reshape2_1.2.2 scales_0.2.3 stringr_0.6.2
## [19] tools_3.0.2 XMLRPC_0.3-0