# 24 Days of R: Day 4

December 4, 2013

(This article was first published on PirateGrunt » R, and kindly contributed to R-bloggers)

So my first attempt to sort out the career of Michael Caine via parsing of HTML data was a wash. I'm going to try this again, using Wikipedia. They've got a nice, easy list of his films in an HTML table. Reading an HTML table into R is incredibly easy. The `XML` package has a function, `readHTMLTable`, to sort that out.

```r
library(XML)

URL = "http://en.wikipedia.org/wiki/Michael_Caine_filmography"
dfCaine = readHTMLTable(URL, stringsAsFactors = FALSE)
dfCaine = dfCaine[[1]]
```
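`readHTMLTable` returns a list with one data frame per table on the page, so `[[1]]` relies on the filmography being the first table. Here is a minimal sketch of a more defensive selection, assuming the table of interest can be recognised by its column names. The helper `pick_table` and the toy list are hypothetical stand-ins, not part of the `XML` package or the actual Wikipedia page.

```r
# Hypothetical helper: return the first data frame in a list that
# contains all of the required column names.
pick_table = function(tables, required) {
  for (tbl in tables) {
    if (is.data.frame(tbl) && all(required %in% names(tbl))) {
      return(tbl)
    }
  }
  stop("No table with columns: ", paste(required, collapse = ", "))
}

# Toy stand-in for the list that readHTMLTable() returns
tables = list(
  infobox = data.frame(Key = "Born", Value = "1933", stringsAsFactors = FALSE),
  films = data.frame(Year = c("1966", "1983"),
                     Title = c("Alfie", "Educating Rita"),
                     Notes = c("", ""),
                     stringsAsFactors = FALSE)
)

# Select by column names instead of by position
dfFilms = pick_table(tables, c("Year", "Title", "Notes"))
```

If the page layout changes and no table matches, this stops with an error instead of silently reading the wrong table.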

Man, that was easy. If I can find a few more hours, I'll spend more time with IMDb's file. Until then, though, this is just so much less hassle. So, once again, let's try to learn about Michael Caine.

```r
library(ggplot2)
plt = ggplot(dfCaine, aes(Year)) + geom_bar()
plt
```

And we get a very different picture from what we saw with my poor attempt to munge the IMDb data. With this, there doesn't appear to have been a late career resurgence, at least as far as the number of films. Let's have a quick look at prestige, though. We'll not bother to distinguish between a nomination and receipt of an award. For now, we'll just zero in on the word “award”.

```r
award = grep("award", tolower(dfCaine$Notes))
dfCaine$award = FALSE
dfCaine$award[award] = TRUE
qplot(Year, data = dfCaine, geom = "bar", fill = award)
```

So, it doesn't appear as though critics have taken special note of him in his later years. I would have hypothesized that actors are often judged against a body of work, which would mean that there is a greater likelihood that they will be recognized as they get older. That doesn't appear to be the case here.
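One way to check a visual impression like this is to tabulate the share of flagged films in each period. The sketch below uses a small made-up data frame (the years and notes are placeholders, not the Wikipedia data) and an arbitrary split at 1990. It also uses `grepl`, which yields the same logical flag in a single step.

```r
# Made-up rows for illustration only; not the real filmography
dfToy = data.frame(
  Year  = c(1966, 1966, 1972, 1983, 1986, 1999, 2002, 2008),
  Notes = c("Nominated: Academy Award", "", NA, "BAFTA Award",
            "Academy Award", "academy award", "", ""),
  stringsAsFactors = FALSE
)

# grepl() returns FALSE for NA notes, matching the grep()-then-index approach
dfToy$award = grepl("award", dfToy$Notes, ignore.case = TRUE)

# Split the career at an arbitrary year and compare the share of flagged films
dfToy$Period = ifelse(dfToy$Year < 1990, "before 1990", "1990 onward")
awardShare = tapply(dfToy$award, dfToy$Period, mean)
print(awardShare)
```

With the real data, `dfCaine` would take the place of `dfToy`, and `Year` would first need converting to a number.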

Finally, let's compress this into decades. The display of years works fine here in RStudio, but looks fairly dreadful on the web. Sir Michael has been around long enough that we can bin his career into ten year intervals with little loss of information.

```r
dfCaine$Year = as.numeric(dfCaine$Year)
dfCaine$Decade = trunc((dfCaine$Year - 1900)/10) * 10
qplot(Decade, data = dfCaine, geom = "bar", fill = award)
```
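A note on the arithmetic: because 1900 is subtracted first, the bins are labelled 50, 60 and so on, and the 2000s and 2010s come out as 100 and 110. Here is a small sketch on a few sample years, comparing that formula with an alternative that keeps four-digit labels. The alternative is only an option, not the code used for the plot.

```r
# A few sample years spanning the range discussed in the post
years = c(1956, 1966, 1983, 1999, 2000, 2013)

# Formula from the post: decades counted from 1900
decadeSince1900 = trunc((years - 1900) / 10) * 10

# Alternative: integer division keeps the full four-digit decade
decadeFull = (years %/% 10) * 10

# Side-by-side comparison of the two labellings
print(data.frame(years, decadeSince1900, decadeFull))
```

Both versions put every film in the same bin; only the axis labels differ.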

When the data are aggregated in this way, the 80's look like a bit of a high point (Educating Rita and Hannah and Her Sisters), with a more abrupt drop off in the 90's. The 2010's are just getting started. Let's hope it's a good decade for Michael Caine.

BTW, Kansas City has a society of film critics. I've never been to Kansas City and don't want to cast any aspersions, but it's hardly a hotbed of cultural activity. Do they really need a society of critics?

Tomorrow: more census data

```r
sessionInfo()
```

```
## R version 3.0.2 (2013-09-25)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## other attached packages:
## [1] knitr_1.4.1      RWordPress_0.2-3 ggplot2_0.9.3.1  XML_3.98-1.1
##
## loaded via a namespace (and not attached):
##  [1] colorspace_1.2-3   dichromat_2.0-0    digest_0.6.3
##  [4] evaluate_0.4.7     formatR_0.9        grid_3.0.2
##  [7] gtable_0.1.2       labeling_0.2       markdown_0.6.3
## [10] MASS_7.3-29        munsell_0.4.2      plyr_1.8
## [13] proto_0.3-10       RColorBrewer_1.0-5 RCurl_1.95-4.1
## [16] reshape2_1.2.2     scales_0.2.3       stringr_0.6.2
## [19] tools_3.0.2        XMLRPC_0.3-0
```
