Newcastle R course, a write-up

September 22, 2014

(This article was first published on Robin Lovelace - R, and kindly contributed to R-bloggers)

I recently attended a week-long R course in Newcastle, taught by Colin Gillespie. It went from “An Introduction to R” to “Advanced Graphics” via a day each on modelling, efficiency and programming. Suffice to say it was an intense 5 days!

Overall it was the best R course I’ve been on so far. I’d recommend it to others, from advanced users to adventurous novices. Below I explain why, with a brief description of each day and an emphasis on day 2.

Day 1

Day 1 introduced R in the context of alternatives: PythonSAS and the dreaded Microsoft Excel. Amusingly, Excel was often used as a reason of why to use R. I skipped the second half of the day but from what I saw it seemed a fun and practical introduction to an ostensibly dry subject.

Day 2

On the second day, we moved firmly into the realm of applications with a focus on statistics. In addition to new insights into R, I also gained new insights into statistical test. I wasn’t expecting this when signing up to an R course but it was well worth it. Colin’s description of the T-test, the related QQ plot and other commonly used tests were some of the clearest I’ve ever heard.

Quirks in R were raised, providing insight into its origins, one example being the number of datasets lying around: “An artifact of R being designed by statisticians is that there are loads of random datasets just lying around.”

Insights into T tests in R

sleep is a dataset in base R on the impact of different drugs on sleeping patterns of 10 students, which is actually rather useful for explaining t tests in R.

##   extra group ID
## 1   0.7     1  1
## 2  -1.6     1  2
## 3  -0.2     1  3
## 4  -1.2     1  4
## 5  -0.1     1  5
## 6   3.4     1  6

A trick I didn’t know was that tapply can be used to run a command on sub-groups within a single vector:

tapply(sleep$extra, INDEX = sleep$group, FUN = sd)
##     1     2 
## 1.789 2.002

This shows that the standard deviation of extra sleep values are comparable, with the second group’s response showing slightly higher variability. These are the types of test needed to check whether the t test is appropriate. Because a t test is appropriate, we go ahead and run it, resulting in a significant difference between the two drugs:

t.test(extra ~ group, paired = TRUE, data = sleep, )
##  Paired t-test
## data:  extra by group
## t = -4.062, df = 9, p-value = 0.002833
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.4599 -0.7001
## sample estimates:
## mean of the differences 
##                   -1.58

A new insight into lists

The list is a well-known data class in R, used to contain collections of any other objects. I’ve always known that list objects should be referred to using the double square bracket notation – e.g. [[5]]. But what I never knew, until attending this course, was why. Lets take a look at a simple example:

lst <- list(x = c(1:3), y = "z", z = c(2:4))
## [1] 1 2 3
## $x
## [1] 1 2 3

We can see the second output is printed differently: the only difference between the single and double notation is the output type. Double square brackets output the object’s inate data type; single square brackets always output another list. This explains the error in the following code:

## $x
## [1] 1 2 3
## $y
## [1] "z"
## [1] 2
## $x
## [1] 1 2 3
## $y
## [1] "z"
## $z
## [1] 2 3 4
## Error: recursive indexing failed at level 2

In the second line the output seems unintuitive. This is how it works: 1:2 is the same as c(1, 2), a vector of length 2 and each number refers to a different ‘level’ in the list. Therefore [

] fails where 


lst[[c(2, 2)]]
## Error: subscript out of bounds
lst[c(2, 2)]
## $y
## [1] "z"
## $y
## [1] "z"

Adding interaction in ANOVA models

Another interesting little tidbit of knowledge I learned on the course was that the : symbol can be used in R formulae to represent interaction between two independent variables. Thus, y ~ a + b + a:b includes the influence of a and b on y, as well as the impact of their interaction. However, R users usually save time by writing this simply as y ~ a * b.

To test this example was provided – the latter two tests generate the same result (not shown):

m1 <- aov(weightgain ~ source, data = ratfeed)
m2 <- aov(weightgain ~ source * type, data = ratfeed)

m3 <- aov(weightgain ~ source * type, data = ratfeed)

No-nonsense introduction to clustering

Clustering is something that always seemed rather mysterious to me. The criteria by which observations are deemed ‘similar’ appear arbitrary. Indeed, Colin stated that ‘clustering never proves anything’ and that there are always infinite ways to cluster observations. The no-nonsense description was a breath of fresh air compared with other descriptions of clustering which hide the process behind a wall of jargon.

The simplest way to cluster data is via ‘Euclidean distance’ between each combination of observations. The distance between observation 1 and 2, for example, is simply, the sum of squared differences between each variable:




Day 3

Day 3 provided a solid introduction to programming in R with attention paid to if statements, function creations, the bizarre yet useful apply family and recommended R workflows.

It was great to have a solid description of if statements in R, which I’d used tentatively before but never felt 100% sure using. The following example was used to demonstrate multiple ifs and the final else that completes such structures:

x <- 5
if(x > 10){
  message("x is greater than 10")
} else if(x > 5){
  message("x is greater than 5")
} else if(x > 0){
  message("x is greater than 0")
} else if(x > -5){
  message("x is greater than -5")
} else {
  message("x is less than -5")
## x is greater than 0

Also, Colin’s description of R workflows was very useful: I already used a variant of it, but the description of how to structure work reaffirmed the importance of order for me and will be of use next time I teach R.

This is a variant of the file structure Colin recommends, that I’m implementing on an R for spatial microsimulation book project I’m working on:

|-- book.Rmd
|-- data
|-- figures
|-- output
|-- R
|   |-- load.R
|   `-- parallel-ipfp.R
`-- spatial-microsim-book.Rproj

Day 4

On Day 4 we were guided through benchmarking in R and its importance for developing fast code. I learned about the parallelpackage which is hidden away in base R (its functions are only activated with library(parallel) — try it!). parallel provides parallel versions of some of the apply functions, such as parApply. This is really useful if you have speed critical applications running on multi-core devices.

Parallelisation might get you a speed boost of a factor of 2 to 4, but dumb code can hit you with a 100 fold speed penalty.Therefore the priority should be to check your code for bottlenecks before even considering parallelisation.

A good example is avoiding object recreation/expansion during for loops: create the vector size you want outside the loop. This explains why m1 is so much slower than the rest. We also looked at vectorisation: m3 illustrates that vectorised code is often faster than for loop alternatives:

# method 1
m1 <- function(n){ mvec = NULL
for(i in 1:n){
  mvec = c(mvec, i)

# method 2
m2 <- function(n){
mvec = numeric(n)
for(i in 1:n){
  mvec[i] <- i

# method 3
m3 <- function(n){1:n}

n = 10000
mb <- microbenchmark(m1(n), m2(n), m3(n), times = 2L)
## Unit: microseconds
##   expr       min        lq    median        uq       max neval
##  m1(n) 161579.59 161579.59 167862.32 174145.05 174145.05     2
##  m2(n)   8500.44   8500.44   9222.45   9944.45   9944.45     2
##  m3(n)     19.66     19.66     21.49     23.33     23.33     2

Day 5

The was the reward for all our hard work on the other days: making pretty pictures. We looked into the internals of ggplot2and found a powerhouse graphics package inside. A guy sitting next to me used the insight gleaned from this final day to re-draw a plot he was presenting the subsequent week, so it was certainly useful!


Overall it was well worth the time and money and I look forward to attending the more advanced course on “Advanced Programming in R” and package creation. The course materials were excellent: clear, concise and entertaining at times (the first R tutorials I’ve seen plastered with cartoons!).

The practicals were based on a series of R packages that can be installed for free. Simply typing the following will give you an insight into the teaching on offer, and allow you to play Monopoly in R!

install.packages("nclRintroduction", repos = "")
install.packages("nclRmodelling", repos = "")
install.packages("nclRprogramming", repos = "")
install.packages("nclRefficient", repos = "")

vignette(package = "nclRprogramming")
vignette(topic = "practical2b", package = "nclRprogramming")

If you struggle with any of the questions, maybe you need to attend an R course or read a book to brush up! Stackoverflow is great but for some things you cannot beat face to face interaction or the depth of a book.

To leave a comment for the author, please follow the link and comment on their blog: Robin Lovelace - R. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.


Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)