Coordinates: 2014 September 15-17 in the London borough of #rstats.
I had just the right number of R bugs so that I could walk to the drinks and arrive fashionably late. On the way, I realized that I hadn’t been near the Tower of London since the first year I moved to London even though I live within walking distance.
After arriving and receiving my Mango the cat, I got into the mix.
Figure 1: Mango the cat with friend (courtesy of Andrie de Vries).
I was asked, “What do you think is new and exciting in R?”
I didn’t have an answer, but my answer should have been the conference we were just embarking on. I think the advent of EARL boosts the chances for my prediction that R will be standard in commercial settings by 2020.
Markus Gesmann: I’m chairing the session you are speaking in tomorrow, what would you like for an introduction?
me: A chorus line would be nice.
In the course of the evening conversation, I had the opportunity to talk about the New York Times US dialect test (which does have an R connection, but that is not why I was talking about it).
plenary talk 1
Hadley talks about the importance of pipes using the magrittr package. Pipes can be used when the type of each function’s first argument is the same as the type of the previous function’s result. Liz hyperventilates.
Hadley talks about tidyr. Liz hyperventilates.
Hadley talks about dplyr. Ben Goldacre appears, Liz continues to hyperventilate.
Hadley talks about ggvis. Liz ceases to hyperventilate, remains pale.
plenary talk 2
Ben Goldacre starts by confessing to being a Stata user (but he did take a decontamination course that didn’t stick). He gives a range of examples of how the discipline of statistics is either ignored or abused in the pharmaceutical industry and medical practice.
Primarily what I learned was that when you read Ben’s books you should read them really, really fast. And LOUD.
The slide that seemed to generate the most discussion illustrated the Dunning-Kruger effect. The slide showed the actual performance on a test and the perception of the test-takers of their own performance. The general shape was that those who tested below and up to the median thought they scored slightly above median; those slightly above median were accurate; and the best slightly underestimated their performance.
The Foole doth thinke he is wise, but the wiseman knowes himselfe to be a Foole.
— from “As You Like It” by William Shakespeare
In discussion with a few people, Ben says that he’d like there to be a “baby R”. His motivation is wanting a system that his students can easily use and that can be set up to give immediate feedback in class. Ben is keenly aware that what he is asking for is not easily built.
swimming in data
Markus Gesmann (remember him, the one who promised me a chorus line?) talked about some of his successes at Lloyd’s of London in using data to drive profitable decisions. He had a nice slide of the trade-off of automation relative to the size of the task. There were also improvements on the whale charts that he has talked about elsewhere.
Tim Paulden of Atass Sports gave a nice talk on how to predict the outcome of a tennis match given the rankings of the two players. He started with a very naive model, and showed what was good and bad about that model. Then he looked at more elaborate models, some of which were worse in practical terms than the naive model. The talk progressed the way modeling should generally be done.
long road from wrist to brain
Immediately after tennis was Joss Langford from Activinsights. A high frequency time series is generated by an accelerometer worn on the wrist. The position of the wrist is inferred from the data. The activities that are being done then need to be inferred from the wrist motion. It would be easy to suggest 100 PhD topics out of this talk.
In addition to teaching us what its title meant, the talk by Kevin Savage of Mendeley had an interesting anomaly in the analysis. Kevin built a model to predict the number of users. Tal Galili spotted during the presentation that the residuals from the stl fit are decidedly autocorrelated, not at all like your average set of residuals. I think the problem is that there is strong seasonality but that it is not on a regular schedule: the period seems to be different in different cycles.
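To see how an irregular season could leave autocorrelated residuals, here is a minimal sketch on simulated data. It is not the Mendeley data or Kevin's model. The cycle lengths, noise level and variable names are all my assumptions, and the only point is that stl with a fixed period of 12 cannot absorb cycles whose length wanders.

```r
# Simulated series (an assumption, not the data from the talk):
# seasonal cycles whose length varies between 10 and 14 observations.
set.seed(42)
cycle_lengths <- sample(10:14, 50, replace = TRUE)
phase <- unlist(lapply(cycle_lengths, function(len) {
  2 * pi * (seq_len(len) - 1) / len
}))
n <- 480
y <- sin(phase[seq_len(n)]) + rnorm(n, sd = 0.1)

# Tell stl() that the period is a fixed 12, which is only right on average.
series <- ts(y, frequency = 12)
fit <- stl(series, s.window = "periodic")
remainder <- fit$time.series[, "remainder"]

# White-noise residuals would have autocorrelations near zero.
lag1 <- acf(remainder, plot = FALSE)$acf[2]
ljung_box <- Box.test(remainder, lag = 12, type = "Ljung-Box")

print(lag1)
print(ljung_box$p.value)
```

A large lag-1 autocorrelation together with a tiny Ljung-Box p-value is the sort of thing Tal was pointing at: the leftover seasonal signal ends up in the remainder.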
The reason the talk was called “Toilet Stats” was that a chart of the predictions was posted above the toilet, where workers could ponder the company’s future.
making money with R
During the break Joss Langford offered his opinion on R as an example of an open source project. He thinks that the sign of a successful project is one that people can make money off of (think Red Hat and Linux). This may grate on those who want there to be only an economy based on status, but I think it is correct. He doesn’t think that R is there yet. One thing he sees as missing is a map of the major players to orient newcomers.
I gave my talk on risk management. The chorus line failed to appear. Not even a hint of one — how disappointing.
Somehow in the slides I forgot to include svUnit as an R package that does unit testing.
Richard Saldanha of Investec gave a talk that was complementary to mine. I showed a plot of liquidity; he talked about a function that would produce the ingredients for the plot. The two were created completely independently: different solutions to almost the same problem.
The finance session also included John James of Tibco talking about a large optimization problem, and David Jessop and Claire Jones from UBS talking about the problems of estimating betas from high frequency market data.
that networking thing
Pat (to stranger): Hi, I’m Pat.
(We talk for a while.)
Stranger (to stranger’s friend who has arrived): This is the Pat that Data Table Guy talked about.
Stranger’s friend nods head knowingly.
[Editor’s note: such talk by Data Table Guy speciously raises expectations of Pat.]
Dinner was across the river and upstream a few meters aboard HMS Belfast. An attendee from Belfast found it to be false advertising — it doesn’t have anything to do with Belfast.
The evening included several interesting conversations, including one with a certain gatecrasher.
back at home
Pat: We had dinner on a battleship in the Thames tonight.
Wife: Did you fire any cannons at buildings or anything?
Wife: All those programmers and what good are they?
Frank Hedler and Ryan Howard of Simpson Carpenter shouted their appreciation of R: it gives a small team with a limited budget a large tool kit.
sharing data analyses
Chad Goymer at Lloyd’s of London talked about their system of bringing the benefits of R to those who don’t use R. An analysis can be specified in Excel, R does the analysis and then a report is created in Excel, HTML or Word.
There was a session devoted to the DDMoRe project. Their summary of themselves is:
The Drug Disease Model Resources (DDMoRe) consortium builds and maintains a universally applicable, open source, model based framework, intended as the gold standard for future collaborative drug and disease Modelling & Simulation.
R is an integral part of the project.
lunch with Romain
I happened to be in the queue for food with Romain Francois. I asked him about the prospect of the R Graph Gallery coming back online. He said that he doesn’t feel he has the time to deal with it, but he’d be happy to pass it on to someone else.
There are now a couple of things calling themselves “R Graph Gallery” but they are pale imitations of the real thing. Please, please someone grab it from Romain and give it back to us. Then I won’t have to feel guilty about still having the link on my website.
Wit Jakuczun from WLOG Solutions talked about optimization.
the smell of R
Steven Fitzpatrick talked about R at Firmenich, a company you may not have heard of but which you undoubtedly have experienced. It produces fragrances and flavors — lots of them.
What was fascinating about this talk was the attitude towards R. They seem not to be open-source crusaders — they are building almost all of what they do in-house just for themselves. They are not using R to save money — they are apparently not so frugal with their development. Perhaps I’m misconstruing, but the big plus for them seems to be R’s power and flexibility.
Alistair Crossling used mobile phone data to show concentrations of people throughout the day in central London.
During the break I was asked what qualities of R make it especially useful for data analysis. It seemed like an excellent question for which I wasn’t entirely prepared.
I said some of the obvious things:
- R is strong in graphics (which should be central when analyzing data).
- R has an extraordinary number of packages devoted to all sorts of data analysis.
- The R community is very strong, and quite devoted to data analysis.
I didn’t say that R is both a language and interactive. Languages are expressive, and data analysis is inherently an interactive process. What you find at one point determines what you will want to do next. This duality of R is also the source of some of its users’ trauma.
Another characteristic is that functions are first-class objects. You might have a statement like:
If instead you want the median or some bizarre estimate that you make up on the spot, then that is trivial to arrange. This flexibility serves the same sort of needs as interactivity.
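A minimal sketch of what first-class functions buy you, assuming a made-up matrix and apply over its columns. The matrix m and the midhinge function are inventions for illustration, not something from the conference.

```r
set.seed(1)
m <- matrix(rnorm(40), nrow = 10, ncol = 4)  # hypothetical data

# The function is just another argument: here the mean of each column.
col_means <- apply(m, 2, mean)

# Swapping in the median is a one-word change.
col_medians <- apply(m, 2, median)

# A made-up-on-the-spot estimate: the midpoint of the two quartiles.
midhinge <- function(x) mean(quantile(x, c(0.25, 0.75)))
col_midhinge <- apply(m, 2, midhinge)

print(rbind(col_means, col_medians, col_midhinge))
```

The call keeps the same shape each time; only the function handed to it changes.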
The final session I attended was my favorite. This is partly because my expectations weren’t particularly high (I hadn’t thought about it very hard). The speakers in the session were Tal Galili, Matt Sundquist and John Burn-Murdoch.
Tal Galili’s dendextend package enhances the ability to plot tree structures such as those produced by hierarchical clustering.
This talk convinced me of the periodic usefulness of pipes (of the magrittr variety). Hadley had softened the ground, Tal pushed me over the edge. I do think there is a problem though. The advantage of pipes is that they are more intuitive, once you get used to them. But people will still need to be able to decipher nested function calls. So pipes add to the R disease of there being multiple incongruent ways of doing the same thing.
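For anyone who has not yet met the two styles side by side, here is a small sketch, assuming the magrittr package is installed. The vector x is made up and the example is not taken from Tal's slides.

```r
library(magrittr)  # provides the %>% pipe

x <- c(4, 9, 16, 25)  # hypothetical data

# Nested calls: read from the inside out.
nested <- round(mean(sqrt(x)), 1)

# Piped: read left to right; each result becomes the first argument
# of the next function.
piped <- x %>% sqrt() %>% mean() %>% round(1)

identical(nested, piped)
```

Both lines compute the same thing, which is exactly the incongruity: a reader has to be fluent in both spellings.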
After the session I told Tal that I thought it was a great talk. Once he recovered from the shock, he managed to ask me why I thought that. I had a few reasons:
- Tal was being Tal, and the audience got to see that. His personality came through.
- In general I think showing code in a talk is a mistake. Tal broke this rule but made it work. There were only ever small snippets of code visible at a time, and the snippets were simple enough that they moved the story along (as opposed to the usual effect of killing off the story).
- Each slide was simple enough to be understood with minimal effort (given Tal’s comments).
Matt Sundquist showed us plotly. It is mind-blowing.
What else can I say?
However, he made a statement that sounded to me like grinding gears to a mechanic. He said he wasn’t a real programmer because when he came to something that he didn’t know he just searched for it on the web and copied what he found. That is exactly what “real” programmers do.
Encores have been announced: 2015 September 14-16 in London and 2015 November 2-4 in Boston. Stay tuned at http://www.earl-conference.com/.