Last Friday the Cologne R user group came together for the 15th time. Since its inception over three years ago, the group has evolved from a small gathering in a pub into an active data science community that covers wider topics than just R. Still, R is the link and glue between the different interests. Last Friday’s agenda was a good example of this, with three talks touching on workflow management, web development and risk analysis.
R in a big data pipeline
Yuki Katoh had travelled all the way from Berlin to present on how to embed R with luigi into a heterogeneous workflow of different applications. This is especially useful when R needs to be integrated with Hadoop/HDFS-based technologies, such as Spark and Hive. Luigi is not unlike Make, which Kirill presented at our last meeting in June. In a configuration file Yuki specified the various workflow steps and the dependencies between the jobs.
Kicking off the luigi script starts the workflow, and the luigid server allows Yuki to monitor the various parts of the dependency graph visually. Thus, he can see the progress of his workflow in real time and quickly identify when and where a subprocess fails. As Yuki pointed out, this becomes critical in production systems, where failures need to be known and fixed quickly, unlike when one carries out an exploratory analysis in a development/research environment. See also Yuki’s blog post for further details.
Shiny + Shinyjs
Shiny is a very popular R package that allows users to develop interactive browser applications. Paul Viefers introduced us to the extension shinyjs.
Paul showed us an example of a Shiny app that plotted a different graph depending on the user. Behind the scenes his script would either hide or show those plots, conditional on the user. With only a few lines of R he was able to develop a user-specific application. To achieve this he created a login screen that checks user name and password. In his example he had hard-coded the login credentials instead of using a secure connection via a professional Shiny Server instance. However, this was sufficient for his purpose, where he tests how students react to different economic scenarios in a lab environment at university.
Experience vs. Data
The last talk of the meeting had a more statistical focus with examples from insurance. I repeated my talk from the LondonR user group meeting in June. One of the challenges in insurance is that, despite having many customers, insurance companies have little claims data per customer to assess risks.
I presented some Bayesian ideas to analyse risks with little data. I used the wonderful “Hit and run accident” example from Daniel Kahneman’s book Thinking, Fast and Slow to explain Bayes’ formula, introduced Bayesian belief networks for a claims analysis and discussed the challenge of predicting events when they haven’t happened yet (also in Stan). Along the way I mentioned a few ideas on communicating risk, which I learned from David Spiegelhalter earlier this year.
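For readers who don't have the book at hand, here is a small sketch of Bayes’ formula applied to the hit and run example. The numbers are the ones commonly quoted for Kahneman’s cab problem (15% of cabs are Blue, 85% are Green, and the witness identifies the colour correctly 80% of the time). They are not taken from my slides, and the function and argument names are made up for illustration.

```python
def posterior(prior, hit_rate, false_alarm_rate):
    """Bayes' formula for a binary hypothesis.

    prior            -- P(H), e.g. the share of Blue cabs
    hit_rate         -- P(E | H), witness says Blue when the cab is Blue
    false_alarm_rate -- P(E | not H), witness says Blue when the cab is Green
    """
    # Total probability of the evidence, P(E).
    evidence = prior * hit_rate + (1 - prior) * false_alarm_rate
    # P(H | E) = P(E | H) * P(H) / P(E)
    return prior * hit_rate / evidence


# Probability that the cab was Blue, given the witness said "Blue".
p_blue = posterior(prior=0.15, hit_rate=0.80, false_alarm_rate=0.20)
print(round(p_blue, 3))  # 0.414
```

Although the witness is right 80% of the time, the low base rate of Blue cabs pulls the posterior probability down to about 41%, which is the point of the example: the prior matters when data is scarce.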
Next Kölner R meeting
Please get in touch if you would like to present at the next meeting.