Video: Mining Tweets with R

This post shares the video from the talk presented in 2014 by Eu Jin Lok at Melbourne R Users on mining tweets with R.

Twitter is currently ranked among the top 10 most-visited websites, averaging 500 million tweets a day. It is not just a microblogging outlet used by individuals and celebrities; big commercial organisations such as Telstra, NAB and Qantas also use it as a communication channel. However, few companies have deployed data analytics in this space because of the challenges in mining unstructured data. Hence, it is unclear what value can be achieved from mining Twitter data. Eu Jin embarks on a journey to explore some of the data mining techniques that can be applied to tweets to uncover potential gems for business or personal use.

Eu Jin Lok is a data analyst for iSelect and a graduate of Monash University with a Masters in Econometrics. He has been using R for more than 3 years, both professionally and for casual purposes (e.g. Kaggle). This will be his 2nd talk for the MelbURN group, and in it he will embark on the task of applying data mining techniques to Twitter feeds using a real example.

Additional Resources:

Video: Introduction to R Shiny

This post shares the video from the talk presented in November 2013 by Alec Stephenson providing an introduction to R shiny at Melbourne R Users.

R Shiny, from the people behind RStudio, allows you to quickly and easily build basic web applications using only the R language. I will be demonstrating the basics of web app creation, and will show you a number of examples for purposes such as data visualization and student learning. The talk will require only rudimentary knowledge of R. After the talk (45 mins) you are welcome to join me at the Colonial Hotel for dinner.

Alec Stephenson is a CSIRO scientist and a former academic at The National University of Singapore and Swinburne University. It is his third talk for the MelbURN group, following previous talks on spatial data (Sept 2011) and speeding up your R code (Sept 2012). He has been writing R software since the days when there were only a hundred or so R packages. He still dislikes the ifelse function.

Additional Resources:

Video: Techniques to improve the accuracy of predictive models

This post shares the video from the talk presented in October 2013 by Phil Brierley on techniques to improve the accuracy of predictive models at Melbourne R Users.

The Heritage Health Prize was a two-year predictive analytics competition that recently concluded, and Melbourne-based Phil Brierley was the P in POWERDOT, the ‘winning’ team. In this talk Phil will share some* of the tips and tricks for building accurate predictive models (hopefully with some ‘live’ demonstrations using R) and tell the POWERDOT story.

By education, Phil Brierley is a Mechanical Engineer who became involved in predictive analytics during his doctorate, where he developed intelligent control systems using neural networks. Phil is the owner of Tiberius Data Mining, where he develops the visual data mining software Tiberius. He has worked at NAB and IBM in advanced analytics and is also a freelance data scientist currently working for a hedge fund. In his spare time he enjoys honing his techniques in competitive data mining and is a three-time Kaggle winner.

*Note: The competition is ongoing – so don’t expect to learn everything 😉

Additional Resources:

Video: Google Analytics with R

This post shares the video from the talk presented in September 2013 by Johann de Boer on accessing Google Analytics using R at Melbourne R Users.

In this presentation Johann will share his experience in creating his first open-source R package, ganalytics, used for accessing Google Analytics data. Reflecting on his journey to date in learning R, Johann will give tips to newcomers in helping them succeed in using R for their day to day work and in creating their own packages. A demonstration of ganalytics will follow with an invitation to the community to get involved in its future development.

Johann de Boer manages the Digital Analytics platform for Open Universities Australia. Through the collection and analysis of data, Johann generates insight into the behaviour of consumers to explore opportunities to better meet their needs. He works collaboratively with online marketers, business analysts and web developers to deliver user-centric online optimisations. Using the Google Analytics API within the R programming language, Johann focuses on data quality and automation to produce analyses that enhance business insight. Johann holds the Google Analytics Individual Qualification and has a background in web analytics, usability and accessibility consulting, spanning industries in Australia, New Zealand, and the UK.

Link to video on YouTube.

Additional Resources:

New Video: Credit Scoring & R: Reject inference, nested conditional models, & joint scores

This post shares the video from the talk presented in August 2013 by Ross Gayler on Credit Scoring and R at Melbourne R Users.

Credit scoring tends to involve the balancing of mutually contradictory objectives spiced with a liberal dash of methodological conservatism. This talk emphasises the craft of credit scoring, focusing on combining technical components with some less common analytical techniques. The talk describes an analytical project which R helped to make relatively straightforward.

Ross Gayler describes himself as a recovered psychologist who studied rats and stats (minus the rats) a very long time ago. Since then he has mostly worked in credit scoring (predictive modelling of risk-related customer behaviour in retail finance) and has forgotten most of the statistics he ever knew.

Credit scoring involves counterfactual reasoning. Lenders want to set policies based on historical experience, but what they really want to know is what would have happened if their historical policies had been different. The statistical consequence is twofold: we are required to build statistical models of structure that is not explicitly present in the available data, and the available data is systematically censored. The simplest example of this is that the applicants who are estimated to have the highest risk are declined credit and, consequently, we do not have explicit knowledge of how they would have performed if they had been accepted. Overcoming this problem is known as ‘reject inference’ in credit scoring. Reject inference is typically discussed as a single-level phenomenon, but in reality there can be multiple levels of censoring. For example, an applicant who has been accepted by the lender may withdraw their application, with the consequence that we don’t know whether they would have successfully repaid the loan had they taken up the offer.
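
To make the censoring concrete, here is a minimal base R sketch. It is not taken from the talk: the applicant data, the column names and the 0.5 cutoff are all made up for illustration. It only shows why the outcomes observed for accepted applicants understate the risk of the full applicant pool.

```r
# Hypothetical applicant pool: a risk score and the outcome each applicant
# WOULD have had if accepted (unknowable in practice for declined applicants).
applicants <- data.frame(
  risk_score    = c(0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.70, 0.80, 0.90, 0.95),
  would_default = c(0, 0, 0, 0, 1, 0, 1, 1, 1, 1)
)

# Assumed historical policy: decline anyone scoring above the cutoff.
cutoff <- 0.5
applicants$accepted <- applicants$risk_score <= cutoff

# Outcomes are observed only for accepted applicants; the rest are censored.
applicants$observed_default <- applicants$would_default
applicants$observed_default[!applicants$accepted] <- NA

# Default rate over everyone versus the rate visible in the censored data.
true_rate     <- mean(applicants$would_default)
observed_rate <- mean(applicants$observed_default, na.rm = TRUE)

cat("Default rate, all applicants:", true_rate, "\n")
cat("Default rate, accepted only: ", observed_rate, "\n")
```

A model fitted only to the accepted rows sees the lower rate, which is the gap that reject inference tries to close.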

Independently of reject inference, it is standard to summarise all the available predictive information as a single score that predicts a behaviour of interest. In reality, there may be multiple behaviours that need to be simultaneously considered in decision making. These may be predicted by multiple scores and in general there will be interactions between the scores — so they need to be considered jointly in decision making. The standard technique for implementing this is to divide each score into a small number of discrete levels and consider the cross-tabulation of both scores. This is simple but limited because it does not make optimal use of the data, raises problems of data sparsity, and makes it difficult to achieve a fine level of control.
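
The standard cross-tabulation approach described above can be sketched in a few lines of base R. The scores, band boundaries and labels below are invented for illustration and are not from the project in the talk.

```r
# Hypothetical pair of scores for ten customers (values are illustrative).
scores <- data.frame(
  score_a = c(10, 25, 40, 55, 70, 85, 15, 65, 90, 35),
  score_b = c(80, 20, 60, 45, 30, 95, 50, 75, 10, 5)
)

# Assumed banding: cut each score into three discrete levels.
breaks <- c(0, 33, 66, 100)
labels <- c("low", "mid", "high")
scores$band_a <- cut(scores$score_a, breaks = breaks, labels = labels)
scores$band_b <- cut(scores$score_b, breaks = breaks, labels = labels)

# Cross-tabulate the two banded scores; a policy would be set per cell.
joint <- table(band_a = scores$band_a, band_b = scores$band_b)
print(joint)
```

Even in this toy example one of the nine cells is empty, which hints at the sparsity problem: with more bands per score, many cells hold too few cases to support a decision.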

This talk covers a project that dealt with multiple, nested reject inference problems in the context of two scores to be considered jointly. It involved multivariate smoothing spline regression and some general R carpentry to plug all the pieces together.

Additional Resources:

Video: R, ProjectTemplate, RStudio and GitHub: Automate the boring bits and get on with the fun stuff

This post shares the video from the talk presented on 15th May 2013 by Dr Kendra Vant on ProjectTemplate, GitHub and RStudio at Melbourne R Users.

Overview: Want to minimise the drudge work of data prep? Get started with test-driven development? Bring structure and discipline to your analytics (relatively) painlessly? Boost the productivity of your team of data gurus? Take the first step with a guided tour of ProjectTemplate, the RStudio projects functionality and integration with GitHub.

Speaker: Kendra Vant works with the Insight Solutions team at Deloitte, designing and implementing analytic capabilities for corporate and government clients across Australia. Previous experience includes leading teams in marketing analytics and BI strategy, building bespoke enterprise software systems, trapping ions in microchips to create two-bit quantum computers and firing lasers at very cold hydrogen atoms. Kendra has worked in New Zealand, Australia, Malaysia and the US and holds a PhD in Physics from MIT.

Additional Resources:

Video: Using R for causal inference in a study of expensive public policy decisions

This post shares the video from a talk presented on 9th April 2013 by Jim Savage at Melbourne R Users.

Billions of dollars a year are spent subsidising tuition of Australian university students. A controversial report last year by the Grattan Institute, Graduate Winners, asked ‘is this the best use of government money?’

In this talk, Jim Savage, one of the researchers who worked on the report, walks us through the process of doing the analysis in R. The talk will focus on potential pitfalls/annoyances in this sort of research, and on causal inference when all we have is observational data. He will also outline his new method of building synthetic control groups of observational data using tools more commonly associated with data mining.

Jim Savage is an applied economist at the Grattan Institute, where he has researched education policy, the structure of the Australian economy, and fiscal policy. Before that, he worked in macroeconomic modelling at the Federal Treasury.

Additional Resources:

Video: High scale in-database modeling in Greenplum with R

The following post presents the video of a talk by Hong Ooi who presented at Melbourne R Users, March 2013.

Content: Greenplum is a massively parallel relational database platform. R is one of the top languages in the data scientist/applied statistician community. In this talk, Hong gives an overview of how they work together, both with R on the desktop and as an embedded in-database analytics tool. It’ll be a variation of a talk recently presented at the useR! 2012 Conference.

Speaker: Hong Ooi graduated from Macquarie University with a BEc in actuarial studies, then worked with NRMA Insurance/IAG in Sydney for many years. He completed a Masters in Applied Stats at Macquarie in 1997, and a PhD in statistics at ANU from 2000 to 2004. He displayed impeccable timing by switching jobs to St George Bank on the eve of the global financial crisis. He moved to Melbourne in 2009, before joining the Greenplum data science team in 2012.


Video: Survey Package in R

Sebastián Duchêne presented a talk at Melbourne R Users on 20th February 2013 on the Survey Package in R.

Talk Overview: Complex designs are common in survey data. In practice, collecting random samples from a population is costly and impractical. Therefore the data are often non-independent or disproportionately sampled, and violate the typical assumption of independent and identically distributed (IID) samples. The survey package in R (written by Thomas Lumley) is a powerful tool that incorporates the survey design into the analysis of the data. Standard statistics, from linear models to survival analysis, are implemented with the corresponding mathematical corrections. This talk will provide an introduction to survey statistics and the survey package. There will be a brief overview of complex designs and some of the theory behind their analysis, followed by a demonstration using the survey package.
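
As a rough illustration of why the design matters, the base R sketch below compares a naive mean with a design-weighted mean for a disproportionately sampled population. It is not from the talk: the strata, incomes and population sizes are invented. It also covers only the point estimate, whereas the survey package (for example via svydesign() and svymean()) handles standard errors too.

```r
# Hypothetical stratified sample: urban households are over-sampled.
sample_data <- data.frame(
  stratum = c("urban", "urban", "urban", "urban", "rural", "rural"),
  income  = c(60, 70, 80, 90, 30, 50),
  stringsAsFactors = FALSE
)

# Assumed known population size of each stratum.
pop_size <- c(urban = 400, rural = 600)

# Sampling weight = population units represented by each sampled unit.
n_in_stratum <- ave(sample_data$income, sample_data$stratum, FUN = length)
sample_data$weight <- unname(pop_size[sample_data$stratum]) / n_in_stratum

# The naive mean treats the sample as IID; the weighted mean respects the design.
naive_mean    <- mean(sample_data$income)
weighted_mean <- weighted.mean(sample_data$income, sample_data$weight)

cat("Naive mean:   ", naive_mean, "\n")
cat("Weighted mean:", weighted_mean, "\n")
```

The naive estimate is pulled towards the over-sampled urban stratum, while the weighted estimate reflects the assumed 40/60 population split.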

About the presenter: Sebastián Duchêne is a Ph.D. candidate at The University of Sydney, based at the Molecular Phylogenetics, Ecology, and Evolution Lab. His broad area of research is virus evolution. His current projects include an R package for evolutionary analysis, and the development of statistical models for molecular epidemiology. In addition to his PhD studies, he is a reviewer for the PLoS ONE academic journal in the area of evolution and bioinformatics. Before coming to Sydney, he was a data analyst at the National Oceanic and Atmospheric Administration (NOAA) in the USA. A list of his publications can be found here.

See here for the full list of Melbourne R User Videos.

Video: SimpleR tricks and tools: Help, debugging, git, LaTeX, and workflow with R by Prof Rob Hyndman

This post shares the video from a talk presented on 20th November 2012 by Professor Rob Hyndman at Melbourne R Users. The talk provides an introduction to:

  • Getting R help
  • Debugging R functions
  • R style guides
  • Making good use of .Rprofile files
  • Having a good R workflow
  • Version control facilities
  • Using R with LaTeX (without using Sweave or knitr)
  • Turning functions into packages

Prof Rob J Hyndman has used R and its predecessors (S and S+) almost every working day (and some weekends) for the past 25 years. He thought it might be helpful to discuss some of what he has learned and the tricks and tools that he uses.

Rob J Hyndman is Professor of Statistics at Monash University and Director of the Monash University Business and Economic Forecasting Unit. He completed a science degree at the University of Melbourne in 1988 and a PhD on nonlinear time series modelling at the same university in 1992. He has worked at the University of Melbourne, Colorado State University, the Australian National University and Monash University. Rob is Editor-in-Chief of the “International Journal of Forecasting” and a Director of the International Institute of Forecasters. He has written over 100 research papers in statistical science. In 2007, he received the Moran medal from the Australian Academy of Science for his contributions to statistical research. Rob is co-author of the well-known textbook “Forecasting: methods and applications” (Wiley, 3rd ed., 1998) and of the book “Forecasting with exponential smoothing: the state space approach” (Springer, 2008). He is also the author of the widely-used “forecast” package for R. For over 25 years, Rob has maintained an active consulting practice, assisting hundreds of companies and organizations on forecasting problems. His recent consulting work has involved forecasting electricity demand, tourism demand and the Australian government health budget. More information is available on his website at

Additional Resources: