Video: Mining Tweets with R

This post shares the video from the talk presented in 2014 by Eu Jin Lok at Melbourne R Users on mining tweets with R.

Twitter is currently ranked among the top 10 most-visited websites and averages 500 million tweets a day. It is not just a microblogging outlet for individuals and celebrities; big commercial organisations such as Telstra, NAB and Qantas also use it as a communication channel. However, few companies have deployed data analytics in this space because of the challenges of mining unstructured data, so it is unclear what value can be extracted from Twitter data. Eu Jin embarks on a journey through some of the data mining techniques that can be applied to tweets to uncover potential gems for business or personal use.
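For a taste of what such an analysis looks like, here is a minimal sketch of tweet text mining with the tm package. The tweet texts below are toy examples; in practice you would pull them from the Twitter API, e.g. with the twitteR package's searchTwitter().

```r
# A minimal sketch: basic text mining on a vector of tweet texts.
# The tweets here are toy examples, not real data.
library(tm)

tweets <- c("Loving the new #rstats meetup in Melbourne!",
            "Qantas flight delayed again... #travel",
            "Telstra outage this morning, no internet at work")

corpus <- Corpus(VectorSource(tweets))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

tdm <- TermDocumentMatrix(corpus)       # terms x tweets matrix
findFreqTerms(tdm, lowfreq = 1)         # terms appearing at least once
```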

Eu Jin Lok is a data analyst at iSelect and a graduate of Monash University with a Masters in Econometrics. He has been using R for more than 3 years, both professionally and for casual purposes (e.g. Kaggle). This is his second talk for the MelbURN group, in which he takes on the task of applying data mining techniques to Twitter feeds using a real example.

Additional Resources:

Video: Introduction to R Shiny

This post shares the video from the talk presented in November 2013 by Alec Stephenson at Melbourne R Users, providing an introduction to R Shiny.

R Shiny, from the people behind RStudio, allows you to quickly and easily build basic web applications using only the R language. I will demonstrate the basics of web app creation, and will show you a number of examples for purposes such as data visualisation and student learning. The talk requires only rudimentary knowledge of R. After the talk (45 mins) you are welcome to join me at the Colonial Hotel for dinner.
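To show just how little code a basic app needs, here is a minimal sketch of a Shiny app in the single-file style of more recent Shiny versions (the layout and variable names are illustrative only):

```r
library(shiny)

# UI: a slider and a plot placeholder
ui <- fluidPage(
  titlePanel("Minimal Shiny demo"),
  sliderInput("n", "Number of observations:", min = 10, max = 500, value = 100),
  plotOutput("hist")
)

# Server: re-draws the histogram whenever the slider moves
server <- function(input, output) {
  output$hist <- renderPlot({
    hist(rnorm(input$n), main = paste(input$n, "random normals"))
  })
}

shinyApp(ui = ui, server = server)  # launch the app
```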

Alec Stephenson is a CSIRO scientist and a former academic at the National University of Singapore and Swinburne University. This is his third talk for the MelbURN group, following previous talks on spatial data (Sept 2011) and speeding up your R code (Sept 2012). He has been writing R software since the days when there were only a hundred or so R packages. He still dislikes the ifelse function.

Additional Resources:

New Video: Credit Scoring & R: Reject inference, nested conditional models, & joint scores

This post shares the video from the talk presented in August 2013 by Ross Gayler on Credit Scoring and R at Melbourne R Users.

Credit scoring tends to involve balancing mutually contradictory objectives, spiced with a liberal dash of methodological conservatism. This talk emphasises the craft of credit scoring, focusing on combining technical components with some less common analytical techniques. It describes an analytical project which R helped to make relatively straightforward.

Ross Gayler describes himself as a recovered psychologist who studied rats and stats (minus the rats) a very long time ago. Since then he has mostly worked in credit scoring (predictive modelling of risk-related customer behaviour in retail finance) and has forgotten most of the statistics he ever knew.

Credit scoring involves counterfactual reasoning. Lenders want to set policies based on historical experience, but what they really want to know is what would have happened if their historical policies had been different. The statistical consequence is that we are required to model structure that is not explicitly present in the available data, and that the available data are systematically censored. The simplest example is that the applicants estimated to have the highest risk are declined credit, so we have no explicit knowledge of how they would have performed had they been accepted. Overcoming this problem is known as 'reject inference' in credit scoring. Reject inference is typically discussed as a single-level phenomenon, but in reality there can be multiple levels of censoring. For example, an applicant who has been accepted by the lender may withdraw their application, so we don't know whether they would have successfully repaid the loan had they taken up the offer.
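To make the censoring concrete, here is a toy simulation (not from the talk): when the riskiest applicants are declined, the bad rate the lender observes among accepts is very different from the bad rate in the full applicant population.

```r
# Toy illustration (not from the talk): declining high-risk applicants
# censors the outcome data the lender gets to see.
set.seed(42)
n      <- 100000
risk   <- runif(n)             # each applicant's true default probability
bad    <- rbinom(n, 1, risk)   # 1 = defaulted, 0 = repaid
accept <- risk < 0.5           # policy: decline the riskiest half

mean(bad)          # population bad rate: ~0.50
mean(bad[accept])  # bad rate the lender actually observes: ~0.25
```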

Independently of reject inference, it is standard to summarise all the available predictive information as a single score that predicts a behaviour of interest. In reality, there may be multiple behaviours that need to be considered simultaneously in decision making. These may be predicted by multiple scores, and in general there will be interactions between the scores, so they need to be considered jointly. The standard technique is to divide each score into a small number of discrete levels and consider the cross-tabulation of the scores. This is simple but limited: it does not make optimal use of the data, raises problems of data sparsity, and makes it difficult to achieve a fine level of control.
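The standard cross-tabulation approach can be sketched in a few lines of base R (scores, outcomes and band counts below are hypothetical):

```r
# Hypothetical data: two scores and an observed outcome per account
set.seed(1)
d <- data.frame(score1 = rnorm(5000), score2 = rnorm(5000))
d$bad <- rbinom(5000, 1, plogis(-1 - d$score1 - 0.5 * d$score2))

# Band each score into a handful of discrete levels (quartiles here)
d$band1 <- cut(d$score1, breaks = quantile(d$score1, 0:4 / 4),
               include.lowest = TRUE)
d$band2 <- cut(d$score2, breaks = quantile(d$score2, 0:4 / 4),
               include.lowest = TRUE)

tapply(d$bad, list(d$band1, d$band2), mean)  # bad rate per cell
table(d$band1, d$band2)                      # cell counts: sparsity shows up here
```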

This talk covers a project that dealt with multiple, nested reject inference problems in the context of two scores to be considered jointly. It involved multivariate smoothing spline regression and some general R carpentry to plug all the pieces together.
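In the spirit of the talk's approach, a smoother alternative to banding can be sketched with the mgcv package, using a tensor-product spline over the two scores (the talk's actual implementation may well differ):

```r
# Continuing the hypothetical data d from the sketch above
library(mgcv)

# Model the bad rate jointly and smoothly over both scores
fit <- gam(bad ~ te(score1, score2), family = binomial, data = d)

# Predicted probability of 'bad' over a fine grid of score pairs,
# giving much finer control than a coarse cross-tabulation
grid <- expand.grid(score1 = seq(-2, 2, length = 50),
                    score2 = seq(-2, 2, length = 50))
grid$p_bad <- predict(fit, newdata = grid, type = "response")
head(grid)
```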

Additional Resources:

Video: R, ProjectTemplate, RStudio and GitHub: Automate the boring bits and get on with the fun stuff

This post shares the video from the talk presented on 15th May 2013 by Dr Kendra Vant on ProjectTemplate, GitHub and RStudio at Melbourne R Users.

Overview: Want to minimise the drudge work of data prep? Get started with test driven development? Bring structure and discipline to your analytics (relatively) painlessly? Boost the productivity of your team of data gurus? Take the first step with a guided tour of ProjectTemplate, the RStudio projects functionality and integration with GitHub.
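For those who want a peek before watching, the basic ProjectTemplate workflow looks roughly like this (the project name is arbitrary):

```r
library(ProjectTemplate)

create.project("my-analysis")  # scaffold the standard directory layout
setwd("my-analysis")

# Drop raw data files into data/ and munging scripts into munge/, then:
load.project()  # loads libraries, data and munged objects in one call
```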

Speaker: Kendra Vant works with the Insight Solutions team at Deloitte, designing and implementing analytic capabilities for corporate and government clients across Australia. Previous experience includes leading teams in marketing analytics and BI strategy, building bespoke enterprise software systems, trapping ions in microchips to create two-qubit quantum computers and firing lasers at very cold hydrogen atoms. Kendra has worked in New Zealand, Australia, Malaysia and the US and holds a PhD in Physics from MIT.

Additional Resources:

Video: Using R for causal inference in a study of expensive public policy decisions

This post shares the video from a talk presented on 9th April 2013 by Jim Savage at Melbourne R Users.

Billions of dollars a year are spent subsidising tuition of Australian university students. A controversial report last year by the Grattan Institute, Graduate Winners, asked ‘is this the best use of government money?’

In this talk, Jim Savage, one of the researchers who worked on the report, walks us through the process of doing the analysis in R. The talk focuses on potential pitfalls and annoyances in this sort of research, and on causal inference when all we have is observational data. He also outlines his new method of building synthetic control groups from observational data using tools more commonly associated with data mining.
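Jim's synthetic-control method is his own; as generic background only, a standard way to build a comparison group from observational data is matching, sketched here with the MatchIt package (the data frame and variable names are hypothetical):

```r
# Generic background only -- standard propensity-score matching,
# not Jim's synthetic control method. Variable names are hypothetical.
library(MatchIt)

# treat: 1 if the unit received the intervention, 0 otherwise
# x1, x2: observed pre-treatment covariates
m <- matchit(treat ~ x1 + x2, data = students, method = "nearest")
summary(m)                # covariate balance before/after matching
matched <- match.data(m)  # treated units plus their matched controls
```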

Jim Savage is an applied economist at the Grattan Institute, where he has researched education policy, the structure of the Australian economy, and fiscal policy. Before that, he worked in macroeconomic modelling at the Federal Treasury.

Additional Resources:

Video: High scale in-database modeling in Greenplum with R

The following post presents the video of a talk by Hong Ooi, presented at Melbourne R Users in March 2013.

Content: Greenplum is a massively parallel relational database platform. R is one of the top languages in the data scientist/applied statistician community. In this talk, Hong gives an overview of how they work together, both with R on the desktop and with R embedded as an in-database analytics tool. It is a variation of a talk recently presented at the useR! 2012 conference.
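Since Greenplum speaks the PostgreSQL wire protocol, the desktop-R side of the workflow can be sketched with the RPostgreSQL package (connection details and table names below are placeholders; the embedded in-database side uses PL/R and is not shown here):

```r
# Desktop R talking to Greenplum over the PostgreSQL protocol.
# Host, database, credentials and table are placeholders.
library(RPostgreSQL)

con <- dbConnect(dbDriver("PostgreSQL"),
                 host = "gp-master.example.com",
                 dbname = "analytics",
                 user = "analyst", password = "secret")

# Push the heavy aggregation into the database, pull back a small result
res <- dbGetQuery(con, "
  SELECT region, AVG(claim_amount) AS avg_claim
  FROM   claims
  GROUP  BY region")

dbDisconnect(con)
head(res)
```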

Speaker: Hong Ooi graduated from Macquarie University with a BEc in actuarial studies, then worked with NRMA Insurance/IAG in Sydney for many years. Completed a Masters in Applied Stats from Macquarie in 1997, and a PhD in statistics from ANU from 2000-2004. Displayed impeccable timing by switching jobs to St George Bank on the eve of the global financial crisis.Moved to Melbourne in 2009, before joining the Greenplum data science team in 2012.


Video: Survey Package in R

Sebastián Duchêne presented a talk at Melbourne R Users on 20th February 2013 on the Survey Package in R.

Talk Overview: Complex designs are common in survey data. In practice, collecting a simple random sample from a population is costly and often impractical, so the data are frequently non-independent or disproportionately sampled, violating the usual assumption of independent and identically distributed (IID) observations. The survey package in R (written by Thomas Lumley) is a powerful tool that incorporates the survey design into the analysis. Standard statistics, from linear models to survival analysis, are implemented with the corresponding mathematical corrections. This talk provides an introduction to survey statistics and the survey package: a brief overview of complex designs and some of the theory behind their analysis, followed by a demonstration using the package.
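The core pattern of the survey package is to declare the design once and then use design-aware analogues of the usual estimators. A small sketch using the package's bundled api dataset (a stratified sample of Californian schools):

```r
library(survey)
data(api)  # bundled example: samples of Californian schools

# Declare the design once: stratified by school type, with sampling
# weights and finite population corrections
dstrat <- svydesign(id = ~1, strata = ~stype, weights = ~pw,
                    fpc = ~fpc, data = apistrat)

svymean(~api00, dstrat)              # design-corrected mean and SE
svyglm(api00 ~ ell + meals, dstrat)  # design-aware regression
```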

About the presenter: Sebastián Duchêne is a Ph.D. candidate at The University of Sydney, based at the Molecular Phylogenetics, Ecology, and Evolution Lab. His broad area of research is virus evolution. His current projects include an R package for evolutionary analysis, and the development of statistical models for molecular epidemiology. In addition to his PhD studies, he is a reviewer for the PLoS ONE academic journal in the area of evolution and bioinformatics. Before coming to Sydney, he was a data analyst at the National Oceanic and Atmospheric Administration (NOAA) in the USA. A list of his publications can be found here.

See here for the full list of Melbourne R User Videos.

Video: SimpleR tricks and tools: Help, debugging, git, LaTeX, and workflow with R by Prof Rob Hyndman

This post shares the video from a talk presented on 20th November 2012 by Professor Rob Hyndman at Melbourne R Users. The talk provides an introduction to:

  • Getting R help
  • Debugging R functions
  • R style guides
  • Making good use of .Rprofile files
  • Having a good R workflow
  • Version control facilities
  • Using R with LaTeX (without using Sweave or knitr)
  • Turning functions into packages

Prof Rob J Hyndman has used R and its predecessors (S and S+) almost every working day (and some weekends) for the past 25 years. He thought it might be helpful to discuss some of what he has learned and the tricks and tools that he uses.
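As a small taste of the help and debugging topics in the list above (best run interactively; the function f below is a toy example):

```r
# Getting help
?optim                    # help page for a function
help.search("forecast")   # search the installed documentation

# Debugging a (toy) function
f <- function(x) sqrt(log(x))
debug(f)      # step through f line by line on its next call
f(10)
undebug(f)

# After any error, traceback() prints the call stack;
# options(error = recover) drops you into a browser at the error.
traceback()
options(error = recover)
```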

Rob J Hyndman is Professor of Statistics at Monash University and Director of the Monash University Business and Economic Forecasting Unit. He completed a science degree at the University of Melbourne in 1988 and a PhD on nonlinear time series modelling at the same university in 1992. He has worked at the University of Melbourne, Colorado State University, the Australian National University and Monash University. Rob is Editor-in-Chief of the “International Journal of Forecasting” and a Director of the International Institute of Forecasters. He has written over 100 research papers in statistical science. In 2007, he received the Moran medal from the Australian Academy of Science for his contributions to statistical research. Rob is co-author of the well-known textbook “Forecasting: methods and applications” (Wiley, 3rd ed., 1998) and of the book “Forecasting with exponential smoothing: the state space approach” (Springer, 2008). He is also the author of the widely-used “forecast” package for R. For over 25 years, Rob has maintained an active consulting practice, assisting hundreds of companies and organizations on forecasting problems. His recent consulting work has involved forecasting electricity demand, tourism demand and the Australian government health budget. More information is available on his website.

Additional Resources:

Video on S3 Classes in R by Dr Andrew Robinson

This post shares the video from the talk presented on 15th August 2012 by Dr Andrew Robinson on S3 Classes at Melbourne R Users. S3 classes are baked into R; their influence permeates the language and how we interact with it. This talk introduces S3 classes, and why they are relevant to all R users. The talk covers their definition, interpretation, construction, and manipulation.
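A minimal sketch of the S3 pattern (a toy "circle" class, not from the talk): an object is just a value with a class attribute, and generic functions such as print() dispatch on that attribute.

```r
# Constructor: a plain list tagged with a class attribute
circle <- function(r) structure(list(radius = r), class = "circle")

# Method: print.circle() is found by dispatch on the class attribute
print.circle <- function(x, ...) {
  cat("Circle of radius", x$radius, "\n")
  invisible(x)
}

c1 <- circle(2)
print(c1)     # dispatches to print.circle
class(c1)     # "circle"
unclass(c1)   # strip the class to see the underlying list
```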

Andrew Robinson is deputy director of ACERA and senior lecturer in applied statistics at the University of Melbourne. He is co-author of 2.95 books on R: "Forest Analytics with R" and "Introduction to Scientific Programming and Simulation Using R", and the hopefully soon-to-be-completed "Methods of Statistical Model Estimation".

Additional Resources:

Video: Getting staRted with R: An accelerated primer by Lyndon Walker – Melbourne R Users

This post shares the video from a talk presented on 20th June 2012 by Dr Lyndon Walker (see Meetup page). The talk was titled “Getting staRted with R: An accelerated primer”.

To quote the outline of the talk:

R is a brilliant piece of software but learning it by yourself, particularly if you have not used command-line software before, can be daunting. This presentation is aimed at introducing beginner to intermediate users of R to some of the basic features of the program (through to programming a basic function). Experienced R users are also encouraged to attend, to share their knowledge and help the first-timers.

This presentation will be interactive, so if you are able to bring a laptop with R already installed this will help you participate (but is not mandatory). Two other handy pieces of software are the RStudio development interface and the Tinn-R text editor, both of which help you to organise and save your R code.
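In the spirit of the outline's "through to programming a basic function", a first user-defined function might look something like this (a toy example, not from the talk):

```r
# A first user-defined function: Fahrenheit to Celsius
f_to_c <- function(f) {
  (f - 32) * 5 / 9
}

f_to_c(212)          # 100
f_to_c(c(32, 98.6))  # vectorised: 0 and 37
```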

Lyndon Walker has been using R for nearly half his life. He studied and worked at the UniveRsity of Auckland, the birthplace of R, and is currently a Senior Lecturer in Applied Statistics at Swinburne University of Technology in Melbourne.