## H2O Benchmark for CSV Import

June 25, 2017

The importFile() function in H2O is extremely efficient thanks to its parallel reading. The benchmark comparison below shows that it is comparable to read.df() in SparkR and significantly faster than the generic read.csv().

## Data visuals notes for my talks in 2017

June 25, 2017

Supplementary notes for CJ Brown’s talks on dataviz in 2017 for Griffith University’s honours students and the UQ Winterschool in Bioinformatics. Structure of this talk: Tools for dataviz; Eleven principles...

## bigrquery 0.4.0

June 25, 2017

I’m pleased to announce that bigrquery 0.4.0 is now on CRAN. bigrquery makes it possible to talk to Google’s BigQuery cloud database. It provides both DBI and dplyr backends so you can interact with BigQuery using either low-level SQL or high-level dplyr verbs. Install the latest version of bigrquery with `install.packages("bigrquery")`. Basic usage: connect to a BigQuery database using DBI: `library(dplyr)` `con` … `"github_nested"`...

## Balancing on multiple factors when the sample is too small to stratify

June 25, 2017

Ideally, a study that uses randomization provides a balance of characteristics that might be associated with the outcome being studied. This way, we can be more confident that any differences in outcomes between the groups are due to the group assignments and not to differences in characteristics. Unfortunately, randomization does not guarantee balance, especially with smaller sample sizes. If...

## Hex stickers for the forecast package

June 25, 2017

I’ve caved in to the hex sticker craze, and produced some hex stickers for the forecast package for R. If you attend a workshop I teach, I’ll give you one. Otherwise you can order (in bulk) from hexi.pics.

## Data visuals notes for my talks in 2017

June 25, 2017

Supplementary notes for CJ Brown’s talks on dataviz in 2017 for Griffith University’s honours students and the UQ Winterschool in Bioinformatics. Visualising sexual dimorphism in elephant ...

## Data Visualization with googleVis exercises part 4

June 25, 2017

Adding Features to your Charts. The previous charts covered some basic, well-known chart types that googleVis offers. Before continuing with other, more sophisticated charts in the next parts, we are going to “dig a little deeper” and see some interesting features of those we already know.

## R Weekly Bulletin Vol – XII

June 25, 2017

This week’s R bulletin covers how to resolve some common errors in R. Hope you like this R weekly bulletin. Enjoy reading! Shortcut Keys: 1. Find and Replace – Ctrl+F; 2. Find Next – F3; 3. Find Previous – Shift+F3. Problem Solving Ideas: Resolving the ‘: cannot open the connection’ Error. There...

## Using Tweedie Parameter to Identify Distributions

June 24, 2017

In the development of operational loss models, it is important to identify which distribution should be used to model operational risk measures, e.g. frequency and severity. For instance, why should we use the Gamma distribution instead of the Inverse Gaussian distribution to model the severity? In my previous post https://statcompute.wordpress.com/2016/11/20/modified-park-test-in-sas, it is shown how to

## Using tidycensus and leaflet to map Census data

June 23, 2017

Recently, I have been following the development and release of Kyle Walker’s tidycensus package. I have been filled with amazement, delight, and well, perhaps another feeling… There should be a word for “the regret felt when an R 📦, which would have saved untold hours of your life, is released”… #rstats 🤔 https://t.co/2THN4MwedO — Mara Averick (@dataandme) May 31, 2017 But seriously,...

## Track changes in data with the lumberjack %>>%

June 23, 2017

So you are using this pipeline to have data treated by different functions in R. For example, you may be imputing some missing values using the simputation package. Let us first load the only realistic dataset in R: data(retailers, …

## The R community is one of R’s best features

June 23, 2017

R is incredible software for statistics and data science. But while the bits and bytes of software are an essential component of its usefulness, software needs a community to be successful. And that's an area where R really shines, as Shannon Ellis explains in this lovely ROpenSci blog post. For software, a thriving community offers developers, expertise, collaborators, writers...

## Logarithmic Scale Explained with U.S. Trade Balance

June 23, 2017

Skewed data prevail in real life. Unless you observe trivial or near-constant processes, data is skewed one way or another due to outliers, long tails, errors or something else. Such effects create problems in visualizations when a few data elements are much larger than the rest. Consider the U.S. 2016 merchandise trade partner balances data set, where each point is a country...
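
The effect described above can be illustrated with a minimal sketch. The values below are hypothetical and are not the trade data from the post; they only assume a small positive series with one extreme element. Note that a plain logarithm is undefined for zero or negative values (such as deficits), so real balance data would need extra handling that this sketch does not attempt.

```r
# Hypothetical positive values with one extreme outlier (not the trade data).
values <- c(1, 10, 100, 1e6)

# On the linear scale the largest value dwarfs the rest:
# every other point is at most 0.01% of the maximum.
linear_share <- values / max(values)

# log10 maps each factor of ten to one unit, so all points stay distinguishable.
log_values <- log10(values)

print(linear_share)
print(log_values)
```

Plotting `log_values` instead of `values` spreads the small elements out along the axis rather than squeezing them against zero.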

## Working With SPSS© Data in R

I needed to import SPSS© data for work. There are several options; I have used both the foreign and haven R packages. I prefer haven because it integrates better with R's tidyverse, and I switched to it from foreign once I verified that it behaves well with factors and solves the deprecated factor labels issue in newer R versions. The...

## State-space modelling of the Australian 2007 federal election

June 23, 2017

Pooling the polls with Bayesian statistics In an important 2005 article in the Australian Journal of Political Science, Simon Jackman set out a statistically-based approach to pooling polls in an election campaign. He describes the sensible intuitive...

## Operations Research with R

June 23, 2017

Stefan Feuerriegel: This blog entry concerns our course on “Operations Research with R” that we teach as part of our study program. We hope that the materials are of value to lecturers and everyone else working in the field of numerical optimization. Course outline: The course starts with a review of numerical and linear algebra …

## Hey! You there! You are welcome here

June 23, 2017

What's that? You've heard of R? You use R? You develop in R? You know someone else who's mentioned R? Oh, you're breathing? Well, in that case, welcome! Come join the R community! We recently had a group discussion at rOpenSci's #runconf17 in Los Angeles, CA about the R community. I initially opened the issue on GitHub. After this...

## Face Recognition in R

June 22, 2017

OpenCV is an incredibly powerful tool to have in your toolbox. I have had a lot of success using it in Python but very little success in R. I haven’t done too much other than searching Google, but it seems as if “imager” and “videoplayR” provide a lot of the functionality...

## May New Package Picks

June 22, 2017

Two hundred and twenty-nine new packages were submitted to CRAN in May. Here are my picks for the “Top 40”, organized into five categories: Data, Data Science and Machine Learning, Education, Miscellaneous, Statistics and Utilities. Data angstroms v0.0.1: Provides helper functions for working with Regional Ocean Modeling System (ROMS) output. bikedata v0.0.1: Download and aggregate data from public bicycle systems from around...

## Set Theory Arbitrary Union and Intersection Operations with R

June 22, 2017

Part 3 of 3 in the series Set Theory. The union and intersection set operations were introduced in a previous post using two sets. These set operations can be generalized to accept any number of sets. Arbitrary Set Unions Operation: Consider a set of infinitely many sets... It would...
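
As a rough illustration of generalizing the two-set operations, the sketch below folds base R's `union()` and `intersect()` over a list with `Reduce()`. The sets and variable names are hypothetical, and the example covers only a finite collection; it is not the post's own code, and the infinite case mentioned above cannot be enumerated this way.

```r
# Hypothetical example sets; the names are illustrative only.
sets <- list(
  a = c(1, 2, 3),
  b = c(2, 3, 4),
  c = c(3, 4, 5)
)

# Fold the two-set operations over the whole list:
# union(union(a, b), c) and intersect(intersect(a, b), c).
arbitrary_union <- Reduce(union, sets)
arbitrary_intersection <- Reduce(intersect, sets)

print(arbitrary_union)
print(arbitrary_intersection)
```

Because union and intersection are associative, the order in which `Reduce()` combines the sets does not change which elements end up in the result.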

## RTutor: Emission Certificates and Green Innovation

Which policy instruments should we use to cost-effectively reduce greenhouse gas emissions? For a given technological level there are many economic arguments in favour of tradeable emission certificates or a carbon tax: they generate static efficiency ...

## Interactive R visuals in Power BI

June 22, 2017

Power BI has long had the capability to include custom R charts in dashboards and reports. But in sharp contrast to standard Power BI visuals, these R charts were static. While R charts would update when the report data was refreshed or filtered, it wasn't possible to interact with an R chart on the screen (to display tool-tips, for...

## Two years as a Data Scientist at Stack Overflow

June 22, 2017

Last Friday marked my two year anniversary working as a data scientist at Stack Overflow. At the end of my first year I wrote a blog post about my experience, both to share some of what I’d learned and as a form of self-reflection. After another year, I’d like to revisit the topic. While my first post focused mostly on...

## Online portfolio allocation with a very simple algorithm

June 22, 2017

By Yuri Resende. Today we will use an online convex optimization technique to build a very simple algorithm for portfolio allocation. Of course, this is just an illustrative post and we are going to make some simplifying assumptions. The …

## Cluster Analysis of Twitter: Understanding Human Interactions for Business Improvement

June 22, 2017

Introduction: Most of the data readily available in the real world comes unlabeled. Getting the labels often entails manual classification, which can be a tedious and... The post Cluster Analysis of Twitter: Understanding Human Interactions for Business Improvement appeared first on NYC Data Science Academy Blog.

## Data wrangling : Reshaping

June 22, 2017

Data wrangling is a task of great importance in data analysis. It is the process of importing, cleaning and transforming raw data into actionable information for analysis. It is a time-consuming process, estimated to take about 60-80% of an analyst’s time. In this series we will go through this process. It will be...

## nanotime 0.2.0

June 22, 2017

A new version of the nanotime package for working with nanosecond timestamps just arrived on CRAN. nanotime uses the RcppCCTZ package for (efficient) high(er) resolution time parsing and formatting up to nanosecond resolution, and the bit64 package for the actual integer64 arithmetic. Thanks to a metric ton of work by Leonardo Silvestri, the package now uses S4 classes internally allowing for...

## Can we predict flu deaths with Machine Learning and R?

June 22, 2017

Among the many R packages, there is the outbreaks package. It contains datasets on epidemics, one of which is from the 2013 outbreak of influenza A H7N9 in China, as analysed by Kucharski et al (2014). I will be using their data as an example to test whether we can use Machine Learning algorithms for...

## Introducing Community Tutorials

June 22, 2017

Today we’re introducing Datazar Community Tutorials. At Datazar, we love writing tutorials and how-tos on R, Python, D3, research and science best practices in general. So starting today, we’re extending that ability to our users so they can share ...