Blog Archives

Absolute Deviation Around the Median

August 8, 2013

Median Absolute Deviation (MAD), or Absolute Deviation Around the Median as stated in the title, is a robust measure of dispersion. Robust statistics are statistics that perform well for data drawn from a wide range of probability distributions, especially non-normal ones. Unlike the standard mean/standard deviation combo, MAD is largely insensitive to the presence of outliers. This

Read more »
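As a rough illustration of the MAD teaser above, here is a minimal sketch in plain Python using only the standard library. The function name, the `scale` parameter, and the toy numbers are my own assumptions for illustration and are not taken from the post. MAD is the median of the absolute deviations from the median; multiplying by roughly 1.4826 makes it comparable to the standard deviation for normally distributed data.

```python
from statistics import median, stdev


def median_absolute_deviation(values, scale=1.0):
    """Return scale * median(|x - median(values)|).

    Hypothetical helper for illustration. Pass scale=1.4826 to make the
    result comparable to a standard deviation under normality.
    """
    values = list(values)
    center = median(values)
    deviations = [abs(x - center) for x in values]
    return scale * median(deviations)


# Made-up toy data: the second list swaps the last value for an outlier.
clean = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 100]

mad_clean = median_absolute_deviation(clean)
mad_outlier = median_absolute_deviation(with_outlier)

# The standard deviation is pulled far upward by the outlier,
# while the MAD is identical for both lists.
sd_clean = stdev(clean)
sd_outlier = stdev(with_outlier)
```

On this toy data the single extreme value changes the standard deviation dramatically but leaves the MAD untouched, which is the sense in which MAD is robust.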

The R User Conference 2013: Albacete, Spain

July 23, 2013

I was fortunate enough to attend the useR! 2013 conference in Albacete, Spain. I had a great time meeting fellow R users and exchanging ideas on R implementations. The conference is also one of the few opportunities to gain exposure to uses of R in other disciplines because there are so many talks

Read more »

Shiny Server on CentOS

June 29, 2013

I’ve been enjoying working with Joe Cheng’s Shiny Server and wanted to create a quick step-by-step guide to installing it on an AWS CentOS EC2 instance, since the standard Shiny Server instructions assume the typical dependencies are installed: 1. Shiny’s instructions say to install libssl-dev (sudo yum install libssl-dev); here is the CentOS equivalent: sudo yum install openssl-devel

Read more »

Data imputation I

June 12, 2013

I recently entered Kaggle’s Titanic learning competition for fun and to see where my out-of-the-box use of random forest would rank me (303 out of 5,882). It was interesting to see that much of the scoring differentiation came from data imputation, that is, filling in missing values based on other data. For example, we might have

Read more »
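To make the idea of imputation in the teaser above concrete, here is a minimal sketch in plain Python. It assumes one simple strategy, filling a missing value with the median of the observed values in the same group; this is not necessarily the approach the post takes. The function name, the column names `pclass` and `age`, and the rows are all made up for illustration.

```python
from statistics import median


def impute_by_group(rows, target, group):
    """Fill missing (None) values of `target` with the group-wise median.

    Hypothetical helper: rows is a list of dicts. A group with no observed
    values has nothing to borrow from, so its missing entries stay None.
    """
    observed = {}
    for row in rows:
        if row[target] is not None:
            observed.setdefault(row[group], []).append(row[target])

    fill_values = {key: median(vals) for key, vals in observed.items()}

    imputed = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are not mutated
        if new_row[target] is None:
            new_row[target] = fill_values.get(new_row[group])
        imputed.append(new_row)
    return imputed


# Made-up passenger records with two missing ages.
passengers = [
    {"pclass": 1, "age": 40},
    {"pclass": 1, "age": 50},
    {"pclass": 1, "age": None},
    {"pclass": 3, "age": 20},
    {"pclass": 3, "age": None},
]

filled = impute_by_group(passengers, target="age", group="pclass")
```

Grouping before taking the median is one way of using "other data" about a record; a single overall median, or a model-based prediction, would be alternative choices.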

ggplot2 graphics in a loop

April 29, 2013

A client has a specific audit they perform quarterly across 200 of their manufacturing plants. The audit has 8 distinct sections examining the different areas of the plant (shipping, receiving, storage, packaging, etc.). Instead of having one cumulative final score, the audit displays a final score for each section. I wanted to examine the distribution of

Read more »

Predicting Dichotomous Outcomes I

April 14, 2013

We are trying to predict a dichotomous dependent variable (male/female, yes/no, like/dislike, etc.) with independent “predictor” variables. Let’s say we want to determine whether or not an employee will quit based on the percentage of their tenure spent traveling. We assemble the data from HR and erroneously employ simple linear regression to model the relationship, a

Read more »
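One common reason linear regression is a poor fit for a yes/no outcome, as the teaser above suggests, can be sketched in plain Python. The numbers below are made up for illustration and are not the HR data from the post; the helper names are hypothetical. A straight line fitted to 0/1 outcomes can produce "probabilities" below 0 or above 1, whereas the logistic function always stays strictly between 0 and 1.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def logistic(z):
    """Map a real-valued score into (0, 1), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


# Made-up toy data: percent of tenure spent traveling, and quit (1) or stayed (0).
travel_pct = [0, 10, 20, 30, 40, 50]
quit_flag = [0, 0, 0, 1, 1, 1]

intercept, slope = fit_line(travel_pct, quit_flag)

# Fitted values from the straight line at the two ends of the observed range.
fitted_at_0 = intercept + slope * 0     # falls below 0
fitted_at_50 = intercept + slope * 50   # rises above 1
```

Even inside the observed range, the fitted line leaves the [0, 1] interval on this toy data, so its output cannot be read as a probability. A logistic model instead passes a linear score through `logistic`, which keeps every prediction in bounds.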

Gradient Boosting: Analysis of LendingClub’s Data

April 8, 2013

An old 5.75% CD of mine recently matured, and seeing that those interest rates are gone forever, I figured I’d take a statistical look at LendingClub’s data. LendingClub is the first peer-to-peer lending company to register its offerings as securities with the Securities and Exchange Commission (SEC). Their operational statistics are public and available for download. The latest

Read more »

Data visualization with R and ggplot2

March 28, 2013

I’m working on a one-hour ggplot2 lecture for the San Diego R users group, which I will post here when I’m done. There are already many great introductory resources on R data visualization, so I’ll only share working examples on my blog. A retail chain client employs a few hundred field agents who perform

Read more »

Samsung Phone Data Analysis Project

March 19, 2013

Below are my findings from the second data analysis project in Dr. Jeffrey Leek’s Johns Hopkins Coursera class. Introduction: I used the “Human Activity Recognition Using Smartphones Dataset” (UCI, 2013) to build a model. This data was recorded from a Samsung prototype smartphone with a built-in accelerometer. The purpose of my model was to recognize the type

Read more »

Layman’s Random Forests

March 18, 2013

I’m not a fan of the Top 40-style content on Quora, but a student in Dr. Leek’s Coursera class shared this absolute gem from Edwin Chen. I have not seen a better explanation: How do random forests work in layman’s terms? Suppose you’re very indecisive, so whenever you want to watch a movie, you ask

Read more »