Data analysis class

Posted on February 7, 2013 by Christopher Bare

[This article was first published on Digithead's Lab Notebook, and kindly contributed to R-bloggers].

I’ve been writing software to help others do data analysis for a number of years and at the same time trying to work up my nerve to try my own analysis. Why let other people have all the fun? So, when I saw that Jeffrey Leek, biostatistician at Johns Hopkins and coauthor of Simply Statistics, was teaching an online course in data analysis, I signed up.

The class starts off with an overview of the landscape of data analysis. Echoing the data-science Venn diagram, Leek posits that data analysis sits at the intersection of hacking, statistics and domain knowledge.

What follows are my crib notes from Jeff’s slides and from supplementary material. To get started in a cautious frame of mind, we get some wisdom from John Tukey:

“The data may not contain the answer. The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data…”

To that advice, Leek adds:

…no matter how big the data are.

A cautious data analyst pursues the question at hand with the appropriate type of analysis and will avoid going further than the available data allows.

Types of data analysis

Descriptive – Summarize and highlight, leaving generalization, interpretation and modeling for later.
Exploratory – Discover new relationships and define future studies. Findings require confirmation.
Inferential – Estimate values for a large population from a small sample, and quantify the uncertainty.
Predictive – Use data to estimate unmeasured values. If X predicts Y, X does not necessarily cause Y, which is just another way of saying correlation does not imply causation.
Causal – Find the effect on one variable of changes in another. Randomized studies are usually required.
Mechanistic – Typically, deterministic equations are known, but the parameters must be inferred. Think physics.
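
To make a few of these distinctions concrete, here is a minimal R sketch contrasting descriptive, inferential and predictive analysis. It is my own illustration, not from the course; the data are simulated and every variable name is made up for the example.

```r
# Simulated data: the "population" here is synthetic and purely illustrative.
set.seed(42)
population <- rnorm(100000, mean = 50, sd = 10)
smp <- sample(population, 100)

# Descriptive: summarize the sample in hand, with no claim beyond it.
sample_mean <- mean(smp)
sample_sd <- sd(smp)

# Inferential: estimate the population mean and quantify the uncertainty
# with a 95% confidence interval from a one-sample t-test.
ci <- t.test(smp)$conf.int

# Predictive: use x to estimate unmeasured values of y.
# A good fit says nothing about whether x causes y.
x <- runif(100, 0, 10)
y <- 3 + 2 * x + rnorm(100)
fit <- lm(y ~ x)
y_hat <- predict(fit, newdata = data.frame(x = c(2.5, 7.5)))
```

The descriptive step stops at the sample, the inferential step attaches an interval to a claim about the population, and the predictive step only estimates unseen values.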

On process, Leek outlines a series of steps similar to those articulated by Hadley Wickham (Engineering with data analysis) and Jeffrey Heer.

Steps in a data analysis

Define question
Define ideal data set
Determine what data you can access
Obtain data
Clean data
Exploratory analysis
Statistical prediction/modeling
Interpret results
Challenge results
Synthesize/write-up results
Create reproducible code
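
As a rough sketch of how several of these steps might look in a single R script, the example below uses the airquality data set that ships with R. The question and the choice of a linear model are my own assumptions for illustration, not something prescribed in the class.

```r
# Question (hypothetical): does temperature help predict ozone?

# Obtain data: airquality is bundled with R's datasets package.
raw <- airquality

# Clean data: keep only rows where both variables were observed.
clean <- raw[complete.cases(raw[, c("Ozone", "Temp")]), ]

# Exploratory analysis: quick numeric summaries before modeling.
summary(clean[, c("Ozone", "Temp")])
r <- cor(clean$Ozone, clean$Temp)

# Statistical modeling: a simple linear model.
fit <- lm(Ozone ~ Temp, data = clean)

# Interpret and challenge: look at the estimates and their uncertainty,
# then keep the residuals to check for structure the model missed.
confint(fit)
res <- residuals(fit)

# Reproducibility: record the R version and loaded packages.
sessionInfo()
```

Keeping every step in one script, from raw data to model, is what makes the analysis repeatable by someone else.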

The class is taught in R. Early lectures cover basics like how R’s type system represents continuous and categorical data. Next come basic data munging operations: binning with cut, then subset, sort, merge and reshape.
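
Here is a small sketch of those operations in base R. The two data frames are toy data invented for this example, and all the column names are hypothetical.

```r
# Hypothetical toy data, invented for illustration.
people <- data.frame(
  id    = 1:6,
  age   = c(23, 35, 47, 51, 62, 19),              # continuous -> numeric
  group = factor(c("a", "b", "a", "c", "b", "a")) # categorical -> factor
)
scores <- data.frame(id = c(1, 2, 3, 5), score = c(88, 92, 75, 60))

# Binning: cut() turns a numeric vector into a factor of intervals.
people$age_bin <- cut(people$age, breaks = c(0, 30, 50, Inf),
                      labels = c("young", "middle", "older"))

# Subsetting: keep rows matching a condition.
over30 <- subset(people, age > 30)

# Sorting: sort() works on vectors; order() gives row indices for data frames.
by_age <- people[order(people$age), ]

# Merging: join two data frames on a shared key (inner join by default).
joined <- merge(people, scores, by = "id")
```

Note that merge() drops ids without a match in both tables unless you pass all = TRUE, which is worth checking whenever row counts change after a join.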

The goal of data munging is to produce clean data, that is, data that is amenable to analysis. Hadley Wickham’s paper on tidy data defines a set of properties, closely related to database normalization, oriented towards getting data ready for further manipulation, visualization and modeling. This is part of what my colleague, Brig, calls data activation.

Properties of Tidy data

One variable per column
One observation per row
Tables hold elements of only one kind

Plus

Column names are easy to use and informative
Row names are easy to use and informative
Obvious mistakes in the data have been removed
Variable values are internally consistent
Appropriate transformed variables have been added
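
To illustrate the first two properties, here is a minimal sketch using base R's reshape(). The table is hypothetical: it hides the variable "year" in its column names, and reshaping to long form gives one variable per column and one observation per row.

```r
# Hypothetical untidy table: one column per year, so the variable "year"
# is hidden in the column names.
wide <- data.frame(
  country    = c("A", "B"),
  cases.2011 = c(10, 20),
  cases.2012 = c(15, 25)
)

# reshape() to long form: one row per (country, year) observation.
long <- reshape(wide,
                direction = "long",
                varying   = c("cases.2011", "cases.2012"),
                v.names   = "cases",
                timevar   = "year",
                times     = c(2011, 2012),
                idvar     = "country")
rownames(long) <- NULL  # drop the auto-generated row names
```

The long table has four rows, one for each country and year, which is the shape most modeling and plotting functions expect.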

Luckily for us, data is the philosophy of the day. The unreasonable effectiveness of data is widely appreciated, and there is more data than analysis talent available. There are loads of resources for helping students of data analysis grow into data scientists.

Data sources

Open government data from many sources: data.gov, France, UK, GapMinder, a list of cities/states with open data, and Civic Commons, many served by Seattle startup Socrata.
asdfree
Infochimps
Kaggle
Hilary Mason's research data
Stanford Large Network Data
UCI Machine Learning
KDnuggets Datasets
CMU Statlib
Gene Expression Omnibus
ArXiv Data
Spambase data set from the UC Irvine Machine Learning Repository

APIs

Resources
