I've been writing software to help others do data analysis for a number of years and at the same time trying to work up my nerve to try my own analysis. Why let other people have all the fun? So, when I saw that Jeffrey Leek, biostatistician at Johns Hopkins and coauthor of Simply Statistics, was teaching an online course in data analysis, I signed up.
The class starts off with an overview of the landscape of data analysis. Echoing the data-science Venn diagram, Leek posits that data analysis sits at the intersection of hacking, statistics and domain knowledge.
What follows is my crib notes from Jeff's slides and from supplementary material. To get started in a cautious frame of mind, we get some wisdom from John Tukey:
“The data may not contain the answer. The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data...”
To that advice, Leek adds:
...no matter how big the data are.
A cautious data analyst pursues the question at hand with the appropriate type of analysis and avoids going further than the available data allow.
Types of data analysis
- Descriptive - Summarize and highlight, leaving generalization, interpretation and modeling for later.
- Exploratory - Discover new relationships and define future studies; findings require confirmation.
- Inferential - Estimate values for a large population from a small sample, and quantify the uncertainty.
- Predictive - Use data to estimate unmeasured values. If X predicts Y, X does not necessarily cause Y, which is just another way of saying correlation does not imply causation.
- Causal - Find the effect on one variable of changes in another. Randomized studies are usually required.
- Mechanistic - Typically, deterministic equations are known, but the parameters must be inferred. Think physics.
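To make the inferential case concrete for myself, here's a minimal sketch in R. It's my own illustration rather than anything from Jeff's slides: the "population" is simulated, and the names `population` and `small_sample` are made up for the example. The idea is to estimate a population mean from a small sample and attach a confidence interval to it.

```r
# Simulated "population" (an assumption for illustration, not course data)
set.seed(42)
population <- rnorm(100000, mean = 50, sd = 10)

# Inferential analysis: a small sample stands in for the whole population
small_sample <- sample(population, size = 100)

# Point estimate of the population mean
estimate <- mean(small_sample)

# Quantify the uncertainty with a 95% confidence interval
# taken from a one-sample t-test
ci <- t.test(small_sample, conf.level = 0.95)$conf.int

cat("estimate:", round(estimate, 2), "\n")
cat("95% CI:  ", round(ci[1], 2), "to", round(ci[2], 2), "\n")
```

The interval is the part that makes this inferential rather than descriptive: it says how far the sample estimate might plausibly be from the population value.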
Steps in a data analysis
- Define question
- Define ideal data set
- Determine what data you can access
- Obtain data
- Clean data
- Exploratory analysis
- Statistical prediction/modeling
- Interpret results
- Challenge results
- Synthesize/write-up results
- Create reproducible code
The class is taught in R. Early lectures cover basics like how R's type system represents continuous and categorical data. Next come basic data munging operations: binning with cut, plus subset, sort, merge and reshape.
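Here's a rough sketch of those munging operations in base R. The data frames `people` and `visits` and all their values are toy data I invented for illustration; only the functions themselves (`cut`, `subset`, `order`, `merge`, `reshape`) are real.

```r
# Toy data, invented for illustration
people <- data.frame(
  id   = c(1, 2, 3, 4),
  age  = c(23, 37, 58, 41),
  city = c("Seattle", "Boston", "Seattle", "Austin"),
  stringsAsFactors = FALSE
)
visits <- data.frame(
  id          = c(1, 2, 3, 4),
  visits_2012 = c(3, 0, 5, 2),
  visits_2013 = c(4, 1, 2, 6)
)

# cut: bin a continuous variable into a categorical one (a factor).
# The intervals are (0,30], (30,50] and (50,Inf].
people$age_group <- cut(people$age, breaks = c(0, 30, 50, Inf),
                        labels = c("30 or under", "31 to 50", "over 50"))

# subset: keep rows matching a condition, and only some columns
seattle <- subset(people, city == "Seattle", select = c(id, age))

# sort: sort() orders a single vector; for a data frame,
# order() gives the row permutation to index with
by_age <- people[order(people$age), ]

# merge: join two data frames on a shared key
joined <- merge(people, visits, by = "id")

# reshape: wide to long, giving one row per person per year
long <- reshape(visits, direction = "long",
                varying = c("visits_2012", "visits_2013"),
                v.names = "visits", timevar = "year",
                times = c(2012, 2013), idvar = "id")
```

Note the split between continuous and categorical data here: `age` stays numeric, while `age_group` comes out of `cut` as a factor.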
The goal of data munging is to produce clean data, meaning data that is amenable to analysis. Hadley Wickham's paper on tidy data defines a set of properties, closely related to database normalization, oriented towards getting data ready for further manipulation, visualization and modeling. This is part of what my colleague, Brig, calls data activation.
Properties of tidy data
- One variable per column
- One observation per row
- Tables hold elements of only one kind
- Column names are easy to use and informative
- Row names are easy to use and informative
- Obvious mistakes in the data have been removed
- Variable values are internally consistent
- Appropriate transformed variables have been added
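As a small worked example of a few of these properties, here's a sketch in base R. The table `raw`, its column names and its values are all invented for illustration, and treating a negative income as missing is my own assumption about what counts as an "obvious mistake".

```r
# Messy toy data, invented for illustration
raw <- data.frame(
  V1 = c("a01", "a02", "a03", "a04"),
  V2 = c("M", "male", "F", "Female"),
  V3 = c(52000, 61000, -1, 48000),
  stringsAsFactors = FALSE
)

tidy <- raw

# Column names are easy to use and informative
names(tidy) <- c("subject_id", "sex", "income")

# Variable values are internally consistent: settle on one coding for sex
tidy$sex <- ifelse(tolower(substr(tidy$sex, 1, 1)) == "m", "male", "female")

# Obvious mistakes have been removed: a negative income becomes missing
# (an assumption; the right fix depends on how the data were collected)
tidy$income[tidy$income < 0] <- NA

# Appropriate transformed variables have been added
tidy$log_income <- log10(tidy$income)

print(tidy)
```

Each row is still one observation and each column one variable; the cleanup only changes names and values, not the shape of the table.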
Luckily for us, data is the philosophy of the day. The unreasonable effectiveness of data is widely appreciated, and there is more data than analysis talent available. There are loads of resources for helping students of data analysis grow into data scientists.
- Open government data from many sources: data.gov, France, UK, GapMinder, a list of cities/states with open data, and Civic Commons, many served by Seattle startup Socrata.
- Hilary Mason's research data
- Stanford Large Network Data
- UCI Machine Learning Repository
- KDnuggets Datasets
- CMU StatLib
- Gene Expression Omnibus
- ArXiv Data
- Spambase data set from the UC Irvine Machine Learning Repository