Data analysis class

February 7, 2013

(This article was first published on Digithead's Lab Notebook, and kindly contributed to R-bloggers)

I’ve been writing software to help others do data analysis for a number of years and at the same time trying to work up my nerve to try my own analysis. Why let other people have all the fun? So, when I saw that Jeffrey Leek, biostatistician at Johns Hopkins and coauthor of Simply Statistics, was teaching an online course in data analysis, I signed up.

The class starts off with an overview of the landscape of data analysis. Echoing the data-science Venn diagram, Leek posits that data analysis sits at the intersection of hacking, statistics and domain knowledge.

What follows is my crib notes from Jeff’s slides and from supplementary material. To get started in a cautious frame of mind, we get some wisdom from John Tukey:

“The data may not contain the answer. The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data…”

…no matter how big the data are.

A cautious data analyst pursues the question at hand with the appropriate type of analysis and avoids going further than the available data allow.

Types of data analysis

• Descriptive – Summarize and highlight, leaving generalization, interpretation and modeling for later.
• Exploratory – Discover new relationships and define future studies; findings require confirmation.
• Inferential – Estimate values for a large population from a small sample, and quantify the uncertainty of the estimate.
• Predictive – Use data to estimate unmeasured values. If X predicts Y, X does not necessarily cause Y, which is just another way of saying correlation does not imply causation.
• Causal – Find the effect on one variable of changes in another. Randomized studies are usually required.
• Mechanistic – Typically, deterministic equations are known, but the parameters must be inferred. Think physics.
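
To make the difference between the descriptive and inferential types concrete, here is a minimal sketch. It is not taken from the course, which is taught in R; it uses Python's standard library purely for illustration. The sample values, the function names and the choice of a normal-approximation interval are all assumptions made for the example.

```python
import math
import statistics

# Hypothetical sample: heights (cm) of 10 people drawn from a much larger population.
sample = [162.0, 175.5, 168.2, 180.1, 171.3, 159.8, 177.4, 169.9, 173.0, 165.6]


def describe(values):
    """Descriptive step: summarize the sample itself, with no generalization."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # sample standard deviation
    }


def mean_confidence_interval(values, z=1.96):
    """Inferential step: rough 95% interval for the population mean.

    Assumes the normal approximation (z = 1.96). For a sample this small,
    a t-based interval would be somewhat wider.
    """
    summary = describe(values)
    standard_error = summary["sd"] / math.sqrt(summary["n"])
    return (summary["mean"] - z * standard_error,
            summary["mean"] + z * standard_error)


summary = describe(sample)
low, high = mean_confidence_interval(sample)
print(f"sample mean = {summary['mean']:.1f} cm")
print(f"approx. 95% interval for the population mean = ({low:.1f}, {high:.1f}) cm")
```

The descriptive summary says something only about these ten measurements. The interval is the inferential part: it is a statement about the larger population, and it comes with an explicit measure of uncertainty.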

On process, Leek outlines a series of steps similar to those articulated by Hadley Wickham (Engineering with data analysis) and Jeffrey Heer.

Steps in a data analysis

1. Define question
2. Define ideal data set
3. Determine what data you can access
4. Obtain data
5. Clean data
6. Exploratory analysis
7. Statistical prediction/modeling
8. Interpret results
9. Challenge results
10. Synthesize/write-up results
11. Create reproducible code

The class is taught in R. Early lectures cover basics like how R’s type system represents continuous and categorical data. Next come basic data munging operations like binning with cut, subset, sort, merge and reshape.
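
As a rough illustration of those munging operations, the sketch below bins, subsets, sorts and merges a handful of records. The course does this with R's own functions; this version uses plain Python lists and dictionaries as a stand-in, and every record, field name and label in it is hypothetical.

```python
from bisect import bisect_right

# Hypothetical records; in R these would live in a data frame.
people = [
    {"id": 1, "age": 23, "city_id": "A"},
    {"id": 2, "age": 41, "city_id": "B"},
    {"id": 3, "age": 67, "city_id": "A"},
    {"id": 4, "age": 35, "city_id": "C"},
]
# Hypothetical lookup table; key "C" is deliberately missing.
cities = {"A": "Seattle", "B": "Baltimore"}

# Binning: map a continuous age onto a categorical label.
# Intervals here are closed on the left, e.g. [30, 60), whereas
# R's cut defaults to right-closed intervals, e.g. (30, 60].
breaks = [30, 60]
labels = ["young", "middle", "older"]


def age_group(age):
    return labels[bisect_right(breaks, age)]


binned = [dict(p, age_group=age_group(p["age"])) for p in people]

# Subset: keep only the rows that satisfy a condition.
over_30 = [p for p in binned if p["age"] > 30]

# Sort: order the rows by a column, oldest first.
by_age_desc = sorted(over_30, key=lambda p: p["age"], reverse=True)

# Merge: left join against the lookup table; unmatched keys become None.
merged = [dict(p, city=cities.get(p["city_id"])) for p in by_age_desc]

for row in merged:
    print(row)
```

Note that the join keeps the row whose city is unknown rather than silently dropping it, which makes the gap in the lookup table visible during cleaning.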

The goal of data munging is to produce clean data, that is, data amenable to analysis. Hadley Wickham’s paper on tidy data defines a set of properties, closely related to database normalization, oriented towards getting data ready for further manipulation, visualization and modeling. This is part of what my colleague, Brig, calls data activation.

Properties of tidy data

• One variable per column
• One observation per row
• Tables hold elements of only one kind

Plus

• Column names are easy to use and informative
• Row names are easy to use and informative
• Obvious mistakes in the data have been removed
• Variable values are internally consistent
• Appropriate transformed variables have been added
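
The first two properties are easiest to see on a small example. The sketch below reshapes a "wide" table, where column headers are really values of a variable, into a "long" table with one observation per row. It is a plain-Python illustration under assumed data: the table, the column names and the melt helper are all hypothetical, not code from the course or from Wickham's paper.

```python
# Hypothetical "wide" table: one row per patient, one column per visit.
# The headers visit1 and visit2 are really values of a "visit" variable.
wide = [
    {"patient": "p1", "visit1": 5.1, "visit2": 4.8},
    {"patient": "p2", "visit1": 6.3, "visit2": None},  # missing measurement
]


def melt(rows, id_column, value_columns, variable_name, value_name):
    """Reshape wide rows into long form: one observation per output row."""
    long_rows = []
    for row in rows:
        for column in value_columns:
            long_rows.append({
                id_column: row[id_column],
                variable_name: column,     # former header becomes a value
                value_name: row[column],   # the measurement itself
            })
    return long_rows


tidy = melt(wide, "patient", ["visit1", "visit2"], "visit", "score")
for row in tidy:
    print(row)
```

After reshaping, each column holds exactly one variable (patient, visit, score) and each row is a single measurement, so the missing value for p2 shows up as its own row instead of hiding in a cell of a wider record.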

Luckily for us, data is the philosophy of the day. The unreasonable effectiveness of data is widely appreciated, and there is more data than analysis talent available. There are loads of resources for helping students of data analysis grow into data scientists.

Resources
