Teaching Practical Data Science with R

November 16, 2016

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

Practical Data Science with R, Zumel, Mount; Manning 2014 is a book Nina Zumel and I are very proud of.

I have written before how I think this book stands out and why you should consider studying from it.

Please read on for some additional comments on the intent of different sections of the book.

With Practical Data Science with R we wanted to help new data scientists and analysts get their bearings. We wanted to help them know what is expected of them, and to show them some tools and techniques that would help with their tasks. We are trying to help readers work through “data scientists’ block” or “analysts’ blank page syndrome.” We chose R because it is an excellent analysis platform, and sufficiently self-contained that one can work on any step of the data science process without already being a mystical data science unicorn. It is a book that tries to teach you what to do, with examples of it being done.

We worked very hard on each chapter. Some of them were opportunities to redo material we had already written about, this time with the benefit of editors. It was also a chance to not always be lost in the technical details, and some of the chapters take special advantage of that. I’d like to call out these particular chapters.

The core of the book includes:

  • Chapter 1 The data science process

This chapter tells you a lot about the nature of the work. Not many books cover this (one notable exception being Doing Data Science: Straight Talk from the Frontline, O’Neil, Schutt; O’Reilly 2014). A lot of analyst tasks are being taken over as “data science tasks,” so necessarily a lot of people will have to be recognized as data scientists. It makes sense to read a description of the roles and expectations to see if the job (not just the job title) appeals to you.

  • Chapter 3 Exploring data
  • Chapter 4 Managing data
  • Chapter 5 Choosing and evaluating models
  • Chapter 6 Memorization methods

This sequence of chapters forms the heart of the book. It starts with data and moves through the concept of modeling. Discussion of particular statistical and machine learning methods (such as linear regression, logistic regression, random forests, and support vector machines) is held off until after this core sequence.

We spend a lot of time on the neglected topic of data preparation because there are many more opportunities for model performance improvement at the “intake end” (variables) than at the “outtake end” (re-processing modeling results). Some of the ideas from this sequence have since been further refined (and documented) in our open source vtreat package.

  • Chapter 10 Documentation and deployment
  • Chapter 11 Producing effective presentations

These chapters are the epilogue of the book; they emphasize how to collaborate with others.

The remaining chapters are the nuts and bolts:

  • Chapter 2 Loading data into R
  • Chapter 7 Linear and logistic regression
  • Chapter 8 Unsupervised methods
  • Chapter 9 Exploring advanced methods

These chapters concentrate on how the tools that let you pursue the goals and tasks of the other chapters actually work. For instance, an unstated goal of Chapter 7 was to be able to read almost every scrap of summary that R reports for lm and glm models. We even included how to calculate the (oddly missing) overall model significance for glm (a feature now supplied in our sigr package). Every scrap of data and code needed to reproduce the results in these chapters is shared in our book GitHub repository (including re-runs of all steps as R Markdown worksheets).
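As a rough illustration of the glm significance calculation mentioned above, here is a minimal sketch of one common approach: a chi-squared test on the drop from the null deviance to the residual deviance. This is not necessarily the exact code from the book or from sigr; the mtcars data set and the choice of variables are assumptions made purely so the example runs on its own.

```r
# Illustrative model only: mtcars ships with R, and the formula is an
# arbitrary choice for this sketch, not an example taken from the book.
model <- glm(am ~ wt + hp, data = mtcars, family = binomial(link = "logit"))

# How much the fitted model reduces deviance relative to the
# intercept-only (null) model.
delta_deviance <- model$null.deviance - model$deviance

# How many extra coefficients the fitted model uses to do so.
delta_df <- model$df.null - model$df.residual

# Under the null hypothesis (no variable helps) the deviance drop is
# approximately chi-squared distributed with delta_df degrees of freedom.
p_value <- pchisq(delta_deviance, df = delta_df, lower.tail = FALSE)

print(p_value)
```

A small p-value suggests the model as a whole explains more than an intercept-only model would, which is the role the F-test plays in the lm summary.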

We could have written a book that was only these chapters expanded, but we felt the core material was so under-taught that spending a bit more time on that would be higher value to the reader.

And that is my rough outline of Practical Data Science with R.
