Helpful Data Science Reads

January 2, 2018

(This article was first published on some real numbers, and kindly contributed to R-bloggers)

Here are some of the books that I found interesting and useful in 2017.

Scrum: The Art of Doing Twice the Work in Half the Time by Jeff Sutherland

Jeff Sutherland, one of the creators of the scrum methodology of project management, lays down the rationale for adopting scrum over more traditional project management frameworks. He explains how large-scale software projects often faltered under ‘waterfall’ type approaches but were quickly turned around by ditching Gantt charts and drawing up a scrum board.

Although I believe there is a time and a place for traditional methods, for example on long-term projects with fixed budgets and a finite, predictable scope, the agile approach of scrum has huge advantages for analytical projects, especially when stakeholder requirements shift quickly.

Scrum: The Art of Doing Twice the Work in Half the Time describes the fundamental concepts of scrum and is a must-read for data science managers.

R for Data Science: Import, Tidy, Transform, Visualize, and Model Data by Hadley Wickham, Garrett Grolemund

R for Data Science equips data scientists with the tools and techniques to extract data, clean it up and uncover clear insights. The primary toolbox here is R’s tidyverse package, itself a collection of packages that have revolutionised R by making it much easier to use for day-to-day data management activities. The authors also present a helpful iterative framework for conducting data science projects and address the common issues of messy data and strange formats.

R for Data Science is extremely easy to read. If you haven’t played around with the tidyverse, this book is a great place to start. Your life as a data scientist will become much easier!

The Third Wave: An Entrepreneur’s Vision of the Future by Steve Case

Steve Case, co-founder of AOL, predicts the future will be dominated not by the ‘Internet of Things’ but by the ‘Internet of Everything.’ This will be the third wave of the internet. He describes the rising influence of impact investing and how tech and government need to work together to usher in this next phase.

I found Steve’s account of AOL during the early days of the internet fascinating and his prophecies about the future of tech inspiring. Clearly, data scientists have a big part to play in uncovering the insights buried in the huge volumes of data that will arise when the third wave comes into being.

Data Smart: Using Data Science to Transform Information into Insight by John W. Foreman

The then chief data scientist at, John W. Foreman takes readers through several practical exercises in data science and business analytics. Data Smart highlights the point that in order to ‘do data science’ one does not always have to use fancy tools; much of the work can simply be done in Excel. Of course, using a tool such as R is often much more convenient, but it is sometimes helpful to hack the problem in Excel, because then you know you really understand what you are doing! I found the pragmatic approach to data science put forward by Data Smart most refreshing.

Data Science for Business: What you need to know about data mining and data-analytic thinking by Foster Provost, Tom Fawcett

Data Science for Business presents a comprehensive survey of modern supervised and unsupervised data science methods and applications in a business context. I particularly enjoyed the treatment of model ‘accuracy’ and business cost considerations when settling on an acceptable true positive threshold for a chosen model.

This book strikes a nice balance between being too technical and too fluffy. For those really interested, ‘extra for experts’ mathematics sections are available throughout.
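To make the threshold point concrete, here is a minimal Python sketch of choosing a classification cut-off by business cost rather than raw accuracy. It is not taken from the book: the scores, labels, candidate thresholds and cost figures are all made up for illustration, and the function names are hypothetical.

```python
def total_cost(threshold, scores, labels, fp_cost, fn_cost):
    """Total cost when every score >= threshold is treated as a positive."""
    cost = 0.0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if predicted_positive and label == 0:
            cost += fp_cost  # false positive: we acted on a non-event
        elif not predicted_positive and label == 1:
            cost += fn_cost  # false negative: we missed a real event
    return cost


def best_threshold(scores, labels, fp_cost, fn_cost, candidates):
    """Return the candidate threshold with the lowest total cost."""
    return min(
        candidates,
        key=lambda t: total_cost(t, scores, labels, fp_cost, fn_cost),
    )


# Hypothetical model scores and true outcomes (1 = event, 0 = no event).
scores = [0.95, 0.80, 0.65, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
candidates = [0.25, 0.50, 0.75]

# Missing an event is five times as costly as a false alarm.
print(best_threshold(scores, labels, fp_cost=1, fn_cost=5, candidates=candidates))

# A false alarm is five times as costly as missing an event.
print(best_threshold(scores, labels, fp_cost=5, fn_cost=1, candidates=candidates))
```

With these toy numbers the preferred threshold moves from 0.25 to 0.75 once false alarms become the expensive mistake, even though the model and its scores are unchanged.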

An Introduction to Statistical Learning: With Applications in R by Gareth James, Trevor Hastie, Robert Tibshirani, Daniela Witten

This book is by far the best book I have read on statistical learning. Like Data Science for Business, ISL strikes a great balance between mathematical and intuitive explanations.

What I particularly like about this book is the discussion around the trade-off between variance and bias when choosing a model specification and the handy graphics used to illustrate this point. Another selling point is the collection of exercises that enable the reader to test their knowledge by using R.
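For readers who have not met the variance and bias trade-off before, the rough Python simulation below may help. It is my own sketch rather than an example from the book (whose exercises use R): it assumes a sine curve as the true function, Gaussian noise, and a k-nearest-neighbours average as the model, and every name and parameter value in it is hypothetical.

```python
import math
import random
from statistics import mean, pvariance


def true_f(x):
    """Assumed 'true' relationship between x and y."""
    return math.sin(2 * math.pi * x)


def knn_predict(xs, ys, x0, k):
    """Predict at x0 by averaging y over the k training points nearest to x0."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    return mean(ys[i] for i in nearest)


def bias_variance(k, x0=0.25, n_sims=200, noise_sd=0.3, seed=0):
    """Estimate squared bias and variance of the k-NN prediction at x0."""
    rng = random.Random(seed)
    xs = [i / 20 for i in range(21)]  # fixed grid of training inputs on [0, 1]
    preds = []
    for _ in range(n_sims):
        # Draw a fresh noisy training set and refit (k-NN has no fitting step).
        ys = [true_f(x) + rng.gauss(0, noise_sd) for x in xs]
        preds.append(knn_predict(xs, ys, x0, k))
    bias_sq = (mean(preds) - true_f(x0)) ** 2
    return bias_sq, pvariance(preds)


for k in (1, 5, 21):
    bias_sq, variance = bias_variance(k)
    print(f"k={k:2d}  bias^2={bias_sq:.3f}  variance={variance:.3f}")
```

A small k gives a flexible model that tracks the noise in each training set (low bias, high variance), while k equal to the full sample size simply predicts the overall mean (high bias, low variance).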

Note: if you want to expand on the content found in this book you may consider reading its big brother, The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Trevor Hastie, Robert Tibshirani, Jerome Friedman.
