I’m pleased to share Part I of my new book “Introduction to Reproducible Science in R”. The purpose of this book is to approach model development and software development holistically to help make science and research more reproducible. The need for such a book arose from challenges I’ve seen while teaching graduate courses in natural language processing and machine learning, as well as training my own staff to become effective data scientists. While quantitative reasoning and mathematics are important, I often found that the primary obstacle to good data science was a lack of reproducibility and repeatability: it’s difficult to quickly reproduce someone else’s results. And this causes myriad headaches:
- It’s difficult to validate a model
- It’s difficult to diagnose a model
- It’s difficult to improve/change a model
- It’s difficult to reuse parts of a model
Ultimately, without repeatability, it’s difficult to trust a model. And if we can’t trust our models, what good are they?
This book therefore focuses on the practical aspects of data science: the tools and workflows necessary to make work repeatable and efficient. Part I introduces a complete Linux-based toolchain to go from basic prototyping to full-fledged operational/production models. It introduces tools like bash, make, git, and docker. I show how all these tools fit together to bring structure and repeatability to your project. These are tools that (some) professional data scientists use, and you can use them throughout your career.
With this foundation established, Part II describes common model development workflows, from exploratory analysis to model design through to operationalization and reporting. I walk through these archetypical workflows and discuss the approaches and tools for accomplishing the steps in the workflows. Examples include how to design code to compare models, effective approaches for testing code, and how to create a server and schedule jobs to run models.
Finally, in Part III I dive into programming. Those who have read my blog know that I look at programming for data science differently from systems programming, and that difference changes the way you program. I’m a strong advocate of functional programming for data science, because it fits better with the mathematics. This part introduces functional programming and discusses data structures from this perspective. I also show how to approach common problems in data science from this view.
In short, this book will help you become not only a better programmer but also a better scientist. I assume the reader knows how to program and has experience creating models. It is appropriate for practitioners, graduate students, and advanced upperclass undergraduates.
Any feedback is appreciated. Feel free to comment here or on Twitter.
What happened to my other book, “Modeling Data With Functional Programming In R”? For those curious, my editor and I decided to postpone publishing it until after this book. I decided that I needed to provide a foundation that people could use to appreciate this other book.