What is “Practical Data Science with R”?

[This article was first published on Win-Vector Blog » R, and kindly contributed to R-bloggers].

A bit about our upcoming book “Practical Data Science with R”: Nina and I are sharing our current draft of the front matter from the book, a description that should help you decide whether this is the book for you (we hope that it is). Or this could be the book that helps you explain what you do to others.


What is Data Science?

The statistician William S. Cleveland defined data science as an interdisciplinary field larger than statistics itself. We define data science as the art of transforming hypotheses and data into actionable predictions. For example, we can use models and data to predict who will win an election, what products will sell well together, which loans will default, or which advertisements will be clicked on.

Data science draws on tools from the empirical sciences, statistics, reporting, analytics, visualization, business intelligence, expert systems, machine learning, databases, data warehousing, data mining, and big data. It is because we have so many tools that we need a discipline that covers them all. What distinguishes data science itself from the tools and techniques is the central goal of deploying effective decision-making models to a production environment.

Data science is often a “second calling.” Many of the best data scientists we meet started as programmers, statisticians, business intelligence analysts or scientists. By adding a few more techniques to their repertoire they became excellent data scientists. That observation drives this book: we will introduce the practical skills needed by the data scientist by concretely working through all of the common project steps on real data. Some steps you will know better than we do, some you will pick up quickly, and some you may need to research further.

Much of the theoretical basis of data science comes from statistics. However, data science as we know it is very much influenced by technology and software engineering methodologies, and has largely evolved in heavily computer science and information technology driven groups. We can call out some of the engineering flavor of data science by listing some famous examples:

These systems share a lot of features:

This book will teach the principles and tools needed to build systems like these. We want to teach the common tasks, steps and tools used to successfully deliver such projects. Our emphasis is on the whole process, project management, working with others and presenting results to non-specialists.

Why this book?

This is the book for you if you want to work as a data scientist, or already do. This book will demonstrate the tools, habits and interactions of successful data scientists and data science projects. We have learned a lot from the many different people and fields that work on learning from data, and here we distill and share the best practices. Some chapters are elementary and some are advanced, but all chapters contain things we wish we had known a lot earlier.

This is the book we wish we had available to hand out to clients and peers. Its purpose is to explain the best parts of statistics, computer science and machine learning that are relevant to data science. Most data scientists have arrived recently from some other field, so they can still benefit from being reminded of some of the best tools from the many fields that contribute to data science. A software engineer who works as a data scientist will likely benefit from seeing a bit of explanation about statistical testing procedures and machine learning procedures. A statistician may be unfamiliar with the software engineering techniques of version control and agile project management, and with how much these can increase the chance of success in a project.

Throughout this book we are going to emphasize scientific principles such as repeatability of experiments. We will also emphasize software engineering principles such as automation of steps. We see scientific principles and software engineering principles as being co-equal ways to think about data science projects. You automate steps because you will have to repeat them and you can repeat steps because of your version control and automation.

We don’t want to invent techniques in this book, but explain the best standard techniques. A real victory for this book would be for an experienced data scientist to say “I always knew to split my data into test and training sets, but I had no idea how many things that was protecting me from!” Or, perhaps, for a software engineer who is starting to work as a data scientist to invent a new tool to automate test and train splits.
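As a small illustration of the test and training split mentioned above, here is a minimal sketch in R. It is not taken from the book: the built-in `iris` data frame stands in for project data, and the seed value and the roughly 20% test fraction are arbitrary assumptions chosen for the example.

```r
# Hypothetical stand-in data: the built-in iris data frame.
d <- iris

# Fix the random seed so the same split is produced on every run
# (the seed value itself is arbitrary).
set.seed(2013)

# Mark roughly 20% of rows as test data; the 0.2 fraction is an assumption.
is_test <- runif(nrow(d)) < 0.2

d_train <- d[!is_test, , drop = FALSE]
d_test  <- d[is_test, , drop = FALSE]

# Every row lands in exactly one of the two sets.
stopifnot(nrow(d_train) + nrow(d_test) == nrow(d))
```

Because the seed is set just before the random draw, rerunning the script reproduces the same split, which ties the automation point to the repeatability point.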

Throughout we are going to write about concepts (both statistics and machine learning), include concrete code and explore partnering with and presenting to non-specialists. We hope when you don’t find one of these topics novel that we are able to share a wrinkle on one or two of the other topics that you may not have thought about recently. We encourage you to try the example R code as you read the text; even when we are discussing fairly abstract aspects of data science we will illustrate examples with concrete data and code. We are arranging topics in the book in an order that we feel increases understanding. This order may not always match the order in which the tasks are performed.

What is in this book?

What is not in this book?

Who are the authors?

The first author, Nina Zumel, has worked as a scientist at one of the largest independent, nonprofit research institutes. She has worked as chief scientist of a price optimization company and founded a contract research company. Nina Zumel is now a principal consultant at Win-Vector LLC. She can be reached at nzumel@win-vector.com.

The second author, John Mount, has worked as a computational scientist in biotechnology and as a stock trading algorithm designer, and has managed a research team for a major online shopping site. John Mount is now a principal consultant at Win-Vector LLC. He can be reached at jmount@win-vector.com.
