“Essentially, all models are wrong, but some are useful.” George Box
Here's a caricature of a data science project: your company or client needs information (usually to make a decision). Your job is to build a model to predict that information. You fit a model, perhaps several, to the available data and evaluate them to find the best one. Then you cross your fingers that your chosen model doesn't crash and burn in the real world. We've already discussed how to detect whether your data has a signal. Now: how do you know that your model is good? And how sure are you that it's better than the models you rejected?
Notice the Sun in the 4th revolution about the Earth. A very pretty, but not entirely reliable, model.
In this latest “Statistics as it should be” series, we will systematically look at what to worry about and what to check. This is standard material, but presented in a “data science” oriented manner: we will judge the utility of a scoring system by how well it serves a negotiable business goal (one of the many ways data science differs from pure machine learning). To organize the ideas into digestible chunks, we are presenting this article as a four-part series. This part (part 1) sets up the specific problem.
Win-Vector blog: HOW DO YOU KNOW IF YOUR MODEL IS GOING TO WORK? PART 1: THE PROBLEM