The Cross-Industry Standard Process for Data Mining (CRISP-DM) is a process model that was developed around 20 years ago. I’ve read about it in various data mining and related books, and it’s come in very handy over the years. In this post, I’ll outline what the model is and why you should know about it, even if it has that terribly out-of-vogue phrase “data mining” in it!
Data / R people. Do you know what the CRISP-DM model is?
— Steph Locke (@SteffLocke) January 8, 2017
The model splits a data mining project into six phases and allows for moving back and forth between them. I’d personally stick in a few more backwards arrows, but it’s generally fine. The CRISP-DM model applies equally well to a data science project.
Typical activities in each phase
In Data Mining Techniques in CRM, a very readable book, the authors outline in Table 1.1 some typical activities within each phase:
- Business understanding
  - Understanding the business goal
  - Situation assessment
  - Translating the business goal into a data mining objective
  - Development of a project plan
- Data understanding
  - Considering data requirements
  - Initial data collection, exploration, and quality assessment
- Data preparation
  - Selection of required data
  - Data acquisition
  - Data integration and formatting […]
  - Data cleaning
  - Data transformation and enrichment […]
- Modeling
  - Selection of appropriate modeling technique
  - […] Splitting of the dataset into training and testing subsets for evaluation purposes
  - Development and examination of alternative modeling algorithms and parameter settings
  - Fine tuning of the model settings according to an initial assessment of the model’s performance
- Model evaluation
  - Evaluation of the model in the context of the business success criteria
  - Model approval
- Deployment
  - Create a report of findings
  - Planning and development of the deployment procedure
  - Deployment of the […] model
  - Distribution of the model results and integration in the organisation’s operational […] system
  - Development of a maintenance / update plan
  - Review of the project
  - Planning the next steps
The CRISP-DM process outlines the steps involved in performing data science activities from business need to deployment, and most importantly it indicates how iterative this process is and that you never get things perfectly right.
Within a given project, and especially our first ever project, we may not have a lot of domain knowledge at the start, there might be problems with the data, or the model might not be valuable enough to put into production. These things happen, and the really nice thing about the CRISP-DM model is that it allows for them. It’s not a single linear path from project kick-off to deployment. It helps you remember not to beat yourself up over having to go back a step. It also equips you with something upfront to explain to managers that sometimes you will need to bounce between some phases, and that’s ok.
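To make the back-and-forth concrete, here is a minimal sketch of the arrows in the standard CRISP-DM diagram as a lookup table. The phase names come from the model itself; the dictionary layout and the `can_move` and `backward_moves` helpers are my own illustrative assumptions, not part of any CRISP-DM tooling.

```python
# Arrows of the standard CRISP-DM diagram, keyed by the phase you are in.
# The data structure and helper names are hypothetical, for illustration only.
ALLOWED_MOVES = {
    "Business Understanding": {"Data Understanding"},
    "Data Understanding": {"Business Understanding", "Data Preparation"},
    "Data Preparation": {"Modeling"},
    "Modeling": {"Data Preparation", "Evaluation"},
    "Evaluation": {"Business Understanding", "Deployment"},
    "Deployment": set(),
}


def can_move(current: str, target: str) -> bool:
    """Return True if the diagram has an arrow from current to target."""
    return target in ALLOWED_MOVES.get(current, set())


def backward_moves() -> list:
    """List the arrows that lead back to an earlier phase."""
    order = list(ALLOWED_MOVES)  # dicts keep insertion order
    return [
        (src, dst)
        for src in order
        for dst in sorted(ALLOWED_MOVES[src])
        if order.index(dst) < order.index(src)
    ]


for src, dst in backward_moves():
    print(f"{src} -> {dst}")
```

Adding your own extra backwards arrows is then just a matter of adding entries to the relevant sets, which could be a handy prop when explaining to managers why a project has moved "backwards".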
All models are wrong but some are useful (George Box)
We also know that our model is not going to be perfect. By the end of the project, our model’s value is already deteriorating! We get new customers, people change, the world changes, the business changes. Everything is conspiring against your model. This means it requires regular TLC for it to remain of value. We might just need to regularly adjust it slightly for the latest view of the world (re-calibration), or we might need to take another tilt at modelling the problem. The big circle around the process shows this fact of a data scientist’s life.
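One way to turn that regular TLC into a routine is a simple rule that compares recent performance with performance at deployment. The sketch below is only an illustration: the `next_action` function, the choice of a higher-is-better score, and both thresholds are assumptions you would replace with values that make sense for your own model and business.

```python
def next_action(
    baseline_score: float,
    recent_score: float,
    recalibrate_drop: float = 0.02,  # hypothetical threshold
    remodel_drop: float = 0.10,      # hypothetical threshold
) -> str:
    """Suggest an action from the drop in a higher-is-better metric."""
    drop = baseline_score - recent_score
    if drop >= remodel_drop:
        return "remodel"      # take another tilt at modelling the problem
    if drop >= recalibrate_drop:
        return "recalibrate"  # adjust for the latest view of the world
    return "keep"             # the model is still holding up


# Example: a score of 0.80 at deployment against three later measurements.
for recent in (0.79, 0.75, 0.60):
    print(recent, next_action(0.80, recent))
```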
Working from the expectation that we will be iterative, we can start planning cycles of work. These might start with a short, small, simple model cycle to get a basic model quickly. Then further iterations can develop stronger models. The business gets some immediate benefit and it can then continue getting additional benefit from further cycles, or people could be moved onto building the next quick and simple model.
This gives the business a better high-level view of where data scientists are adding value. If the company is evolving its processes and data engineering capabilities at the same time, a broad range of simple models can be developed and implemented first, giving learning experiences for all involved.
Estimation of project work and scoping is often difficult for data science projects, and that does need to change. One thing we can do is take the CRISP-DM phases and typical activities and build checklists and process frameworks around them. We can start moving each “bespoke” activity into a “cookie-cutter” activity.
One simple way of doing this is to start with a checklist. I am a big fan of checklists, more so after reading The Checklist Manifesto. You can build a manual checklist for people to work through to make sure important tasks are completed, that lessons from past projects are addressed, and that ethical, regulatory, and legal considerations are dealt with at the right points in the development cycle.
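A checklist like this does not need special tooling. As a rough sketch, assuming nothing more than the Python standard library, you could track items per CRISP-DM phase as below. The `ChecklistItem` and `ProjectChecklist` classes are hypothetical names of mine, and the legal-basis task is an invented example of an ethical or legal checkpoint; the other two tasks come from the typical activities listed earlier.

```python
from dataclasses import dataclass, field


@dataclass
class ChecklistItem:
    phase: str
    task: str
    done: bool = False


@dataclass
class ProjectChecklist:
    items: list = field(default_factory=list)

    def add(self, phase: str, task: str) -> None:
        self.items.append(ChecklistItem(phase, task))

    def complete(self, task: str) -> None:
        """Mark the first item with this task name as done."""
        for item in self.items:
            if item.task == task:
                item.done = True
                return
        raise KeyError(task)

    def outstanding(self, phase: str) -> list:
        """Tasks in the given phase that are not yet done."""
        return [i.task for i in self.items if i.phase == phase and not i.done]

    def ready_to_leave(self, phase: str) -> bool:
        return not self.outstanding(phase)


checklist = ProjectChecklist()
checklist.add("Business Understanding", "Understanding the business goal")
checklist.add("Business Understanding", "Situation assessment")
# Hypothetical legal checkpoint, placed where the data first gets touched.
checklist.add("Data Understanding", "Confirm the legal basis for using the data")

checklist.complete("Understanding the business goal")
print(checklist.outstanding("Business Understanding"))  # ['Situation assessment']
```

Keeping the phase on each item means a lesson from a past project can be slotted in at the point in the cycle where it matters, rather than on one long undifferentiated list.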
The Microsoft Team Data Science Process is a developing framework that broadly follows the CRISP-DM model and is bringing in templates and tools to help data scientists. It’s proving quite interesting and I would recommend it as follow-up reading.
Thinking about how we work
I read a lot of productivity, project management, and framework books. I’m always interested in how we can do our jobs better. Usually, this boils down to making things simpler and helping ensure we do the right things at the right time. CRISP-DM is one simple thing that has helped me put structure onto what often seems a chaotic process. I hope it offers you some benefit, and I’d be really interested to hear your thoughts, experiences, and tips for building better data science workflows.