Recently, I came up with Thoen’s law. It is an empirical one, based on several years of doing data science projects in different organisations. Here it is: the probability that you have worked on a data science project that failed approaches one very quickly as the number of projects you have done grows. I think many of us, far more than we as a community like to admit, will deal with projects that don’t meet their objectives. This blog does not explore why data science projects have a high risk of failing; Jonathan Nolis already did this adequately. Rather, I’ll look at strategies for dealing with projects that are failing. Disappointing as they may be, failed projects are an inherent part of the novel and challenging discipline that data science is in many organisations.
The following strategies might reduce the probability of failure, but that is not the main point. Their objective is to prevent failing in silence after too long a stretch of project time in which you try to figure things out on your own. They shift failure from the silent, personal domain to the public, collective one, hopefully reducing stress and blame, both from yourself and from others.
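
Thoen’s law as stated in the introduction is empirical, but a back-of-the-envelope calculation shows why the probability climbs so fast. The sketch below is an illustration only: it assumes every project fails independently with the same probability, and the 30% failure rate is a made-up number, not a measured one.

```python
def prob_at_least_one_failure(p_fail: float, n_projects: int) -> float:
    """Chance of having seen at least one failed project after n_projects.

    Assumption (not from the post): projects fail independently of each
    other, each with the same probability p_fail.
    """
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be between 0 and 1")
    if n_projects < 0:
        raise ValueError("n_projects must be non-negative")
    # Complement of "every single project succeeded".
    return 1.0 - (1.0 - p_fail) ** n_projects


if __name__ == "__main__":
    # 0.3 is a hypothetical failure rate, chosen purely for illustration.
    for n in (1, 3, 5, 10, 20):
        print(f"{n:>2} projects: {prob_at_least_one_failure(0.3, n):.3f}")
```

Under these assumptions, ten projects already put the probability of having experienced a failure above 97%.
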
Make failing an option from the start
At the beginning of a project, enthusiasm and optimism are at their peak, especially in data science projects. Isn’t data the new oil? This is the time we are finally going to dig into that well and leverage our data in unprecedented ways! No setbacks have been experienced yet. There is only one road ahead and it will lead us to success.
At this stage you, the data scientist, are probably already well aware of a number of project risks. You might want to keep these concerns to yourself, as you don’t want to come across as negative or, worse, as someone who is not up to the job ahead. Please don’t. If you foresee possible problems at this stage and you don’t speak out, they can come back like a boomerang when the problems actually occur. Rather, invite all stakeholders to perform a risk analysis together. As a group, you list the requirements for a successful outcome and try to identify what can get in the way of them. These requirements differ from project to project, of course. However, the usual suspects are having enough history in the database, having data of adequate quality, being able to join data from different sources, having an organisation that is ready to adopt the project, and having a strong enough relationship between the relevant variables in the first place.
Doing a risk analysis serves two purposes. Obviously, by describing possible problems up front, it is more likely they can be prevented or mitigated. Moreover, it can subtly shift the jubilant mood that is so typical of stakeholders at the start of a data science project to a more realistic view of the project, making them realise that success is not guaranteed.
Plan realistically and put in screw-up slack
Doing data science well takes time. Time to properly understand the problem, time to write good-quality code, time to figure out the relationships in your data. Whether it is by your boss, a project manager, a client, or a colleague, you are going to be asked how long it will take you to complete (a part of) the job. Trying to please when giving an estimate will almost certainly backfire later on. Try to list all the components that are part of the total job ahead and make a realistic estimate of how much time it will take you to do each properly.
Next, and this is crucial, add screw-up slack to it. This is a percentage of extra time that the project is going to take because you are going to screw up. How am I so sure you will? Because data science is hard and you are human. Junior, medior, senior, we all screw up. Numbers don’t add up because we programmed something incorrectly, and it takes us a day and a half to find the error. We thought we understood the data, but we didn’t, so some part of the analysis needs to be redone. We finally have a fancy server to train on the full set, but it keeps running out of memory when it shouldn’t. You can screw up in so many ways that saying you will somehow is a pretty safe bet. I think adding ten to twenty percent to the project time for unforeseen screw-ups is certainly not too much.
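
The estimate described above is simple arithmetic: sum the per-component estimates, then add a slack percentage on top. Here is a minimal sketch of that calculation. The component names and day counts are hypothetical, and the function name `estimate_with_slack` is mine; only the ten to twenty percent range comes from the text.

```python
from typing import Dict


def estimate_with_slack(component_days: Dict[str, float],
                        slack_fraction: float = 0.15) -> Dict[str, float]:
    """Sum per-component estimates and add a slack percentage on top."""
    if slack_fraction < 0:
        raise ValueError("slack_fraction must be non-negative")
    base = sum(component_days.values())
    slack = base * slack_fraction
    return {"base_days": base, "slack_days": slack, "total_days": base + slack}


if __name__ == "__main__":
    # Hypothetical breakdown of a project, in working days.
    components = {
        "understand the problem": 5,
        "explore and clean the data": 8,
        "build and evaluate the model": 10,
        "write up and hand over": 3,
    }
    # The lower and upper end of the range suggested in the text.
    for fraction in (0.10, 0.20):
        result = estimate_with_slack(components, fraction)
        print(f"slack {fraction:.0%}: "
              f"{result['base_days']:.1f} + {result['slack_days']:.1f} "
              f"= {result['total_days']:.1f} days")
```

Keeping the slack as a separate number, rather than quietly padding each component, makes it easier to explain to whoever asked for the estimate what it is for.
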
Keep stakeholders in the loop
This is as obvious as it is often postponed, or not done at all. A good project manager regularly asks the data scientist how things are going and communicates progress to the stakeholders. If you are in a project in which there is no project manager, make it your responsibility to inform the stakeholders at regular intervals. Be disciplined and meet with them at preset moments, or email them if a meeting cannot be arranged. Don’t delay updates until you have better or more news than you currently do. There is a big pitfall in letting thoughts like “I am sure the model will improve by x when I try this new fancy algorithm, just need to get it running, shouldn’t take long” postpone your updates. Don’t be apologetic in the updates; try to be as factual as you can be. “We tried this and it gave no performance enhancement, we are now off to try this.” If a project is heading towards failure, the stakeholders will then have been aware of it from early on, which makes actual failure easier to accept than when they are confronted with it unexpectedly.
Write a final report
Often, failed projects are never really completed. There is always new stuff left to try: maybe a different algorithm, or a different data source. All the time in the original planning has been used up, the project goals are not yet met, stakeholders start losing interest, and the project is left to linger, leaving you dissatisfied and maybe unwilling to give up. Writing a report that describes why the project was not a success is a good way to officially close it. Try to write down as meticulously as you can what the project goals were and why they were not attained. Again, being factual is the way to go: quantify, quantify, quantify. “For 101,221 out of the 567,436 customers in database A, there was no record in database B. So, for 17.8% of all our customers this crucial predictor was not available.” If you think there is still life in the project, include recommendations for restarting it. The final report informs the stakeholders and forces you to assess the failure objectively, thereby reducing self-blame for an unfinished, unsuccessful project.
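
Numbers like the ones in the quoted example are cheap to produce once you have the keys of both sources. The sketch below shows one way to do it with plain Python sets. The function names and the toy customer ids are hypothetical; in a real project the ids would come from your own databases.

```python
from typing import Iterable, Tuple


def missing_coverage(ids_a: Iterable, ids_b: Iterable) -> Tuple[int, int, float]:
    """Count the ids in source A that have no matching record in source B."""
    set_a, set_b = set(ids_a), set(ids_b)
    n_missing = len(set_a - set_b)
    n_total = len(set_a)
    share = n_missing / n_total if n_total else 0.0
    return n_missing, n_total, share


def report_line(n_missing: int, n_total: int, share: float) -> str:
    """Format the counts as a factual sentence for the final report."""
    return (f"For {n_missing:,} out of the {n_total:,} customers in database A, "
            f"there was no record in database B ({share:.1%}).")


if __name__ == "__main__":
    # Toy ids, standing in for the customer keys of two real sources.
    customers_a = range(1, 11)
    customers_b = {1, 2, 3, 4, 5, 6, 7, 8}
    print(report_line(*missing_coverage(customers_a, customers_b)))
```

Generating such sentences from the data, instead of typing the numbers by hand, keeps the report consistent with what was actually measured.
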