“Overengineering is the act of designing a product to be more robust or have more features than often necessary for its intended use, or for a process to be unnecessarily complex or inefficient.” This is how the Wikipedia page on overengineering starts. It is the diligent engineer, the one who wants to make sure that every possible feature is incorporated in the product, who creates an overengineered product. We find overengineering in real-world products as well as in software. It is a relevant concept in data science too, first of all because software engineering is very much a part of data science. We should be careful not to create dashboards, reports and other products that are too complex and contain more information than the user can stomach. But maybe there is a second, more subtle lesson in overengineering for data scientists. We might create machine learning models that predict too well. Sounds funny? Let me explain what I mean by it.
In machine learning, theoretically at least, there is an optimal model given the available data in the train set. It is the one that gives the best predictions on new data, the one that has just the right level of complexity. It is not so simple that it misses predictive relationships between features and target (i.e. it is not underfitting), but it is also not so complex that it incorporates random noise from the train set (i.e. it is not overfitting). The gold standard within machine learning is to hold out a part of the train set to represent new data, in order to gauge where on the bias-variance continuum the predictor is. This is done by using a test set, by using cross-validation, or, ideally, by using both.
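To make the hold-out idea concrete, here is a minimal sketch in plain Python. Everything in it is an assumption chosen for illustration: the toy data (a sine curve plus noise), the k-nearest-neighbours regressor, and the helper names `make_data`, `knn_predict`, `mse` and `cross_val_mse`. Note that for k-nearest-neighbours a smaller `k` means a more complex model, so `k=1` sits at the overfitting end and a very large `k` at the underfitting end.

```python
import math
import random

def make_data(n, seed=0):
    """Toy data (an assumption for illustration): y = sin(x) + Gaussian noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 6) for _ in range(n)]
    return [(x, math.sin(x) + rng.gauss(0, 0.3)) for x in xs]

def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def mse(train, eval_points, k):
    """Mean squared error on eval_points of a k-NN model that uses `train`."""
    errors = [(knn_predict(train, x, k) - y) ** 2 for x, y in eval_points]
    return sum(errors) / len(errors)

def cross_val_mse(data, k, n_folds=5):
    """Average held-out MSE over n_folds disjoint folds of `data`."""
    folds = [data[i::n_folds] for i in range(n_folds)]
    scores = []
    for i, held_out in enumerate(folds):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        scores.append(mse(train, held_out, k))
    return sum(scores) / n_folds

data = make_data(200)
train_set, test_set = data[:150], data[150:]  # the test set is kept aside until the end

results = {}
for k in (1, 5, 15, 75):
    results[k] = {
        "train": mse(train_set, train_set, k),  # error on data the model has seen
        "cv": cross_val_mse(train_set, k),      # error on held-out folds
    }
    print(f"k={k:>2}  train MSE={results[k]['train']:.3f}  CV MSE={results[k]['cv']:.3f}")

# Pick the complexity with the lowest cross-validated error, then check it once on the test set.
best_k = min(results, key=lambda k: results[k]["cv"])
test_mse = mse(train_set, test_set, best_k)
print(f"chosen k={best_k}, test MSE={test_mse:.3f}")
```

With `k=1` the model reproduces the train set perfectly, so its train error is zero while its cross-validated error is not. That gap is what the held-out data is there to reveal.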
Machine learning competitions, like the ones on Kaggle, challenge data scientists to find the model that is as close to the theoretical optimum as possible. Since different models and machine learning algorithms typically excel in different areas, the optimal result is often attained by combining them in what is called an ensemble. It is not uncommon for ML competitions to be won by multiple contestants who joined forces and combined their models into one big super model.
In the ML competition context, there is no such thing as “predicting too well”. Predicting as well as you can is the sole goal of these competitions. In real-world applications, however, this is not the case, in my opinion. There the objective is (or maybe should be) creating as much business value as possible. With this goal in mind, we should realize that optimizing machine learning models comes with costs. Obviously, there is the salary of the data scientist(s) involved. The closer you come to the optimal model, the harder you’ll need to scrape for improvement. Most likely, as the project progresses there will be diminishing returns on the time spent, in terms of gained prediction accuracy.
But costs can also be in the complexity of the implementation. I don’t mean the model complexity here, but the complexity of the product as a whole. The amount of code written might increase sharply when more complex features are introduced. Or using a more involved model might require the training to run on multiple cores, or might increase the training time, say, fivefold. Making your product more complex makes it more vulnerable to bugs and more difficult to maintain in the future. Although the predictions of a more complex model might be (slightly) better, its business value might actually be lower than that of a simpler solution, because of this vulnerability.
The strange-sounding statement in the introduction of this blog, “We might create machine learning models that predict too well”, might make more sense now. Too much time and money can be invested in creating a product that is too complex and predicts better than the business needs it serves require. In other words, we are overengineering the machine learning solution. We might say that we are overpredicting.
There are at least two ways that will help you not to overengineer a machine learning product. First of all, by building a product incrementally. It is probably no surprise, coming from a proponent of working in an agile way, that I think starting small and simple is the way to go. If the predictions are not up to par with the business requirements, see where the biggest improvement can be made in the least amount of time while adding the least amount of complexity to the product. Then assess again and start another cycle if needed, until you arrive at a solution that is just good enough for the business need. We could call this Occam’s model: the simplest possible solution that fulfills the requirements.
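The cycle described above can be sketched as a small greedy loop. This is only an illustration under loud assumptions: the `Improvement` class, the `plan_increments` function, the backlog items and all the numbers are hypothetical, and the sketch pretends that the gain of each improvement is known in advance and simply adds up. In a real project you would rebuild the model and measure again after every step.

```python
from dataclasses import dataclass

@dataclass
class Improvement:
    name: str
    expected_gain: float  # estimated accuracy gain in percentage points (hypothetical)
    effort_days: float    # estimated effort to build and maintain (hypothetical)

def plan_increments(baseline, required, backlog):
    """Repeatedly apply the improvement with the best gain per day of effort,
    and stop as soon as the required score is reached."""
    score, applied = baseline, []
    remaining = list(backlog)
    while score < required and remaining:
        best = max(remaining, key=lambda imp: imp.expected_gain / imp.effort_days)
        remaining.remove(best)
        score += best.expected_gain  # in practice: rebuild and re-measure here
        applied.append(best.name)
        print(f"after '{best.name}': {score} (required: {required})")
    return score, applied

# Hypothetical backlog of ideas, in no particular order.
backlog = [
    Improvement("stacked ensemble", 2, 10),
    Improvement("date features", 6, 1),
    Improvement("hyperparameter tuning", 3, 2),
    Improvement("gradient boosting", 5, 2),
]
final_score, applied = plan_increments(baseline=70, required=80, backlog=backlog)
```

With these made-up numbers the loop stops after the date features and the gradient boosting model, at a score of 81. The hyperparameter tuning and the expensive stacked ensemble are never built, because the requirement was already met without them.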
Secondly, by realising that the call on whether the predictions are good enough to meet business needs is a business decision, not a data science choice. If you have someone on your team who is responsible for allocation of resources, planning, etc. (PO, manager, business lead, whatever they are called), it should be predominantly their call whether there is need for further improvement. The question these people ask data scientists is too often “Is the model good enough already?”, where it should be “What is the current performance of the model?”. As a data scientist in the midst of optimisation, you might not be the best judge of what is good enough. Our ideas for further optimisation and our general perfectionism could cloud our judgement. Rather, we should make it our job to inform the business people as well as we can about the current performance, and leave the final call to them.