Machine learning (ML) and AI have become the new buzzwords in town. As a result, there is strong demand for data scientists and machine learning engineers across industries including IT, telecom, automotive, manufacturing and many more. Today, hundreds if not thousands of online machine learning courses teach people from different domains to use machine learning in their day-to-day activities. Most of these courses focus entirely on applying the algorithms while ignoring whether the chosen model suits the application. Many forget that most machine learning models are built on statistical theory, and focus narrowly on the models' accuracy metrics. This has had a ripple effect: models perform extremely well in the lab but fail in the real world.
To give you an example, one of the simplest data mining and clustering algorithms is k-means clustering. It is also one of the most popular algorithms in the ML domain, with over 4000 posts on Stack Overflow. There are at least 10 variations of the k-means model available today. K-means is used in applications such as predictive maintenance, big data analytics, image processing, risk assessment and many others. The k-means model comes with assumptions that have to be met before we can use it to cluster data, such as:
- The algorithm assumes that the variance of each variable's distribution is spherical, that is, clusters are roughly round rather than elongated.
- All the features have the same variance (hence, the data is scaled before clustering).
- The prior probability of each cluster is the same. This means that each cluster has roughly an equal number of observations.
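As a rough illustration of how these assumptions could be checked on a labelled data set, here is a minimal Python sketch using only the standard library. The function names `variance_ratio` and `assumption_report` are hypothetical and made up for this example; they are not part of any package mentioned in this post. Ratios close to 1 suggest an assumption roughly holds, while large ratios are a warning sign.

```python
import math
from collections import Counter
from statistics import pvariance


def variance_ratio(columns):
    """Largest feature variance divided by the smallest (1.0 means equal)."""
    variances = [pvariance(col) for col in columns]
    if min(variances) == 0:
        return math.inf  # a constant feature makes the ratio undefined
    return max(variances) / min(variances)


def assumption_report(points, labels):
    """Rough diagnostics for the three k-means assumptions listed above."""
    # Group the points by their cluster label.
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    sizes = Counter(labels)
    return {
        # Equal feature variance over the whole data set (scaling check).
        "feature_variance_ratio": variance_ratio(list(zip(*points))),
        # Crude proxy for sphericity: equal spread per feature inside a cluster.
        "within_cluster_variance_ratio": {
            label: variance_ratio(list(zip(*members)))
            for label, members in clusters.items()
        },
        # Equal cluster probability: largest cluster size over the smallest.
        "cluster_size_ratio": max(sizes.values()) / min(sizes.values()),
    }


points = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (11.0, 11.0), (10.0, 11.0), (11.0, 10.0)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
report = assumption_report(points, labels)
```

The sphericity check here compares variances along the original axes only, so it would miss an elongated cluster that lies diagonally; treat it as a first look rather than a formal test.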
If any of these assumptions are unmet in your data, you might end up with a bad model or result. In a few extreme cases, the results might look accurate in the lab but the model might still fail in production. This is just one model; there are various other clustering models such as c-means, Gaussian mixture models, spectral clustering, DBSCAN etc. Unlike k-means, a few of these models come with no stated assumptions, which makes it challenging to judge the integrity of the model's results. Luckily, we can fall back on statistical theory to test the model and its results, and push it to production with much greater confidence (and skip the embarrassment).
I have developed an R package called clusterTesting to do just that. The package uses the data and the cluster information to determine a sample size for the analysis. Then a normality test is performed to see if the data is normally distributed. If it is, a parametric approach is used to test the validity of the results. If not, a non-parametric test is used to test the integrity of our clustering results.
All the installation instructions and examples are provided in the Readme.md file of my repository. Feel free to use it, and do let me know if you have questions or recommendations.
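To make the normality-then-test workflow concrete, here is a small Python sketch of the general idea for a single feature. It is not the clusterTesting API, and the specific choices are assumptions made for this example: a Jarque-Bera test on within-cluster residuals for normality, the one-way ANOVA F statistic as the parametric measure, and the Kruskal-Wallis H statistic (without tie correction) as the non-parametric one. To stay dependency-free, the p-value comes from shuffling cluster labels instead of from the F or chi-square distribution. All function names are hypothetical.

```python
import math
import random


def avg(xs):
    return sum(xs) / len(xs)


def jarque_bera_pvalue(xs):
    """Jarque-Bera normality test; chi-square(2) tail is exp(-JB / 2)."""
    n, m = len(xs), avg(xs)
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    m4 = sum((x - m) ** 4 for x in xs) / n
    skew, kurt = m3 / m2 ** 1.5, m4 / m2 ** 2
    return math.exp(-(n / 6) * (skew ** 2 + (kurt - 3) ** 2 / 4) / 2)


def f_statistic(groups):
    """One-way ANOVA F: between-cluster over within-cluster variance."""
    n, k = sum(len(g) for g in groups), len(groups)
    grand = avg([x for g in groups for x in g])
    means = [avg(g) for g in groups]
    between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (between / (k - 1)) / (within / (n - k))


def kruskal_h(groups):
    """Kruskal-Wallis H on pooled ranks (ties get their average rank)."""
    pooled = sorted(x for g in groups for x in g)
    rank, i = {}, 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        rank[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    n = len(pooled)
    total = sum(sum(rank[x] for x in g) ** 2 / len(g) for g in groups)
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)


def permutation_pvalue(groups, stat, n_perm=1000, seed=0):
    """Share of label shuffles whose statistic reaches the observed one."""
    rng = random.Random(seed)
    observed = stat(groups)
    pooled = [x for g in groups for x in g]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for g in groups:
            shuffled.append(pooled[start:start + len(g)])
            start += len(g)
        hits += stat(shuffled) >= observed
    return (hits + 1) / (n_perm + 1)


def validate_clusters(groups, alpha=0.05):
    """Pick the statistic based on residual normality, then test separation."""
    residuals = [x - avg(g) for g in groups for x in g]
    normal = jarque_bera_pvalue(residuals) > alpha
    stat = f_statistic if normal else kruskal_h
    return {"normal": normal, "test": stat.__name__,
            "p_value": permutation_pvalue(groups, stat)}


# Three well-separated clusters of one feature, listed cluster by cluster.
groups = [[base + 0.1 * i for i in range(10)] for base in (0.0, 5.0, 10.0)]
result = validate_clusters(groups)
```

A small `p_value` suggests the clusters differ on that feature by more than chance would explain. Normality is checked on residuals rather than on the pooled values because data from well-separated clusters is multimodal by construction.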
Follow me on GitHub: https://github.com/nagdevAmruthnath