Last week, we talked all about Artificial Intelligence (and Artificial Stupidity), which led me to think about the foundation of Data Science: the data itself. I think data is the least appreciated entity in the Data Science value chain. You might agree with me if you do Data Science outside competitive platforms like Kaggle, where the data handed to you is what most Data Scientists dream about in their jobs.
“One thing ImageNet changed in the field of AI is suddenly people realized the thankless work of making a dataset was at the core of AI research. People really recognize the importance the dataset is front and center in the research as much as algorithms.” – Fei-Fei Li
Meanwhile, venture capitalists aren't shying away from putting their money where data is created and curated. Recently, Silicon Valley startup Scale AI hit unicorn status. Scale AI's About Us page reads:
The Data Platform for AI
Scale AI has also open-sourced datasets, and that's sweet.
Build your own Data
Zalando, which open-sourced Fashion-MNIST, published a nice paper that lists the steps they took to publish the dataset. There are also free tools like labelImg and makesense.ai to help you annotate images for a typical image dataset. For NLP annotation, BRAT is a nice free open-source tool. And if you are planning a pet project and don't have the required dataset, this tutorial by Mat Kelcey on counting bees on a rasp pi with a conv net would be a tremendous help.
In R, check out this to learn how to generate meaningful fake data for learning, experimentation and teaching using
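To give a flavour of what "meaningful" fake data can look like, here is a minimal sketch in Python using only the standard library. It is not the R approach mentioned above. The function name `make_fake_customers`, the columns and the age-to-spend relationship are all made up for illustration. The idea is simply that the columns relate to each other, so the data behaves a bit like the real thing when you explore or model it.

```python
import random


def make_fake_customers(n, seed=42):
    """Return n fake customer records whose columns are related to each other.

    Everything here (column names, ranges, the spend formula) is a
    hypothetical example, not taken from any real dataset.
    """
    rng = random.Random(seed)  # seeded, so the same data comes back every run
    rows = []
    for i in range(n):
        age = rng.randint(18, 80)

        # Derive the segment from age so the two columns are consistent
        if age < 25:
            segment = "student"
        elif age < 65:
            segment = "professional"
        else:
            segment = "retired"

        # Spend rises with age, plus Gaussian noise, and never goes below zero
        monthly_spend = max(0.0, round(20 + 1.5 * age + rng.gauss(0, 10), 2))

        rows.append(
            {"id": i, "age": age, "segment": segment, "monthly_spend": monthly_spend}
        )
    return rows


if __name__ == "__main__":
    for row in make_fake_customers(5):
        print(row)
```

Because the generator is seeded, anyone following along gets identical rows, which is handy for teaching. Change the seed, or the made-up formula, to get a different dataset.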
That said, if you appreciate Data Science as much as you'd appreciate the beauty of a Ferrari or Lamborghini, then you might also have to remind yourself that the car is only useful if it has oil in it. Here, that oil is your super-clean, labelled data that's usable for Data Science and Machine Learning.
If you liked this, please subscribe to my Data Science Newsletter and share it with your friends!