Two weeks ago, I launched my very first competition on the Kaggle platform, the goal of which is to design engaging data science notebooks with great visuals, narration, and insights. Today, at the half-way point of the competition, I want to write about my motivations for creating this challenge, the underlying data, and my criteria for what makes a captivating and accessible notebook in the R or Python languages. All of this is strongly connected to what I see as one of the most underrated skills in the fields of data science (DS) and machine learning (ML): the ability to effectively communicate your approach and your findings.
But let’s start at the beginning.
A Reason to Celebrate
Over the last two years, I managed to keep up a cadence of posting about three underrated notebooks per week, every week. Every Tuesday, without fail, I would publish a new episode of Hidden Gems on Kaggle and on social media. I’ve learnt a lot during this time from all the notebooks that I’ve reviewed. And reading new and interesting works has become an integral part of my week.
There were notable milestones on the way: for episode 50 I created a dedicated Kaggle dataset of all my Hidden Gems episodes together with a Starter Notebook written in tidyverse R. The notebook also drew connections to the comprehensive Meta Kaggle dataset, where Kaggle publishes all the metadata about the community.
And on Tuesday April 5th 2022 my little series reached its 100th episode! I’m still pretty amazed at this significant milestone, and at the fact that Kagglers have continued to visit and appreciate my episodes throughout those one hundred weeks. This definitely called for a celebration.
I much prefer to celebrate with my fellow Kagglers, so on the very same day I launched a surprise competition aimed at exploring the Hidden Gems data and its connections to related datasets. Over the preceding weeks and months, I had managed to put together a stellar panel of judges to evaluate the submissions, and to secure cool swag prizes (and more!) from Kaggle itself and from the Weights & Biases platform. Thanks so much for sponsoring this!
I also want to emphasize that in the design of this competition I drew on inspiration from the different judges, especially from Sanyam Bhutani, the host of the popular Chai Time Data Science podcast, who had hosted his own Notebooks competition back in 2020. I got a lot of ideas from that competition and I’m grateful to Sanyam and Rohan Rao for organising it and inviting me to the panel of judges back then.
The main goal of the Hidden Gems competition is to help people learn and practice how to craft engaging and insightful DS & ML presentations. All contributions are guaranteed to receive an evaluation and constructive feedback from a panel of highly experienced Kaggle community members. Below this overview infographic I share instructions on how to join, as well as my advice on the important elements of a successful notebook.
Evaluation Criteria: What makes a great notebook?
This is not a typical Kaggle competition with predictions that are scored on a leaderboard. Instead, submissions are Kaggle Notebooks. Anyone can participate simply by creating a public Notebook on the Hidden Gems dataset. All submissions will be evaluated by a panel of expert judges on five specific criteria.
These are the criteria which I see as the cornerstones of communicating accessible ML & DS insights. Here is a list explaining each criterion, in no particular order:
Quality of data visuals: How clean, well designed, and approachable are the visualisations? Does the type of visualisation match the kind of insight that is being communicated? Are there consistent styles or colour schemes?
Narration & storytelling: Is the analysis well explained and documented to allow the reader to understand insights and their context? Does the notebook follow an engaging flow? Are there certain narrative tools that make the work stand out?
Structure & presentation: Are the different parts of the notebook well defined? How do those parts relate to one another? Is the code clean? Are the key insights front and center, with the code and context still easily accessible?
Quality of insights: Are the insights relevant, useful, and actionable? Does the notebook tell us something new, unexpected, or counterintuitive? Does the work contain recommendations or predictions for the future?
Creativity & originality: How novel and inventive are the ideas and approach? Am I positively surprised by some stylistic choices or discovered findings? Does the author have their own style, and perhaps take some design risks in expressing it?
You might notice that those criteria have a strong overlap with best practices in data journalism. That makes sense, because the goals are very similar: to clearly present a cohesive narrative that is insightful and accessible to an audience. As with many ways of communicating, the details here depend on who your audience is and what kind of findings you are communicating. The Kaggle community is likely more technically inclined than most, so you could decide to put a strong emphasis on well documented data wrangling or model architectures. Or you could hide all the code in your notebook and rely on a strong narrative with a few hand-crafted visuals, like in a good piece of popular data journalism.
How to join
If you share my passion for dataviz and storytelling, then you might now be interested to know how you can join this competition to practice your skills and win some prizes. You might even consider joining the Kaggle platform just to compete in this challenge, in which case I’d be very honoured. This might be a good time to mention that there is a special prize category for people who write their very first Kaggle notebook.
To join, go to the dataset page. At the very top, you will find a button that says “New Notebook”. Click that, and choose between Python or R (and R Jupyter notebook or Rmarkdown). You will find yourself in the Kaggle editor, where the dataset is already available, and you can start exploring.
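If you pick Python, a first exploration step could look like the minimal sketch below. It relies on the fact that the Kaggle editor mounts attached datasets under `/kaggle/input`. I don't list the file names of the Hidden Gems dataset here, so the sketch simply walks that folder and previews every CSV file it finds. The helper name `preview_csv_files` is my own choice for illustration and not part of any Kaggle API.

```python
import csv
from pathlib import Path


def preview_csv_files(root, n_rows=3):
    """Map each CSV file under `root` to its header plus the first `n_rows` data rows."""
    previews = {}
    root = Path(root)
    if not root.is_dir():
        # Outside of Kaggle the input folder may not exist; return nothing.
        return previews
    for path in sorted(root.rglob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle)
            # zip stops after n_rows + 1 rows (header included) without reading the rest
            rows = [row for _, row in zip(range(n_rows + 1), reader)]
        previews[path.relative_to(root).as_posix()] = rows
    return previews


if __name__ == "__main__":
    # Assumption: the notebook runs in the Kaggle editor with the dataset attached.
    for name, rows in preview_csv_files("/kaggle/input").items():
        print(name)
        for row in rows:
            print("  ", row)
```

This only uses the Python standard library, so it runs in a fresh notebook without extra installs. From there you would typically switch to your preferred data frame library for the actual analysis.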
If you would like to watch some examples of what can be done with the dataset and how to use the Kaggle editor, here are a few live streams done by the members of the judge panel:
The launch stream by Sanyam Bhutani and myself – where we explain the background and setup of the competition in detail. I go through my starter notebook, and Sanyam shows how to connect to the Weights & Biases platform.
Stream on Abhishek Thakur’s YouTube channel – focussing on overall EDA advice and how to write engaging notebooks. We answer quite a few questions from the audience and talk about the notebook that won the first weekly prize.
Live coding with Rob Mulla on Twitch – a more in-depth chat about motivations and various EDA techniques in R and Python. I explain my R starter notebook and Rob starts a live EDA in Python.
Live coding EDA by Andrada Olteanu hosted by Sanyam Bhutani – full of great advice for notebook design and dataviz crafting. Andrada starts a live EDA in Python.
The Kaggle editor works pretty well for Jupyter-style notebooks. When it comes to Rmarkdown content, I prefer to build it locally in RStudio and then copy/paste my work into the Kaggle environment and run it there (button “Save Version”). This way, you can take advantage of all the RStudio tools.
Hopefully this post got you interested in joining the ongoing competition to practice your skills when it comes to EDA and notebook design. I would love to see your ideas and creativity! And to reward them with some cool swag prizes.