Intro + hackathon description
Have you ever participated in a hackathon? A month ago my answer was still no (despite the fact that I’ve been working in the R Shiny/Data Science space for 5 years). Now I have checked it off my list. And my team won, so I’m 1 for 1, with a 100% success rate, and I can retire now. More seriously, the team and I made tons of mistakes and just a few good decisions, and we think it would be beneficial to share the lessons learned, from both the coding and the work organisation perspectives.
What was the hackathon about? It was organised by Analyx, a company from the city of Poznań, and held in their office. A great job done on their side! The problem to be solved was as follows: in 2018 one of the tourist organisations published a “popularity index” of Polish cities, and Poznań got a low score, especially in comparison to similar Polish cities, e.g. Wrocław. That confirmed what the city authorities had seen for a long time: that Poznań is not as attractive to tourists as it should be. So Poznań’s office of promotion gathered all of the available data from the Central Statistical Office of Poland and gave it to the hackathon participants with an expected result: tell us what to do to make Poznań great again, especially in how it’s perceived by tourists.
The general task was divided into several parts that were assessed by the jury:
- Some values of the “popularity index” were missing: teams had to build a model to predict them, and the teams’ mean absolute error (MAE) was compared
- Reproducibility and code organisation
- Interpretability of the model
- Value of conclusions for the city of Poznań
- The quality of the presentation at the end of the hackathon
- The approach to solving a side quest: plan a trip through 5 Polish cities for a fictional family based on their preferences
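For readers who haven't met the metric in the first criterion: MAE is simply the average absolute difference between predicted and true values, so lower is better. A minimal sketch in plain Python (we worked in R at the hackathon; the numbers below are made up for illustration and are not hackathon data):

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between true and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical popularity-index values for three cities (illustrative only).
actual = [10.0, 20.0, 30.0]
predicted = [12.0, 18.0, 33.0]

mae = mean_absolute_error(actual, predicted)
print(f"MAE: {mae:.3f}")
```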
There were also two nice concepts introduced by the organisers. They made the work more fun and emphasized the crucial concept in data science – you need to look for helpful data around you:
- each team was allotted 7 “points” that could be spent on additional datasets, and each dataset had its own “price”. But you needed to choose wisely, for you could only afford some of them! So eventually each team had different data to use according to their brilliant (or ill-advised) ideas.
- the names of the Polish cities in the dataset were disguised as fruit names! So part of the game was to decode the names – it turned out that ‘lemon’ was a nickname for ‘Warsaw’ 🙂
How we organised our work
So, we rolled up our sleeves and got to work! We had nine hours to complete the tasks. There was no time to lose! We decided to call our team lotR, after our favorite band of in-over-their-heads adventurers.
But we really had no idea where to start. Someone said, “I don’t know, maybe we should look at the data.” Then the question became “Wait, how do we get the data?” But we figured it out and were very proud of our first success of the day! 😉
We realized that we didn’t have much experience with hackathons together as a team, so we needed to organise the way we work. What developed quite naturally was the model of hierarchical work splitting: we started by just looking at the data and getting familiar with it all together. Then we formed two groups: one focused on decoding the fruit names plus building the trip planner app and the second group focused on building the predictive model plus its interpretability. Each group then divided into sub-groups (well, consisting of only 1 person) that wrote code for a specific area. We kept everyone informed of our progress and helped each other with issues.
I’m sure that sounds great, but 2 hours into the hackathon I was honestly afraid that we would have nothing to show for our efforts, and we would have to get up on stage after 9 hours and admit our failure. But we eventually achieved some results, so lesson learned: do not give up!
There was one obstacle to communication: we shared a room with a competing team! It caused some mistrust at the beginning. After a whole day together, though, we got along, and it actually turned out to be a great setup for networking and building relationships. We honestly kept our fingers crossed for the other team’s success during the presentations.
Our delivered solution
We started slowly and in a very unorganised way, and lost a lot of time finding the optimal way to operate technically. So here is the lesson learned: set up a repo, a Slack channel, and a common environment sooner rather than later. Integrate the tools that might be useful ahead of time: we struggled with H2O on Mac and with the H2O/DALEX integration (and it was entirely predictable that DALEX would be used, as prof. Przemysław Biecek was on the jury).
I previously mentioned H2O. As the prediction error was just a part of the final score, we decided not to overthink the solution and to use H2O’s autoML for R. It turned out to be the best strategy: our model was good enough (we had the 3rd best result in this section) and it took us just one click to generate the model and predictions. We used the saved time to improve the interpretability of the model and craft interesting recommendations.
But please don’t think that no work is needed when using such automated solutions. What we discovered is that the solution is quite sensitive to feature selection. And in our case that was a major issue: in the dataset there were 59 observations with ‘popularity index’ values and 7 to be predicted, but over 570 variables! Those were mostly financial budgets of each city per category and counts of tourist facilities, like hotels and restaurants. We took the following steps, and reducing the number of variables in each step improved our autoML model:
- Removed all of the variables that had 0 variance (the same value for all observations; usually all equal to 0). After this step we were left with 470 variables.
- Removed all highly correlated variables using the caret::findCorrelation function. We got down to 426 variables.
- Removed all unimportant variables using caret’s recursive feature elimination with random forests (caret::rfe, configured via caret::rfeControl). We ended up with 16 variables.
- Added important variables from the datasets that we “bought” with our “points”: social media comments about Poznań and reviews from a travel portal. Our final dataset had 30 variables and gave the best results.
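To make the first two filtering steps concrete, here is a minimal, dependency-free sketch in plain Python. It is not the code we ran: we used R and caret, and caret::findCorrelation chooses which variable of a correlated pair to drop based on mean absolute correlation, whereas this sketch simply keeps the first column it sees. The column names, values, and the 0.9 cutoff are all assumptions for illustration, and the recursive feature elimination step is left out.

```python
from statistics import mean


def drop_constant(columns):
    """Step 1: keep only columns whose values are not all identical."""
    return {name: vals for name, vals in columns.items() if len(set(vals)) > 1}


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    if sx == 0 or sy == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / (sx * sy)


def drop_correlated(columns, cutoff=0.9):
    """Step 2 (simplified): greedily keep a column only if it is not
    highly correlated with any column that has already been kept."""
    kept = {}
    for name, vals in columns.items():
        if all(abs(pearson(vals, other)) <= cutoff for other in kept.values()):
            kept[name] = vals
    return kept


# Hypothetical city-level variables (illustrative only).
columns = {
    "hotels": [1, 2, 3, 4, 5],
    "hotel_beds": [2, 4, 6, 8, 10],     # perfectly correlated with "hotels"
    "unused_budget": [0, 0, 0, 0, 0],   # zero variance
    "museums": [3, 1, 4, 1, 5],
}

selected = drop_correlated(drop_constant(columns))
print(list(selected))
```

With only 59 labelled observations and hundreds of columns, cheap filters like these are worth running before any model sees the data.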
We were quite surprised that a solution as advanced as autoML, which in just a few minutes produces results comparable to those of super advanced deep learning models, does not include a step like removing the correlated variables.
The biggest value of the model was to find out why a city was popular or not. We spent more time on interpretability. We discovered that public spending on culture and education by a city correlated with its popularity with tourists. Other correlations include the number and quality of reviews on Tripadvisor and Twitter.
So one of our recommendations to Poznań was to encourage small businesses there to get active on social media platforms.
The lotR team also recommended that Poznań enhance the cultural side of the city. Poznań doesn’t have natural attractions like mountains or a seaside. It does have some historical points of interest, such as a cathedral with the graves of Poland’s first kings, and it hosts a number of massive events like concerts and picnics. (We later discovered that Poznań is still pretty happening even at 2:30 AM. As people from the “Lemon” city, we were surprised to see that.)
Our big plus was the “Trip Planner” in the form of a basic Shiny app. It makes the solution easy to reproduce, in a user-friendly way, for different sets of family preferences. The purpose of the app is to recommend road trips for families. It captures the family members’ interests with a questionnaire, calculates the average preferences of the family, compares them with the offerings of the various cities, and finds the 5 best destinations within 1500 km. In short: it’s an app for family road trip ideas!
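The core logic of such a planner is small. Below is a rough sketch in plain Python rather than the Shiny app itself: the scoring rule (a weighted sum of averaged preferences and city offerings), the data layout, and every number are assumptions made for illustration. It also reads the 1500 km limit as the distance from the family's starting point, which is a simplification.

```python
def average_preferences(members):
    """Average each interest score across all family members."""
    interests = members[0].keys()
    return {i: sum(m[i] for m in members) / len(members) for i in interests}


def recommend(members, cities, max_km=1500, top_n=5):
    """Return the names of the best-matching cities within max_km."""
    prefs = average_preferences(members)
    scored = []
    for city in cities:
        if city["distance_km"] > max_km:
            continue  # too far away for this trip
        # Weighted sum: how well the city's offer matches the family's taste.
        score = sum(prefs[i] * city["offer"].get(i, 0.0) for i in prefs)
        scored.append((score, city["name"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:top_n]]


# Hypothetical questionnaire answers (1 = not interested, 5 = very interested).
family = [
    {"museums": 5, "nature": 1},
    {"museums": 1, "nature": 3},
]

# Hypothetical cities, disguised as fruit in the spirit of the hackathon.
cities = [
    {"name": "lemon", "distance_km": 300, "offer": {"museums": 0.9, "nature": 0.2}},
    {"name": "apple", "distance_km": 200, "offer": {"museums": 0.3, "nature": 0.9}},
    {"name": "plum", "distance_km": 1800, "offer": {"museums": 1.0, "nature": 1.0}},
    {"name": "pear", "distance_km": 500, "offer": {"museums": 0.5, "nature": 0.5}},
]

trip = recommend(family, cities, top_n=2)
print(trip)
```

Averaging is the simplest way to combine preferences; it can hide the fact that one family member dislikes a destination everyone else loves, so a real planner might want a fairer aggregation rule.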
Results – lessons learned
The presentation part was done by me because the other guys on the team were busy drinking beers in the audience. Ok, to be honest, we made this decision to prevent the chaos of having four presenters on stage at the same time with little prep time, and it turned out to be a good choice. So we recommend assigning one person to give the final presentation.
Next, don’t worry about the details, just focus on what is useful. Nine hours go by quickly in this type of event. Focus on what is actually assessed! We noticed that the model was only worth 20% of the total score, so we used an off-the-shelf automated modelling tool (H2O’s autoML) instead of building a model by hand, leaving us time to focus on other tasks. We were surprised by the technical issues that we encountered, so it’s wise to prepare your work environment and tools before the hackathon starts.
R Shiny allowed us to develop a decent looking solution in a short amount of time. We were the only team to use it, so maybe it gave us a little edge. 🙂
It was a great experience and got us out of our typical coding environment. After the competition, the city center was still pretty happening, with crowds everywhere. Poznań really looked like a “best kept secret” type of city.
Do you have your own hackathon hacks? Please add them in the comments below.