It’s officially winter, so what could be better than drinking hot chocolate while querying the new Stack Overflow dataset in BigQuery? It has every Stack Overflow question, answer, comment, and more, which means endless possibilities for data crunching. Inspired by Felipe Hoffa’s post on how response time varies by tag, I wanted to look at the comments table (53 million rows!).
The happiest Stack Overflow tags 🙂
To measure happy comments I looked at comments with “thank you”, “thanks”, “awesome” or “:)” in the body. I limited the analysis to tags with more than 500,000 comments. Here’s the query:
    #standardSQL
    SELECT
      tag,
      ROUND((COUNT(CASE WHEN comment_text LIKE '%thanks%'
                          OR comment_text LIKE '%:)%'
                          OR comment_text LIKE '%thank you%'
                          OR comment_text LIKE '%awesome%'
                        THEN 1 END) / COUNT(*)) * 100, 2) AS percent_happy,
      COUNT(*) total_comments
    FROM (
      -- Comments left on questions
      SELECT LOWER(a.text) AS comment_text, SPLIT(b.tags, '|') AS tags
      FROM `bigquery-public-data.stackoverflow.comments` a
      JOIN `bigquery-public-data.stackoverflow.posts_questions` b
        ON a.post_id = b.id
      UNION ALL
      -- Comments left on answers, tagged via the parent question
      SELECT LOWER(b.text) AS comment_text, SPLIT(c.tags, '|') AS tags
      FROM `bigquery-public-data.stackoverflow.posts_answers` a
      JOIN (
        SELECT post_id, text
        FROM `bigquery-public-data.stackoverflow.comments`
      ) b ON a.id = b.post_id
      JOIN `bigquery-public-data.stackoverflow.posts_questions` c
        ON c.id = a.parent_id
    ), UNNEST(tags) tag
    GROUP BY 1
    HAVING total_comments > 500000
    ORDER BY percent_happy DESC
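To make the aggregation easier to follow, here is a minimal Python sketch of the same logic on a handful of in-memory rows. It is not how BigQuery executes the query. The function name `percent_happy_by_tag`, the `min_comments` parameter, and the sample rows are all made up for illustration; only the four search terms and the `|`-separated tag format come from the query above.

```python
from collections import Counter

# Search terms taken from the query; matching is done on lowercased text.
HAPPY_TERMS = ("thanks", ":)", "thank you", "awesome")


def percent_happy_by_tag(rows, min_comments=0):
    """Return (tag, percent_happy, total_comments) tuples, happiest first.

    rows: iterable of (comment_text, "tag1|tag2") pairs (hypothetical shape).
    Only tags with more than min_comments comments are kept, mirroring HAVING.
    """
    total = Counter()
    happy = Counter()
    for text, tags in rows:
        lowered = text.lower()
        is_happy = any(term in lowered for term in HAPPY_TERMS)
        # A comment counts once for every tag on its question (the UNNEST step).
        for tag in tags.split("|"):
            total[tag] += 1
            if is_happy:
                happy[tag] += 1
    result = [
        (tag, round(happy[tag] / count * 100, 2), count)
        for tag, count in total.items()
        if count > min_comments
    ]
    return sorted(result, key=lambda row: row[1], reverse=True)


# Invented sample data, only to exercise the function.
sample_rows = [
    ("Thanks, that fixed it :)", "r|ggplot2"),
    ("This does not compile", "c++"),
    ("Awesome answer", "r"),
    ("Why would you do this?", "c++|r"),
]
ranking = percent_happy_by_tag(sample_rows)
```

Because each comment is counted once per tag, a comment on a question with several tags contributes to every one of those tags' totals.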
Here’s the result in BigQuery:
And the chart:
R, Ruby, HTML / CSS, and iOS are the communities with the happiest commenters according to this list. People who ask questions about XML and regular expressions also seem particularly thankful for help. If you’re curious, here are the 15 highest-scoring happy comments that were short enough to fit in a screenshot (and their associated tags):
But because people sometimes get angry on the internet, you’re probably wondering…
The angriest Stack Overflow tags 🙁
For angry comments, I counted those with “wrong”, “horrible”, “stupid”, or “:(” in the body. The SQL is the same as above with the search terms swapped out. Here’s the result:
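Swapping the search terms only changes the condition inside the `CASE` expression. As a rough sketch, a small helper could generate that condition from a list of terms. The helper name `like_condition` is hypothetical, and it assumes the terms are trusted constants with no quotes or `LIKE` wildcards (`%`, `_`) in them, since it does no escaping.

```python
def like_condition(terms, column="comment_text"):
    """Build the OR-ed LIKE clause for the CASE expression.

    Assumes trusted, hard-coded terms: no escaping is performed.
    """
    return " or ".join(f"{column} like '%{term}%'" for term in terms)


# Angry search terms as listed in the post.
ANGRY_TERMS = ["wrong", "horrible", "stupid", ":("]
angry_condition = like_condition(ANGRY_TERMS)
```

The resulting string would replace the four `LIKE` checks in the happy query, with the rest of the SQL left unchanged.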
And the chart:
Clearly the angriest comments are those related to C derivatives. Many programming concepts also wound up here: multithreading, arrays, algorithms, and strings. And here are the highest-scoring angry comments:
This analysis is not perfect, as the comment “that one’s so stupid it underflows and becomes awesome” appears in both lists. That’s where a machine learning tool like the Natural Language API would come in handy.
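The overlap is easy to reproduce with plain substring matching. The sketch below uses the two term lists from this post; the function name `labels` is made up for illustration.

```python
# Term lists from the post; matching is on lowercased text, as in the query.
HAPPY_TERMS = ("thanks", ":)", "thank you", "awesome")
ANGRY_TERMS = ("wrong", "horrible", "stupid", ":(")


def labels(comment):
    """Return the set of labels whose terms appear anywhere in the comment."""
    lowered = comment.lower()
    found = set()
    if any(term in lowered for term in HAPPY_TERMS):
        found.add("happy")
    if any(term in lowered for term in ANGRY_TERMS):
        found.add("angry")
    return found


comment = "that one's so stupid it underflows and becomes awesome"
both = labels(comment)  # contains both "happy" and "angry"
```

Substring matching has no notion of context, so a single comment can land in both buckets whenever it contains a term from each list.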
Between the two lists there were only a few tag overlaps. The most excitable tags (I’m interpreting tags that showed up in both the happy and angry lists as ‘excitable’) are ios, iphone, objective-c, and regex. And while the internet may seem like a dark place sometimes, there appear to be roughly six happy comments for every angry one.
Dive into the Stack Overflow dataset, or check out some of these awesome posts to get inspired:
- Stack Overflow dataset announcement
- Playing with Stack Overflow on BigQuery
- Always end your questions with a ‘?’
If you have comments or ideas for future analysis, find me on Twitter @SRobTweets.