Hacker News Analysis

[This article was first published on Edwin Chen's Blog » r, and kindly contributed to R-bloggers].

I was playing around with the Hacker News database Ronnie Roller made (thanks!), so I thought I’d post some of my findings.

Activity on the Site

My first question was: how has activity on the site increased over time? I looked at the number of posts, the points on posts, and the comments on posts, all aggregated by month.


Hacker News Posts by Month

The number of posts per month looks strongly linear: the monthly count grows by about 292 posts each month.
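A minimal sketch of how such a monthly trend line could be computed. This is not the author's code: the input format (one 'YYYY-MM' string per post) and the function names are assumptions, and the sample data is made up.

```python
from collections import Counter

def posts_per_month(post_months):
    """Count posts per month; post_months is an iterable of 'YYYY-MM' strings.

    Assumption: every month in the range has at least one post, so the
    sorted months form an unbroken sequence.
    """
    counts = Counter(post_months)
    return [counts[m] for m in sorted(counts)]

def fit_line(ys):
    """Ordinary least-squares fit of ys against the month index 0, 1, 2, ..."""
    n = len(ys)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Made-up data: three months with 2, 4, and 6 posts.
months = ["2007-02"] * 2 + ["2007-03"] * 4 + ["2007-04"] * 6
slope, intercept = fit_line(posts_per_month(months))
```

The slope is the quantity quoted above: the estimated number of additional posts per month.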


For comments, adding a quadratic term proved statistically significant, so I used a quadratic regression to fit the number of comments by month.

Hacker News Comments by Month


Again, a quadratic regression was a better fit for points by month:

Hacker News Points by Month
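One way to check whether a quadratic term earns its place is to fit both models and compare their residual sums of squares with an F statistic. The sketch below is an illustration under assumptions, not the author's procedure: the helper names are hypothetical, the fit uses plain normal equations (adequate for short series, numerically fragile for long ones), and the data is invented.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def polyfit(xs, ys, degree):
    """Least-squares polynomial fit; returns (coefs, rss), constant term first."""
    powers = range(degree + 1)
    a = [[sum(x ** (i + j) for x in xs) for j in powers] for i in powers]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in powers]
    coefs = solve(a, b)
    rss = sum(
        (y - sum(c * x ** k for k, c in enumerate(coefs))) ** 2
        for x, y in zip(xs, ys)
    )
    return coefs, rss

def quadratic_f_stat(xs, ys):
    """F statistic for adding the x^2 term to a linear model (1 and n-3 d.o.f.)."""
    _, rss_lin = polyfit(xs, ys, 1)
    _, rss_quad = polyfit(xs, ys, 2)
    dof = len(xs) - 3
    return (rss_lin - rss_quad) / (rss_quad / dof)

# Invented monthly totals that curve upward.
xs = [0, 1, 2, 3, 4, 5]
ys = [1, 0, 5, 8, 17, 24]
f_stat = quadratic_f_stat(xs, ys)
```

A large F statistic, compared against the F distribution with 1 and n - 3 degrees of freedom, indicates that the quadratic term improves the fit by more than chance would explain.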

Points and Comments

My next question was how points and comments are related. Do, say, posts with more points also have more comments?

First, I plotted the points and comments of each individual post:

All Points vs. Comments

There is an overall positive correlation between points and comments (as expected), and interestingly, there are quite a few high-points posts with no comments.
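Both observations can be quantified with a few lines of code. In this sketch the (points, comments) tuple format, the function names, and the 50-point cutoff for "high-points" are all assumptions made for illustration; the data is invented.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def uncommented_high_scorers(posts, min_points=50):
    """Posts with at least min_points points and zero comments.

    min_points=50 is an arbitrary cutoff chosen for this example.
    """
    return [(p, c) for p, c in posts if p >= min_points and c == 0]

# Invented (points, comments) pairs.
posts = [(1, 0), (5, 2), (10, 4), (60, 0), (80, 30)]
r = pearson([p for p, _ in posts], [c for _, c in posts])
silent = uncommented_high_scorers(posts)
```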

Let's clean up the plot by taking the median number of comments at each points level (and removing posts at the higher end, where we have little data):

Points vs. Median Comments

We see that posts with more points do tend to have more comments. The size and color of each marker indicate the variance in the number of comments, so (unsurprisingly) we also see that posts with more points have a larger variance in their number of comments.
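The aggregation behind a plot like this can be sketched as follows. The input format, the function name, and the `max_points` cutoff are assumptions; the source does not say where the high end was trimmed, and the data here is invented.

```python
from collections import defaultdict
from statistics import median, pvariance

def comments_by_points(posts, max_points=None):
    """Group (points, comments) pairs by points level.

    Returns {points: (median_comments, variance_of_comments, n_posts)}.
    Posts above max_points are dropped; the cutoff is a free parameter here.
    """
    groups = defaultdict(list)
    for points, comments in posts:
        if max_points is None or points <= max_points:
            groups[points].append(comments)
    return {
        p: (median(c), pvariance(c), len(c))
        for p, c in sorted(groups.items())
    }

# Invented data; the 50-point post is trimmed by the cutoff.
posts = [(1, 0), (1, 2), (1, 4), (2, 3), (2, 5), (50, 10)]
summary = comments_by_points(posts, max_points=10)
```

Using the median rather than the mean keeps a handful of heavily discussed posts from dominating each points level.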

Quality of Posts

Another question was whether the quality of posts has degraded over time.

To estimate quality, I defined a “good” post as one with more than x / 10 points, where x is the number of points of the tenth-highest rated post in the same month. I chose the tenth-highest rated post because it provided a fairly stable baseline (unlike the highest rated post):

Points of Tenth Rated Posts

The overall percentage of quality posts has decreased over time:

Percent of Quality Posts

The absolute number of quality posts, however, has increased:

Number of Quality Posts
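The quality definition above translates directly into code. This is a sketch under assumptions rather than the author's implementation: the (month, points) input format and the function name are hypothetical, months with fewer than ten posts are simply skipped, and the data is invented.

```python
from collections import defaultdict

def quality_by_month(posts, rank=10, divisor=10):
    """posts: iterable of (month, points) pairs.

    A post is "good" if its points exceed x / divisor, where x is the
    points of the rank-th highest rated post in the same month.
    Returns {month: (n_good, percent_good)}.
    """
    by_month = defaultdict(list)
    for month, points in posts:
        by_month[month].append(points)
    result = {}
    for month, pts in sorted(by_month.items()):
        if len(pts) < rank:
            continue  # assumption: skip months with too few posts for a baseline
        baseline = sorted(pts, reverse=True)[rank - 1]
        threshold = baseline / divisor
        n_good = sum(p > threshold for p in pts)
        result[month] = (n_good, 100.0 * n_good / len(pts))
    return result

# Invented month: the tenth-highest post has 20 points, so the threshold is 2.
points = [100, 90, 80, 70, 60, 50, 40, 30, 25, 20, 3, 2] + [1] * 8
quality = quality_by_month(("2008-01", p) for p in points)
```

Because the threshold scales with each month's tenth-highest post, the percentage and the absolute count can move in opposite directions, as they do in the two plots above.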

So Hacker News has probably gotten worse if you like to read every single post, but better if you only like to read the front page.

Company Trends

Also, I wanted to see how certain topics have trended over time, so I looked at how mentions of some of the big-name companies (Google, Facebook, Microsoft, Yahoo, Twitter) have changed. For each company, I plotted the percentage of posts with the company’s name in the title, and also made a smoothed plot comparing all five at the end. Note that Microsoft, Yahoo, and Google all seem to be trending slightly downward.

Mentions of Microsoft

Mentions of Yahoo

Mentions of Google

Mentions of Facebook

Mentions of Twitter

All Trends
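A rough sketch of how the mention percentages could be computed. The matching rule (case-insensitive, whole word), the (month, title) input format, and the function names are assumptions. The centered moving average is a simple stand-in for whatever smoother the original plots used, which the source does not specify. The titles are invented.

```python
import re
from collections import defaultdict

COMPANIES = ["Google", "Facebook", "Microsoft", "Yahoo", "Twitter"]

def mention_percentages(posts, companies=COMPANIES):
    """posts: iterable of (month, title). Returns {company: {month: percent}}."""
    totals = defaultdict(int)
    hits = {c: defaultdict(int) for c in companies}
    patterns = {
        c: re.compile(r"\b" + re.escape(c) + r"\b", re.IGNORECASE)
        for c in companies
    }
    for month, title in posts:
        totals[month] += 1
        for company, pattern in patterns.items():
            if pattern.search(title):
                hits[company][month] += 1
    return {
        c: {m: 100.0 * hits[c][m] / totals[m] for m in sorted(totals)}
        for c in companies
    }

def moving_average(values, window=3):
    """Centered moving average; the window shrinks at the ends of the series."""
    half = window // 2
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out

# Invented titles.
posts = [
    ("2009-01", "Google launches X"),
    ("2009-01", "Ask HN: hi"),
    ("2009-02", "Why Twitter is down"),
    ("2009-02", "google's new API"),
    ("2009-02", "Show HN"),
    ("2009-02", "Yahoo!"),
]
pct = mention_percentages(posts)
smoothed = moving_average([0, 3, 6, 9])
```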
