Hacker News User Base Changed?

July 26, 2010

(This article was first published on R-Chart, and kindly contributed to R-bloggers)

There are lots of references on Hacker News to the idea that the "good old days" are gone and that the character of the site has changed since it started.  The visualization above is based on a sample of users who posted on the site recently.  The data was gathered by iterating over the first 1000 pages and gleaning a list of user names.  Each user's account age was then checked, and the ages are plotted above.

The Chart's Meaning
Note that the chart does not represent the number of posts by a given user; it is just a list of distinct users with their start dates grouped in monthly buckets.  I suppose the shape of the graph makes sense - folks sign up so that they can post, and older users drift away and cease posting at some point.  The chart does indicate that - as of a few days ago - a given user who posted was more likely to be someone who signed up in the last year or two than a veteran.
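
To make the "distinct users in monthly buckets" idea concrete, here is a minimal sketch in base R.  The sample data and the names posts, users, start_month and user_count are made up for illustration; the real data came from the scrape, not from a data frame like this one.

```r
# Hypothetical sample: one row per post, with the poster's account creation date.
posts <- data.frame(
  user = c("alice", "alice", "bob", "carol", "dave", "dave", "dave"),
  hn_created_date = as.Date(c("2007-03-14", "2007-03-14", "2009-11-02",
                              "2009-11-20", "2010-06-05", "2010-06-05",
                              "2010-06-05")),
  stringsAsFactors = FALSE
)

# Keep one row per distinct user, so heavy posters are not over-counted.
users <- unique(posts)

# Snap each creation date to the first day of its month (the monthly bucket).
users$start_month <- as.Date(format(users$hn_created_date, "%Y-%m-01"))

# Count distinct users per bucket.
buckets <- as.data.frame(table(start_month = users$start_month),
                         stringsAsFactors = FALSE)
names(buckets)[2] <- "user_count"
buckets$start_month <- as.Date(buckets$start_month)
print(buckets)
```

Seven posts collapse to four users, and bob and carol land in the same November 2009 bucket even though they signed up on different days.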

Scraping the Data
I used Ruby and Hpricot (still missing you _why) to parse the site and ActiveRecord to store the list of users in a MySQL database.  I use ActiveRecord outside of Rails rather frequently.  It does great, straightforward object-relational mapping - and even an arbitrary query is returned as an object that can be manipulated.

I noticed a few differences using MySQL vs Oracle/RODBC with R.

1)  Oracle/RODBC capitalizes column names in the result set.
2)  Using RMySQL, there is no need to set up an ODBC connection.
3)  RMySQL requires two steps - an execution of the query followed by a deliberate fetch.  Oracle/RODBC does this in a single step.  As a commenter pointed out, there is a function, dbGetQuery, that allows both actions to be taken in a single step.
4)  I use TRUNC in Oracle - but ended up using the EXTRACT function and tacking on a 01 for the first day of the month with MySQL.
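
The two-step versus one-step difference in point 3 can be tried without a MySQL server.  The sketch below assumes the DBI and RSQLite packages are installed and uses an in-memory SQLite database as a stand-in; the users table and its signup_month column are invented for the example, and the same DBI calls should behave the same way against an RMySQL connection.

```r
library(DBI)

# Assumption: RSQLite's in-memory database stands in for MySQL here.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical users table with a pre-bucketed signup month.
dbWriteTable(con, "users", data.frame(
  name = c("alice", "bob", "carol"),
  signup_month = c("200703", "200911", "200911"),
  stringsAsFactors = FALSE
))

sql <- "select signup_month, count(*) as user_count
        from users group by signup_month order by signup_month"

# Two steps: send the query, then fetch the rows and release the result set.
rs <- dbSendQuery(con, sql)
two_step <- dbFetch(rs)
dbClearResult(rs)

# One step: dbGetQuery sends, fetches, and clears in a single call.
one_step <- dbGetQuery(con, sql)

dbDisconnect(con)
```

Both two_step and one_step hold the same data frame; the single call is simply less to type and leaves no result set to clean up.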

Creating the Chart
R speaks for itself:



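library(RMySQL)
library(ggplot2)  # needed for ggplot(), geom_line() and stat_smooth()

drv <- dbDriver("MySQL")
con <- dbConnect(drv, username='xxxx', password='xxxx', dbname='xxxx')

# Buckets by month; the count is aliased so the column has a plain name in R
sql <- 'select extract(YEAR_MONTH from hn_created_date) hn_created_date, count(*) user_count from users group by extract(YEAR_MONTH from hn_created_date);'

# Execute the Query and Fetch the Data (n = -1 asks for all rows)
rs <- dbSendQuery(con, sql)
df <- fetch(rs, n = -1)

# Release the result set and close the connection
dbClearResult(rs)
dbDisconnect(con)

# Set the date to the first of the month (buckets of users by start month)
df$hn_created_date <- as.Date(paste(df$hn_created_date, '01', sep=''), format='%Y%m%d')

# The Actual Plot (columns are referenced by name inside aes(), not as df$...)
p <- ggplot(data=df, aes(hn_created_date, user_count)) + geom_line() + xlab('User Start Date') + ylab('Number of Users Who Posted Recently')
p + stat_smooth()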


