An analysis of the Stackoverflow Beta sites

Posted on November 1, 2010 by csgillespie in R bloggers

[This article was first published on Why? » R, and kindly contributed to R-bloggers].

In the last six months or so, the behemoth of Q&A sites, stackoverflow, decided to change tack and launch a number of other sites that are not about computing languages. To join the stackoverflow family, a proposed site first has to spend time gathering followers in Area51. Once it has gained a critical mass, a new StackExchange (SE) site is born.

At present there are around twenty-one SE beta sites. Being rather bored this weekend, I decided to see how similar or different these sites are. For a first pass, I did something rather basic, but useful nonetheless.

First, we need to use the stackoverflow API to download some summary statistics for each site:

library(rjson)
#List of current SO beta sites
sites = c("stats", "math","programmers", "webapps", "cooking",
             "gamedev", "webmasters", "electronics", "tex", "unix",
             "photo", "english", "cstheory",  "ui", "apple", "wordpress",
             "rpg", "gis", "diy", "bicycles", "android")
sites = sort(sites)

#Create empty vectors to store the downloaded data
qs = numeric(length(sites)); votes = numeric(length(sites));
users = numeric(length(sites)); views = numeric(length(sites));

#Go through each site and download summary statistics
for(i in 1:length(sites)){
  stack_url=paste("http://api.",
                   sites[i],
                   ".stackexchange.com/1.0/stats?key=wF07PVY0Mk2hva6r9UZDyA",
                   sep="")
  z =  gzcon(url(stack_url))
  y = readLines(z)
  sum_stats = fromJSON(paste(y, collapse=""))
  qs[i] = sum_stats$statistics[[1]]$total_questions
  votes[i] = sum_stats$statistics[[1]]$total_votes
  users[i] = sum_stats$statistics[[1]]$total_users
  views[i] = sum_stats$statistics[[1]]$views_per_day
  close(z)
  #Print the site name as a progress indicator
  cat(sites[i], "\n")
}
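
The loop above relies on each parsed response being a list whose `statistics` element holds a single record. As a rough sketch of that extraction step on its own, the snippet below uses hand-made lists in place of real `fromJSON()` output. The helper names `make_fake_response` and `extract_stats`, the site labels and all the numbers are invented for illustration; only the four field names come from the code above.

```r
# Hypothetical stand-in for what fromJSON() returns; the numbers are invented.
make_fake_response = function(q, v, u, w) {
  list(statistics = list(
    list(total_questions = q, total_votes = v,
         total_users = u, views_per_day = w)
  ))
}

# Pull the four fields used in the post into a named numeric vector.
extract_stats = function(resp) {
  s = resp$statistics[[1]]
  c(qs = s$total_questions, votes = s$total_votes,
    users = s$total_users, views = s$views_per_day)
}

# Two made-up "sites"; real data would come from the API calls.
responses = list(siteA = make_fake_response(1200, 15000, 3400, 2100.5),
                 siteB = make_fake_response(800, 9000, 2500, 1300))

# One row per site, one column per statistic.
stats_df = as.data.frame(do.call(rbind, lapply(responses, extract_stats)))
print(stats_df)
```

Collecting one row per site like this avoids keeping four parallel vectors in step, though the pre-allocated vectors in the original loop work just as well for twenty-one sites.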

For each of the twenty-one sites, we now have information on the:

number of questions;
number of votes;
number of users;
number of views.

An easy “starter for ten” in terms of analysis is to do some quick principal components:

#Put all the data into a data.frame
df = data.frame(votes, users, views, qs)

#Calculate the PCs
PC.cor = prcomp(df, scale. = TRUE)
scores.cor = predict(PC.cor)

plot(scores.cor[,1], scores.cor[,2],
     xlab="PC 1",ylab="PC 2", pch=NA,
     main="PCA analysis of Beta SO sites")
text(scores.cor[,1], scores.cor[,2], labels=sites)

This gives the following plot:

[Figure: the beta SO sites plotted by their scores on PC 1 and PC 2, labelled by site name]

Main features:

Most sites are similar, with the big exception of programmers and possibly webapps.
programmers is different due to its large number of votes. It has twice as many votes as the next-highest site.
webapps (and math) are different due to their large number of questions.
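
For readers wondering what the scaled `prcomp` call is doing under the hood: with scaling switched on, it amounts to a PCA of the correlation matrix. The sketch below checks this on made-up data (the `toy` data frame is invented and has nothing to do with the real site statistics), comparing `prcomp` against a direct eigen-decomposition of `cor(toy)`.

```r
set.seed(1)

# Made-up data: 21 rows with four correlated, count-like columns.
base = rexp(21, rate = 1 / 1000)
toy = data.frame(votes = 5 * base + rnorm(21, sd = 200),
                 users = 2 * base + rnorm(21, sd = 100),
                 views = 3 * base + rnorm(21, sd = 150),
                 qs    = base + rnorm(21, sd = 50))

pc  = prcomp(toy, scale. = TRUE)
eig = eigen(cor(toy))

# PC variances match the eigenvalues of the correlation matrix.
same_var = isTRUE(all.equal(pc$sdev^2, eig$values))

# Loadings match the eigenvectors, up to the sign of each column.
same_load = isTRUE(all.equal(abs(unname(pc$rotation)), abs(eig$vectors)))

# Scores are the standardised data multiplied by the loadings.
same_scores = isTRUE(all.equal(unname(scale(toy) %*% pc$rotation),
                               unname(pc$x)))

print(c(same_var, same_load, same_scores))
```

Because the columns are standardised first, a variable with large raw counts (such as votes) cannot dominate simply because of its units.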

Some more details

In case anyone is interested, the weightings you get from the PCA are:

#PC1 is a simple average
> round(PC.cor$rotation, 2)
       PC1   PC2   PC3   PC4
votes 0.54 -0.31 -0.17 -0.77
users 0.51  0.18  0.83  0.10
views 0.50 -0.53 -0.27  0.63
qs    0.44  0.77 -0.45  0.10
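
One way to put a number on the "simple average" remark is to compare the PC1 loadings with a vector that weights all four variables equally. The snippet below is a small sketch using the rounded loadings reported above. The cosine similarity is my choice of measure here, not something taken from the original analysis.

```r
# PC1 loadings as reported in the post (rounded to two decimals).
pc1 = c(votes = 0.54, users = 0.51, views = 0.50, qs = 0.44)

# Unit-length vector that weights all four variables equally.
equal_w = rep(1 / sqrt(length(pc1)), length(pc1))

# Cosine of the angle between the two directions; 1 means identical.
cosine = sum(pc1 * equal_w) / sqrt(sum(pc1^2))
print(round(cosine, 3))
```

A value very close to 1 says that the PC1 score is, to a good approximation, proportional to the plain average of the four standardised variables.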