An analysis of the Stackoverflow Beta sites

November 1, 2010

(This article was first published on Why? » R, and kindly contributed to R-bloggers)

In the last six months or so, stackoverflow, the behemoth of Q&A sites, has changed tack and launched a number of sites that are not about programming languages. To join the stackoverflow family, a proposed site first has to spend time gathering followers in Area51. Once it has gained a critical mass, a new StackExchange (SE) site is born.

At present there are around twenty-one SE beta sites. Being rather bored this weekend, I decided to see how similar or different these sites are. For a first pass, I did something rather basic, but useful nonetheless.

First, we need to use the stackoverflow API to download some summary statistics for each site:

library(rjson)
#List of current SO beta sites
sites = c("stats", "math","programmers", "webapps", "cooking",
             "gamedev", "webmasters", "electronics", "tex", "unix",
             "photo", "english", "cstheory",  "ui", "apple", "wordpress",
             "rpg", "gis", "diy", "bicycles", "android")
sites = sort(sites)

#Create empty vectors to store the downloaded data
qs = numeric(length(sites)); votes = numeric(length(sites));
users = numeric(length(sites)); views = numeric(length(sites));

#Go through each site and download summary statistics
for(i in 1:length(sites)){
  stack_url=paste("http://api.",
                   sites[i],
                   ".stackexchange.com/1.0/stats?key=wF07PVY0Mk2hva6r9UZDyA",
                   sep="")
  z =  gzcon(url(stack_url))
  y = readLines(z)
  sum_stats = fromJSON(paste(y, collapse=""))
  qs[i] = sum_stats$statistics[[1]]$total_questions
  votes[i] = sum_stats$statistics[[1]]$total_votes
  users[i] = sum_stats$statistics[[1]]$total_users
  views[i] = sum_stats$statistics[[1]]$views_per_day
  close(z)
  #Print the site name as a progress indicator
  cat(sites[i],"\n")
}
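The extraction step in the loop can be tried without touching the network. The sketch below is only an illustration: `fake_stats` is a hand-built list with made-up numbers that mimics the shape of `sum_stats` as used above, and `extract_stats` is a hypothetical helper, not part of rjson or the API.

```r
# Hypothetical stand-in for the list that fromJSON() returns for one site.
# The numbers are made up for illustration only.
fake_stats = list(statistics = list(list(
  total_questions = 1200, total_votes = 15000,
  total_users = 3400, views_per_day = 2100.5)))

# The four fields used in the article
fields = c("total_questions", "total_votes", "total_users", "views_per_day")

# Hypothetical helper: pull the four fields out of one parsed response
extract_stats = function(parsed) {
  s = parsed$statistics[[1]]
  vapply(fields, function(f) as.numeric(s[[f]]), numeric(1))
}

one_site = extract_stats(fake_stats)

# One row per site: bind the named vectors together, then make a data.frame
parsed_list = list(stats = fake_stats, math = fake_stats)
all_sites = as.data.frame(do.call(rbind, lapply(parsed_list, extract_stats)))
```

Collecting one named vector per site and binding them at the end avoids keeping four separate vectors in step with `sites`, at the cost of holding every parsed response in memory.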

For each of the twenty-one sites, we now have information on the:

  • number of questions;
  • number of votes;
  • number of users;
  • number of views.

An easy “starter for ten” in terms of analysis is to do some quick principal components:

#Put all the data into a data.frame
df = data.frame(votes, users, views, qs)

#Calculate the PCs
PC.cor = prcomp(df, scale=TRUE)
scores.cor = predict(PC.cor)

plot(scores.cor[,1], scores.cor[,2],
     xlab="PC 1",ylab="PC 2", pch=NA,
     main="PCA of the beta SE sites")
text(scores.cor[,1], scores.cor[,2], labels=sites)
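Because `scale=TRUE` standardises each column first, this PCA is effectively an eigen decomposition of the correlation matrix. A minimal sketch of that equivalence follows; the data frame `toy` is simulated and is not the downloaded site statistics.

```r
set.seed(1)
# Made-up counts standing in for the downloaded site statistics
toy = data.frame(votes = rpois(21, 5000), users = rpois(21, 1500),
                 views = rpois(21, 2000), qs = rpois(21, 800))

pc = prcomp(toy, scale = TRUE)

# The component variances match the eigenvalues of the correlation matrix
ev = eigen(cor(toy))
var_prcomp = pc$sdev^2
var_eigen = ev$values

# Proportion of the total variance explained by each component
prop_var = var_prcomp / sum(var_prcomp)

# The scores are the standardised data multiplied by the loadings
manual_scores = scale(toy) %*% pc$rotation
```

Looking at `prop_var` is a quick way to check how much of the variation a plot of the first two components actually captures.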

This gives the following plot:

Main features:

  • Most sites are similar, with the big exception of programmers and possibly webapps.
  • programmers is different due to its large number of votes. It has twice as many votes as the next highest site.
  • webapps (and math) are different due to their large number of questions.

Some more details

In case anyone is interested, the weightings you get from the PCA are:

#PC1 is roughly a simple average
> round(PC.cor$rotation, 2)
         PC1     PC2     PC3     PC4
votes   0.54   -0.31   -0.17   -0.77
users   0.51    0.18    0.83    0.10
views   0.50   -0.53   -0.27    0.63
qs      0.44    0.77   -0.45    0.10
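One way to check the "simple average" reading of PC1 is to compare its weights with an equal-weight vector. The sketch below retypes the rounded loadings from the table above; `equal_w` and `cos_sim` are names introduced here for illustration.

```r
# Loadings copied (rounded to 2 d.p.) from the rotation matrix above
rotation = matrix(c(0.54, -0.31, -0.17, -0.77,
                    0.51,  0.18,  0.83,  0.10,
                    0.50, -0.53, -0.27,  0.63,
                    0.44,  0.77, -0.45,  0.10),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(c("votes", "users", "views", "qs"),
                                  paste0("PC", 1:4)))

# A true simple average of four standardised variables has equal weights
equal_w = rep(0.5, 4)
pc1 = rotation[, "PC1"]

# Cosine similarity between PC1 and the equal-weight direction
cos_sim = sum(pc1 * equal_w) / sqrt(sum(pc1^2) * sum(equal_w^2))

# The loadings should be close to orthonormal, up to rounding error
gram = t(rotation) %*% rotation
```

A cosine similarity close to 1 means PC1 points in almost the same direction as the equal-weight average, so a site's PC1 score is roughly its overall size across all four measures.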

