An analysis of the Stackoverflow Beta sites

[This article was first published on Why? » R, and kindly contributed to R-bloggers.]

In the last six months or so, stackoverflow, the behemoth of Q&A sites, changed tack and launched a number of sites outside its programming roots. To launch a site in the stackoverflow family, a proposed site has to spend time gathering followers in Area51. Once it has gained a critical mass, a new StackExchange (SE) site is born.

At present there are twenty-one SE beta sites. Being rather bored this weekend, I decided to see how similar (or different) these sites are. For a first pass, I did something rather basic, but useful nonetheless.

First, we use the stackoverflow API to download some summary statistics for each site:

library(rjson)
#List of current SO beta sites
sites = c("stats", "math","programmers", "webapps", "cooking",
             "gamedev", "webmasters", "electronics", "tex", "unix",
             "photo", "english", "cstheory",  "ui", "apple", "wordpress",
             "rpg", "gis", "diy", "bicycles", "android")
sites = sort(sites)

#Create empty vectors to store the downloaded data
qs = numeric(length(sites)); votes = numeric(length(sites));
users = numeric(length(sites)); views = numeric(length(sites));

#Go through each site and download summary statistics
for(i in 1:length(sites)){
  stack_url=paste("http://api.",
                   sites[i],
                   ".stackexchange.com/1.0/stats?key=wF07PVY0Mk2hva6r9UZDyA",
                   sep="")
  z =  gzcon(url(stack_url))
  y = readLines(z)
  sum_stats = fromJSON(paste(y, collapse=""))
  qs[i] = sum_stats$statistics[[1]]$total_questions
  votes[i] = sum_stats$statistics[[1]]$total_votes
  users[i] = sum_stats$statistics[[1]]$total_users
  views[i] = sum_stats$statistics[[1]]$views_per_day
  close(z)
  cat(sites[i], "\n")  #print progress as each site is downloaded
}

For each of the twenty-one sites, we now have information on the:

  • number of questions;
  • number of votes;
  • number of users;
  • number of views per day.
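
Before doing anything clever, it is worth a quick look at the raw numbers. Something along these lines does the job (raw is just a throw-away helper name):

#Quick look at the raw summary statistics, biggest sites first
raw = data.frame(site=sites, qs, votes, users, views)
raw[order(-raw$qs), ]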

An easy “starter for ten” in terms of analysis is to run some quick principal components (scaling the variables, since they are on very different scales):

#Put all the data into a data.frame
df = data.frame(votes, users, views, qs)

#Calculate the PCs
PC.cor = prcomp(df, scale=TRUE)
scores.cor = predict(PC.cor)

plot(scores.cor[,1], scores.cor[,2],
     xlab="PC 1",ylab="PC 2", pch=NA,
     main="PCA analysis of Beta SO sites")
text(scores.cor[,1], scores.cor[,2], labels=sites)

This gives the following plot:

Main features:

  • Most sites are similar, with the big exception of programmers and possibly webapps.
  • programmers stands out because of its large number of votes; it has twice as many votes as the next highest site.
  • webapps (and math) stand out because of their large number of questions; a quick check of both claims is sketched below.
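
To check these claims against the raw counts, a quick sort is enough (again, just a throw-away snippet):

#Sites ranked by total votes: programmers should be well clear of the rest
head(data.frame(site=sites, votes)[order(-votes), ])

#Sites ranked by total questions: webapps and math should be near the top
head(data.frame(site=sites, qs)[order(-qs), ])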

Some more details

In case anyone is interested, the weightings (loadings) you get from the PCA are:

#PC1 is a simple average
> round(PC.cor$rotation, 2)
        PC1   PC2   PC3   PC4
votes  0.54 -0.31 -0.17 -0.77
users  0.51  0.18  0.83  0.10
views  0.50 -0.53 -0.27  0.63
qs     0.44  0.77 -0.45  0.10
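
Since the PC1 weightings are all positive and roughly equal, the first component is essentially an overall size score, close to a plain average of the four standardised variables. A couple of quick checks along those lines (again just a sketch):

#Proportion of variance explained by each component
summary(PC.cor)

#Correlation between PC1 and a plain average of the scaled variables;
#this should be close to +/- 1 (the sign of a principal component is arbitrary)
cor(scores.cor[, 1], rowMeans(scale(df)))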

