In the last six months or so, Stack Overflow, the behemoth of Q&A sites, decided to change tack and launch a number of sites on topics other than programming. To join the Stack Overflow family, a proposed site first has to spend time gathering followers in Area51. Once a proposal has gained a critical mass, a new StackExchange (SE) site is born.
At present there are around twenty-one SE beta sites. Being rather bored this weekend, I decided to see how similar or different these sites are. For a first pass, I did something rather basic, but useful nonetheless.
First, we use the Stack Exchange API to download some summary statistics for each site:
library(rjson)
#List of current SO beta sites
sites = c("stats", "math", "programmers", "webapps", "cooking",
          "gamedev", "webmasters", "electronics", "tex", "unix",
          "photo", "english", "cstheory", "ui", "apple", "wordpress",
          "rpg", "gis", "diy", "bicycles", "android")
sites = sort(sites)
#Create empty vectors to store the downloaded data
qs = numeric(length(sites)); votes = numeric(length(sites));
users = numeric(length(sites)); views = numeric(length(sites));
#Go through each site and download summary statistics
for(i in seq_along(sites)){
  #Build the stats URL for this site
  stack_url = paste("http://api.",
                    sites[i],
                    ".stackexchange.com/1.0/stats?key=wF07PVY0Mk2hva6r9UZDyA",
                    sep="")
  #The response is gzip-compressed, hence gzcon
  z = gzcon(url(stack_url))
  y = readLines(z)
  sum_stats = fromJSON(paste(y, collapse=""))
  qs[i] = sum_stats$statistics[[1]]$total_questions
  votes[i] = sum_stats$statistics[[1]]$total_votes
  users[i] = sum_stats$statistics[[1]]$total_users
  views[i] = sum_stats$statistics[[1]]$views_per_day
  close(z)
  #Print the site name to show progress
  cat(sites[i], "\n")
}
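The extraction step in the loop depends on the shape of the parsed JSON. As a minimal sketch, assuming a response body like the one below (the numbers are invented for illustration, and a real response would carry more fields), fromJSON returns a nested list whose statistics element holds a single record:

```r
library(rjson)

# Hypothetical response body; the values are made up for illustration
json_txt = paste0('{"statistics": [{"total_questions": 1200, ',
                  '"total_votes": 15000, "total_users": 3400, ',
                  '"views_per_day": 2100.5}]}')

sum_stats = fromJSON(json_txt)

# statistics is a list of length one; [[1]] picks out the record
site_stats = sum_stats$statistics[[1]]
site_stats$total_questions
site_stats$views_per_day
```

This is why the loop indexes with [[1]] before pulling out each field by name.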
For each of the twenty-one sites, we now have information on the:
- number of questions;
- number of votes;
- number of users;
- number of views per day.
An easy “starter for ten” in terms of analysis is a quick principal components analysis (PCA):
#Put all the data into a data.frame
df = data.frame(votes, users, views, qs)
#Calculate the PCs
PC.cor = prcomp(df, scale=TRUE)
scores.cor = predict(PC.cor)
plot(scores.cor[,1], scores.cor[,2],
     xlab="PC 1", ylab="PC 2", pch=NA,
     main="PCA analysis of Beta SO sites")
text(scores.cor[,1], scores.cor[,2], labels=sites)
This gives the following plot:
Main features:
- Most sites are similar, with the big exception of programmers and possibly webapps.
- programmers is different due to its large number of votes. It has twice as many votes as the next-highest site.
- webapps (and math) are different due to the large number of questions.
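When reading a plot of the first two components, it helps to know how much of the total variation they capture. The sketch below shows one way to compute this from a prcomp fit. It uses simulated data as a stand-in (the sim data frame and its numbers are not the real site statistics), so only the method carries over:

```r
# Simulated stand-in for the site statistics (not the real data)
set.seed(1)
n_sites = 21
activity = rexp(n_sites)   # shared "site size" factor
sim = data.frame(votes = activity * 1000 + rnorm(n_sites, sd = 50),
                 users = activity * 300 + rnorm(n_sites, sd = 30),
                 views = activity * 500 + rnorm(n_sites, sd = 40),
                 qs    = activity * 100 + rnorm(n_sites, sd = 20))

pc_sim = prcomp(sim, scale. = TRUE)

# Proportion of total variance captured by each component
var_explained = pc_sim$sdev^2 / sum(pc_sim$sdev^2)
round(var_explained, 2)

# Cumulative proportion; summary(pc_sim) reports the same figures
cumsum(var_explained)
```

Applying the same two lines to PC.cor would show whether plotting only PC 1 against PC 2 is a fair summary of the four variables.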
Some more details
In case anyone is interested, the weightings you get from the PCA are:
#PC1 is a simple average
> round(PC.cor$rotation, 2)
       PC1   PC2   PC3   PC4
votes 0.54 -0.31 -0.17 -0.77
users 0.51  0.18  0.83  0.10
views 0.50 -0.53 -0.27  0.63
qs    0.44  0.77 -0.45  0.10
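To see what these weightings mean, here is a small sketch that applies them by hand. The loadings are copied from the table above (rounded to two decimals); the two "sites" are hypothetical and are given directly as standardised values, that is, standard deviations away from the average site:

```r
# Loadings copied from the table above (rounded to two decimals)
rotation = matrix(c(0.54, -0.31, -0.17, -0.77,
                    0.51,  0.18,  0.83,  0.10,
                    0.50, -0.53, -0.27,  0.63,
                    0.44,  0.77, -0.45,  0.10),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(c("votes", "users", "views", "qs"),
                                  paste0("PC", 1:4)))

# Each column of loadings has (approximately) unit length
round(colSums(rotation^2), 2)

# Hypothetical site: one sd above average on every variable
big_site = c(votes = 1, users = 1, views = 1, qs = 1)
big_scores = big_site %*% rotation

# Hypothetical site: average everywhere except one sd more questions
q_site = c(votes = 0, users = 0, views = 0, qs = 1)
q_scores = q_site %*% rotation

round(rbind(big_scores, q_scores), 2)
```

A site that is larger on every measure moves almost entirely along PC1, which is the sense in which PC1 behaves like a simple average, while extra questions on their own push a site up PC2.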