# An analysis of the Stackoverflow Beta sites

This article was first published on **Why? » R** and kindly contributed to R-bloggers.


In the last six months or so, Stack Overflow, the behemoth of Q&A sites, decided to change tack and launch a number of sites that are not about programming languages. To join the Stack Overflow family, a proposed site first has to spend time gathering followers in Area51. Once it has gained a critical mass, a new StackExchange (SE) site is born.

At present there are around twenty-one SE beta sites. Being rather bored this weekend, I decided to see how similar or different these sites are. For a first pass, I did something rather basic, but useful nonetheless.

First, we need to use the Stack Overflow API to download some summary statistics for each site:

```
library(rjson)

# List of current SO beta sites
sites = c("stats", "math", "programmers", "webapps", "cooking", "gamedev",
          "webmasters", "electronics", "tex", "unix", "photo", "english",
          "cstheory", "ui", "apple", "wordpress", "rpg", "gis", "diy",
          "bicycles", "android")
sites = sort(sites)

# Create empty vectors to store the downloaded data
qs = numeric(length(sites))
votes = numeric(length(sites))
users = numeric(length(sites))
views = numeric(length(sites))

# Go through each site and download summary statistics
for(i in seq_along(sites)){
  stack_url = paste("http://api.", sites[i],
                    ".stackexchange.com/1.0/stats?key=wF07PVY0Mk2hva6r9UZDyA",
                    sep="")
  # The response is gzip-compressed, so wrap the connection in gzcon
  z = gzcon(url(stack_url))
  y = readLines(z)
  sum_stats = fromJSON(paste(y, collapse=""))
  qs[i] = sum_stats$statistics[[1]]$total_questions
  votes[i] = sum_stats$statistics[[1]]$total_votes
  users[i] = sum_stats$statistics[[1]]$total_users
  views[i] = sum_stats$statistics[[1]]$views_per_day
  close(z)
  # Print progress, one site per line
  cat(sites[i], "\n")
}
```

For each of the twenty-one sites, we now have information on the:

- number of questions;
- number of votes;
- number of users;
- number of views per day.

An easy “starter for ten” in terms of analysis is to do a quick principal components analysis (PCA):

```
# Put all the data into a data.frame
df = data.frame(votes, users, views, qs)

# Calculate the PCs
PC.cor = prcomp(df, scale=TRUE)
scores.cor = predict(PC.cor)

# Plot the first two PC scores, labelling each point with its site name
plot(scores.cor[,1], scores.cor[,2],
     xlab="PC 1", ylab="PC 2", pch=NA,
     main="PCA analysis of Beta SO sites")
text(scores.cor[,1], scores.cor[,2], labels=sites)
```

This gives the following plot:

Main features:

- Most sites are similar, with the big exception of programmers and possibly webapps.
- Programmers is different due to its large number of votes: it has twice as many votes as the next highest site.
- webapps (and math) are different due to their large number of questions.
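The PCA above uses `prcomp(df, scale=TRUE)`, which standardises each column before extracting components, so that a variable measured in large counts does not dominate purely because of its scale. As a minimal sketch of what that call does, the following uses made-up numbers (the `toy` data frame is hypothetical, not the downloaded statistics) and checks that the result agrees with an eigen-decomposition of the correlation matrix:

```r
# Hypothetical data: NOT the real site statistics, just an illustration
set.seed(1)
toy = data.frame(votes = rexp(21, 1/5000),
                 users = rexp(21, 1/2000),
                 views = rexp(21, 1/3000),
                 qs    = rexp(21, 1/1000))

# PCA on standardised columns, as in the post
pc = prcomp(toy, scale=TRUE)

# The same quantities obtained from the correlation matrix
e = eigen(cor(toy))

# Variances of the PCs equal the eigenvalues
var_match = all.equal(pc$sdev^2, e$values)

# Loadings agree up to the sign of each column
load_match = all.equal(abs(unname(pc$rotation)), abs(e$vectors))

# Proportion of variance explained by each component
prop_var = pc$sdev^2 / sum(pc$sdev^2)
```

Running `summary(PC.cor)` on the real data would report the same kind of proportions, which helps judge how much of the picture the first two components capture.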

## Some more details

In case anyone is interested, the weightings you get from the PCA are:

```
#PC1 is a simple average
> round(PC.cor$rotation, 2)
       PC1   PC2   PC3   PC4
votes 0.54 -0.31 -0.17 -0.77
users 0.51  0.18  0.83  0.10
views 0.50 -0.53 -0.27  0.63
qs    0.44  0.77 -0.45  0.10
```
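The comment that PC1 is a simple average can be checked from the printed loadings: an exact average of four standardised variables would weight each one by 0.5, so that the weight vector has unit length. A minimal sketch, with the loadings typed in from the table above and `equal_w` a name introduced here only for illustration:

```r
# First-column loadings copied from the rotation matrix above
pc1 = c(votes=0.54, users=0.51, views=0.50, qs=0.44)

# Equal weights with unit length: 1/sqrt(4) = 0.5 each (illustrative name)
equal_w = rep(1/sqrt(length(pc1)), length(pc1))

# Cosine of the angle between the two weight vectors;
# a value of 1 would mean they point in exactly the same direction
cosine = sum(pc1 * equal_w) / sqrt(sum(pc1^2) * sum(equal_w^2))
print(cosine)
```

The cosine comes out very close to 1, so PC1 is, to a good approximation, proportional to the average of the standardised variables.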
