# The 1000 most-visited sites analyzed using R

June 5, 2010

(This article was first published on R-Chart, and kindly contributed to R-bloggers)

The R program to create this graph is as follows:
library(XML)

# URL for the Google Data
l = tables[[2]]

# Name the columns

# Extract Computers and Electronics Subset
CandE = l[l$Category == 'Computers & Electronics', c('Site', 'Views')]
rownames(CandE) = CandE[,1]
CandE = CandE[-1]
CandE$Views = as.numeric(gsub(',', '', CandE$Views))
# Sort by views; drop = FALSE keeps the one-column data frame and its row names
CandE = CandE[order(CandE$Views, decreasing = TRUE), , drop = FALSE]
par(las = 2, mar = c(12, 10, 1, 2) + 0.1)
barplot(t(as.matrix(CandE)), yaxt = "n", ylab = "", main = "Top Computers & Electronics Sites", col = "orange")
# Label the y axis with plain, comma-separated numbers
axis(side = 2, at = axTicks(2), labels = format(axTicks(2), scientific = FALSE, big.mark = ","))
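Because the listing above depends on the scraped table, here is a minimal self-contained sketch of the same sort-and-plot steps. The data frame, its site names and its view counts are made-up placeholders, not values from Google's list.

```r
# Hypothetical stand-in for the scraped table; sites and view counts are made up.
l <- data.frame(
  Site     = c("example-a.com", "example-b.com", "example-c.com", "example-d.com"),
  Category = c("Computers & Electronics", "Web Portals",
               "Computers & Electronics", "Computers & Electronics"),
  Views    = c("1,200,000", "9,800,000", "3,400,000", "560,000"),
  stringsAsFactors = FALSE
)

# Keep one category, convert the comma-formatted strings, and sort by views.
CandE <- l[l$Category == "Computers & Electronics", c("Site", "Views")]
CandE$Views <- as.numeric(gsub(",", "", CandE$Views))
CandE <- CandE[order(CandE$Views, decreasing = TRUE), ]

# A named numeric vector keeps each bar attached to its site name.
views <- setNames(CandE$Views, CandE$Site)

par(las = 2, mar = c(12, 10, 1, 2) + 0.1)
barplot(views, yaxt = "n", ylab = "", col = "orange",
        main = "Top Computers & Electronics Sites")

# Draw the y axis by hand so the labels use commas instead of scientific notation.
ticks <- axTicks(2)
axis(side = 2, at = ticks,
     labels = format(ticks, scientific = FALSE, big.mark = ","))
```

Sorting whole rows, rather than a single column, is what keeps each site name paired with its own view count.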
## Discussion and Interactive Commands
Google recently posted the 1000 most-visited sites on the web. The data is displayed in tabular format, but is a bit unwieldy to view interactively. Sounds like a candidate for some analysis using R and the XML package! I was very impressed by how easy it is to scrape the relevant data and produce meaningful summaries with only a few lines of code.
library(XML)
l=tables[[2]]
These few lines of code read in the data available at the site and assign column names. To see a sample of the data we now have available:
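As a rough sketch of the scrape-and-inspect step, the following parses a small inline HTML table rather than the live page. The HTML string, the column names and the variable `sample_tbl` are assumptions made for illustration; only `htmlParse()` and `readHTMLTable()` come from the XML package.

```r
library(XML)

# Hypothetical page content standing in for the real site.
html <- "<html><body><table>
  <tr><th>Rank</th><th>Site</th><th>Views</th></tr>
  <tr><td>1</td><td>example-a.com</td><td>9,800,000</td></tr>
  <tr><td>2</td><td>example-b.com</td><td>3,400,000</td></tr>
</table></body></html>"

# Parse the text, then pull every <table> on the page into a list of data frames.
doc    <- htmlParse(html, asText = TRUE)
tables <- readHTMLTable(doc, header = TRUE, stringsAsFactors = FALSE)

# Pick the table of interest and give its columns explicit names.
sample_tbl <- tables[[1]]
names(sample_tbl) <- c("Rank", "Site", "Views")

# Show the first few rows.
head(sample_tbl)
```

On a real page the index into `tables` depends on how many tables the page contains, so it is worth checking `length(tables)` first.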
Both numerical and categorical data are available in the table. For example, each site is categorized by whether or not it includes advertising.
68% of the sites listed do advertise:
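One way to get such a percentage is with `table()` and `prop.table()`. The sketch below assumes a column named `Advertising` holding "Yes"/"No" values; both the column name and the toy data are assumptions, so the result will not match the 68% figure.

```r
# Hypothetical stand-in; the column name "Advertising" and its values are assumptions.
l <- data.frame(
  Site        = c("example-a.com", "example-b.com", "example-c.com", "example-d.com"),
  Advertising = c("Yes", "No", "Yes", "Yes"),
  stringsAsFactors = FALSE
)

# Counts per level, then the same counts expressed as percentages.
ad_counts <- table(l$Advertising)
ad_share  <- round(100 * prop.table(ad_counts), 1)
ad_share
```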
Plotting subsets of the data is probably the best way to go – but to start, I wanted to see which categories of sites appear most frequently on the list.
par(las=2, mar=c(4, 12, 1, 2) + 0.1)
# table() counts sites per category whether Category is a factor or a character vector
barplot(sort(table(l$Category), decreasing=FALSE), horiz=TRUE, cex.names=0.6)

I realize that this is impossible to read – you will need to run it yourself. There are 216 unique categories.

length(unique(l$Category))
The Top 10 are:
1. Other
2. Web Portals
3. Social Networks
4. Online Games
5. File Sharing & Hosting
6. Newspapers
7. Blogging Resources & Services
8. News & Current Events
9. Computers & Electronics
10. Search Engines
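A ranking like this can be produced by counting categories and keeping the most frequent ones. This is a sketch on a made-up category vector, not the code that produced the list above.

```r
# Hypothetical category labels; the real table has one entry per site.
categories <- c("Other", "Web Portals", "Other", "Social Networks",
                "Web Portals", "Other", "Search Engines")

# Count each category, sort from most to least frequent, keep at most ten.
counts <- sort(table(categories), decreasing = TRUE)
top    <- head(counts, 10)
names(top)
```

With the scraped data, `categories` would simply be `l$Category`.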
I’ll leave subsequent analysis to you. To limit yourself to specific columns:
l[,c('Site','Users','Views','Reach')]

You will probably want to start munging the numeric data, which first requires stripping the commas and percent signs so that the string values can be interpreted as numbers.

l$Views=as.numeric(gsub(',','',l$Views))
l$Reach=as.numeric(gsub('%','',l$Reach))
l$Users=as.numeric(gsub(',','',l$Users))
Having done so, you will be able to plot them…
plot(l$Reach)
plot(l$Users)
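The cleaning step can also be wrapped in a small helper and applied to several columns at once. The sketch below uses a made-up two-row data frame, and the helper name `to_number` is hypothetical.

```r
# Hypothetical stand-in for the scraped table; all values are made up.
l <- data.frame(
  Site  = c("example-a.com", "example-b.com"),
  Users = c("12,000,000", "3,500,000"),
  Reach = c("1.2%", "0.4%"),
  Views = c("98,000,000", "7,100,000"),
  stringsAsFactors = FALSE
)

# Helper (hypothetical name): drop commas and percent signs, then convert to numeric.
to_number <- function(x) as.numeric(gsub("[,%]", "", x))

# Apply the helper to every column that holds formatted numbers.
num_cols <- c("Users", "Reach", "Views")
l[num_cols] <- lapply(l[num_cols], to_number)

# Confirm the column types, then plot one numeric column against another.
str(l)
plot(l$Users, l$Views, xlab = "Users", ylab = "Views")
```

Without the `as.numeric()` conversion the columns stay character strings, and the plots are not meaningful.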
It might be more interesting to focus on sites in a particular category. For example, if your niche is Cooking and Recipes:
l[l$Category=='Cooking & Recipes',c('Site','Users','Views','Reach')]
And for sites dedicated to the Java programming language:
l[l$Category=='Java', c('Site','Users','Views','Reach')]
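To look at several categories at once, `%in%` can replace the `==` comparison. This is a sketch on made-up rows; the site names and user counts are placeholders.

```r
# Hypothetical stand-in for the cleaned table; all values are made up.
l <- data.frame(
  Site     = c("example-a.com", "example-b.com", "example-c.com", "example-d.com"),
  Category = c("Cooking & Recipes", "Java", "Cooking & Recipes", "Web Portals"),
  Users    = c(2100000, 900000, 4300000, 8800000),
  stringsAsFactors = FALSE
)

# Keep rows whose category is in the wanted set, then rank them by users.
wanted <- c("Cooking & Recipes", "Java")
niche  <- l[l$Category %in% wanted, c("Site", "Category", "Users")]
niche  <- niche[order(niche$Users, decreasing = TRUE), ]
niche
```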

I’d love to see your ideas for analyzing this data in the comments…it’s a great opportunity to show off your analytical and R skills!
