The 1000 most-visited sites analyzed using R

June 5, 2010
(This article was first published on R-Chart, and kindly contributed to R-bloggers)

Ever wondered which Computers & Electronics web sites get the most page views? Based upon data recently published by Google, you can find out with a few lines of R. The program to create a bar chart of these sites is as follows:

library(XML)

# URL for the Google Data
u="http://www.google.com/adplanner/static/top1000/"
tables = readHTMLTable(u)
l=tables[[2]]   # the rankings are in the second table on the page

# Name the columns
colnames(l)=c('Rank','Site','Category','Users','Reach','Views','Advertising?')

# Extract Computers and Electronics Subset
CandE=l[l$Category=='Computers & Electronics', c('Site','Views')]
rownames(CandE)=CandE[,1]
CandE=CandE[-1]
CandE$Views=as.numeric(gsub(',','',CandE$Views))
# Sort the sites by page views (keeping row names aligned with the values)
CandE=CandE[order(CandE$Views, decreasing=TRUE), , drop=FALSE]
par(las=2, mar=c(12, 10, 1, 2) + 0.1)
barplot(t(as.matrix(CandE)),yaxt = "n", ylab = "", main="Top Computers & Electronics Sites", col="orange")
# axis() has no big.mark argument, so format the tick labels explicitly
ticks=axTicks(2)
axis(side = 2, at = ticks, labels = formatC(ticks, format = "d", big.mark = ","))


Discussion and Interactive Commands...

Google recently posted the 1000 most-visited sites on the web. The data is displayed in tabular format, but is a bit unwieldy to view interactively. Sounds like a candidate for some analysis using R and the XML package! I was very impressed by how easy it was to scrape the relevant data and produce meaningful summaries with only a few lines of code.

library(XML)
u="http://www.google.com/adplanner/static/top1000/"
tables = readHTMLTable(u)
l=tables[[2]]
colnames(l)=c('Rank','Site','Category','Users','Reach','Views','Advertising?')

These few lines of code read in the data available at the site and assign column names. To see a sample of the data we now have available:

head(l)

Both numerical and categorical data are available in the table. For example, each site is categorized by whether or not it includes advertising.

summary(l$`Advertising?`)

68% of the sites listed do advertise:

# Percentage of sites that advertise (the column holds "Yes"/"No" values)
ad=summary(l$`Advertising?`)
ad["Yes"] / sum(ad) * 100

Plotting subsets of the data is probably the best way to go - but to start, I wanted to see which categories of sites appear most frequently on the list.

# Horizontal bar chart of category frequencies; note that summary() on a
# factor collapses the least frequent levels into "(Other)" past its default maxsum
par(las=2, mar=c(4, 12, 1, 2) + 0.1)
barplot(sort(summary(l$Category), decreasing=FALSE), horiz=TRUE, cex.names=0.6)



I realize that this is impossible to read here; you will need to run it yourself. There are 216 unique categories.

length(unique(l$Category))

The Top 10 are:
  1. Other
  2. Web Portals
  3. Social Networks
  4. Online Games
  5. File Sharing & Hosting
  6. Newspapers
  7. Blogging Resources & Services
  8. News & Current Events
  9. Computers & Electronics
  10. Search Engines
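
This list can be reproduced directly from the scraped table (a quick sketch, assuming the l data frame built above):

# Count the sites in each category and show the ten most frequent
head(sort(table(l$Category), decreasing=TRUE), 10)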
I'll leave subsequent analysis to you. To limit yourself to specific columns:

l[,c('Site','Users','Views','Reach')]

You will probably want to begin munging the numeric data; the string values contain commas and percent signs that must be stripped before they can be interpreted as numbers.

l$Views=as.numeric(gsub(',','',l$Views))
l$Reach=as.numeric(gsub('%','',l$Reach))
l$Users=as.numeric(gsub(',','',l$Users))

Having done so, you will be able to plot them...

plot(l$Reach)
plot(l$Users)
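
Going one step further (just a sketch of one possible direction, not from the original post), the two numeric columns can be plotted against each other:

# Scatter plot of audience size against reach for the top 1000 sites
plot(l$Users, l$Reach, xlab="Unique Users", ylab="Reach (%)",
     main="Reach vs. Unique Users")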

It might be more interesting to focus on sites in a particular category. For example, if your niche is Cooking and Recipes:

l[l$Category=='Cooking & Recipes',c('Site','Users','Views','Reach')]

And for sites dedicated to the Java programming language:

l[l$Category=='Java', c('Site','Users','Views','Reach')]
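
And as one more sketch (assuming the numeric conversions above have been applied), a subset can be ordered by page views:

# Java-related sites sorted by page views, highest first
java=l[l$Category=='Java', c('Site','Users','Views','Reach')]
java[order(java$Views, decreasing=TRUE),]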

I'd love to see your ideas for analyzing this data in the comments...it's a great opportunity to show off your analytical and R skills!
