The 1000 most-visited sites analyzed using R

[This article was first published on R-Chart, and kindly contributed to R-bloggers.]

Ever wondered which Computers & Electronics web sites get the most page views? The graph here is based upon data recently published by Google:

The R program to create this graph is as follows:

library(XML)

# URL for the Google Data
u="http://www.google.com/adplanner/static/top1000/"
tables = readHTMLTable(u)
l=tables[[2]]

# Name the columns
colnames(l)=c('Rank','Site','Category','Users','Reach','Views','Advertising?')

# Extract Computers and Electronics Subset
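CandE=l[l$Category=='Computers & Electronics', c('Site','Views')]
rownames(CandE)=CandE[,1]
CandE=CandE[-1]
# Strip thousands separators and convert Views to numbers
CandE$Views=as.numeric(gsub(',','',CandE$Views))
# Sort by views; drop=FALSE keeps the one-column data frame and its row names together
CandE=CandE[order(CandE$Views,decreasing=TRUE), , drop=FALSE]
# Plot without the default y axis, then add one with comma-formatted labels
par(las=2, mar=c(12, 10, 1, 2) + 0.1)
barplot(t(as.matrix(CandE)),yaxt = "n", ylab = "", main="Top Computers & Electronics Sites", col="orange")
ticks=axTicks(2)
axis(side = 2, at = ticks, labels = format(ticks, scientific = FALSE, big.mark = ","))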
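The sort step in the program above deserves care: row-indexing a one-column data frame such as CandE returns a bare vector unless drop = FALSE is given, and the values can then end up separated from their row names. A minimal sketch with made-up site names and view counts (none of them from the Google table):

```r
# Toy data with made-up page-view counts (not from the Google table)
views <- data.frame(Views = c(150, 900, 420),
                    row.names = c("a.example", "b.example", "c.example"))

# Default behaviour: a one-column data frame collapses to a plain vector
dropped <- views[order(views$Views, decreasing = TRUE), ]

# With drop = FALSE the result stays a data frame, so row names move with the values
sorted <- views[order(views$Views, decreasing = TRUE), , drop = FALSE]

print(is.data.frame(dropped))
print(sorted)
```

Since barplot() takes its bar labels from the row names here, keeping the data frame intact is what keeps each bar paired with the right site.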


Discussion and Interactive Commands…

Google recently posted the 1000 most-visited sites on the web. The data is displayed in tabular format but is a bit unwieldy to view interactively. Sounds like a candidate for some analysis using R and the XML package! I was very impressed by how easy it is to scrape the relevant data and produce meaningful summaries with only a few lines of code.

library(XML)
u="http://www.google.com/adplanner/static/top1000/"
tables = readHTMLTable(u)
l=tables[[2]]
colnames(l)=c('Rank','Site','Category','Users','Reach','Views','Advertising?')

These few lines of code read in the data available at the site and assign column names. To see a sample of the data we now have available:

head(l)

Both numerical and categorical data are available in the table. For example, each site is categorized by whether or not it includes advertising.

summary(l$`Advertising?`)

The ratio between the two groups comes out at 68% (the first count divided by the second, not a share of all 1000 sites):

ad=summary(l$`Advertising?`)
(ad[1] / ad[2]) * 100
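The calculation above divides one count by the other. If the goal is instead the share of all sites in each group, prop.table() is one way to get it. A small sketch with made-up flags; the level names "Yes" and "No" are assumptions, and the real column's levels may differ:

```r
# Made-up yes/no flags standing in for the advertising column
advertising <- factor(c("Yes", "No", "Yes", "Yes", "No"))

counts <- table(advertising)                      # number of rows per level
shares <- prop.table(counts) * 100                # each level as a percentage of all rows
ratio  <- counts[["No"]] / counts[["Yes"]] * 100  # one group relative to the other

print(shares)
print(ratio)
```

The two numbers answer different questions, so it helps to be explicit about which one is being reported.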

Plotting subsets of the data is probably the best way to go – but to start, I wanted to see which categories of sites appear most frequently on the list.

par(las=2, mar=c(4, 12, 1, 2) + 0.1)
barplot(sort(summary(l$Category), decreasing=FALSE), horiz=TRUE, cex.names=0.6)



I realize that this is impossible to read, so you will need to run it yourself. There are 216 unique categories.

length(unique(l$Category))

The Top 10 are:
  1. Other
  2. Web Portals
  3. Social Networks
  4. Online Games
  5. File Sharing & Hosting
  6. Newspapers
  7. Blogging Resources & Services
  8. News & Current Events
  9. Computers & Electronics
  10. Search Engines
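A ranking like the one above can be produced by counting, sorting, and truncating. This is only a sketch on made-up labels, not necessarily how the list above was derived:

```r
# Made-up category labels standing in for l$Category
category <- c("Web Portals", "Online Games", "Web Portals", "Newspapers",
              "Web Portals", "Online Games")

counts <- sort(table(category), decreasing = TRUE)  # frequency per category, largest first
top2   <- head(counts, 2)                           # keep only the most frequent ones

print(top2)
```

Passing the truncated counts to barplot() instead of all 216 categories would also give a chart that is readable at normal size.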

I’ll leave subsequent analysis to you. To limit yourself to specific columns:

l[,c('Site','Users','Views','Reach')]

You will probably want to start munging the numeric data, which first means converting the string values to numbers.

l$Views=as.numeric(gsub(',','',l$Views))
l$Reach=as.numeric(gsub('%','',l$Reach))
l$Users=as.numeric(gsub(',','',l$Users))
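Note that gsub() always returns character strings, so a final as.numeric() is needed before the values behave as numbers. The helper below, to_number, is a hypothetical name introduced for illustration, and the input strings are made up in the style of the Views and Reach columns:

```r
# Hypothetical helper: strip thousands separators and percent signs, then convert
to_number <- function(x) {
  as.numeric(gsub("[,%]", "", as.character(x)))
}

# Made-up strings in the same style as the table's Views and Reach columns
views <- to_number(c("1,200,000", "35,000"))
reach <- to_number(c("12.5%", "0.4%"))

print(views)
print(is.numeric(reach))
```

The as.character() call is there so the helper also works when a column was read in as a factor.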

Having done so, you will be able to plot them…

plot(l$Reach)
plot(l$Users)

It might be more interesting to focus on sites in a particular category. For example, if your niche is Cooking and Recipes:

l[l$Category=='Cooking & Recipes',c('Site','Users','Views','Reach')]

And for sites dedicated to the Java programming language:

l[l$Category=='Java', c('Site','Users','Views','Reach')]
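To look at several categories at once, the == test can be swapped for %in%. A sketch on a toy stand-in for the table; the site names and numbers are made up:

```r
# Toy stand-in for the scraped table; sites and numbers are made up
sites <- data.frame(
  Site     = c("a.example", "b.example", "c.example", "d.example"),
  Category = c("Java", "Cooking & Recipes", "Newspapers", "Java"),
  Views    = c(500, 1200, 800, 300),
  stringsAsFactors = FALSE
)

# Keep rows whose category is in the wanted set, then sort by views
wanted <- c("Java", "Cooking & Recipes")
picked <- sites[sites$Category %in% wanted, c("Site", "Category", "Views")]
picked <- picked[order(picked$Views, decreasing = TRUE), ]

print(picked)
```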

I’d love to see your ideas for analyzing this data in the comments…it’s a great opportunity to show off your analytical and R skills!
