The 1000 most-visited sites analyzed using R

June 5, 2010
(This article was first published on R-Chart, and kindly contributed to R-bloggers)

Ever wondered which Computers & Electronics web sites get the most page views? Based upon data recently published by Google, you can find out with a few lines of R. The program to create a bar chart of these sites is as follows:

library(XML)

# URL for the Google Data
u="http://www.google.com/adplanner/static/top1000/"
tables = readHTMLTable(u)
l=tables[[2]]   # the rankings are in the second table on the page

# Name the columns
colnames(l)=c('Rank','Site','Category','Users','Reach','Views','Advertising?')

# Extract Computers and Electronics Subset
CandE=l[l$Category=='Computers & Electronics', c('Site','Views')]
rownames(CandE)=CandE[,1]
CandE=CandE[-1]
CandE$Views=as.numeric(gsub(',','',CandE$Views))
# Sort the sites by page views (keeping row names aligned with the values)
CandE=CandE[order(CandE$Views, decreasing=TRUE), , drop=FALSE]
par(las=2, mar=c(12, 10, 1, 2) + 0.1)
barplot(t(as.matrix(CandE)),yaxt = "n", ylab = "", main="Top Computers & Electronics Sites", col="orange")
# axis() has no big.mark argument, so format the tick labels explicitly
ticks=axTicks(2)
axis(side = 2, at = ticks, labels = formatC(ticks, format = "d", big.mark = ","))


Discussion and Interactive Commands...

Google recently posted the 1000 most-visited sites on the web. The data is displayed in tabular format, but is a bit unwieldy to view interactively. Sounds like a candidate for some analysis using R and the XML package! I was very impressed by how easy it was to scrape the relevant data and produce meaningful summaries with only a few lines of code.

library(XML)
u="http://www.google.com/adplanner/static/top1000/"
tables = readHTMLTable(u)
l=tables[[2]]
colnames(l)=c('Rank','Site','Category','Users','Reach','Views','Advertising?')

These few lines of code read in the data available at the site and assign column names. To see a sample of the data we now have available:

head(l)

Both numerical and categorical data are available in the table. For example, each site is categorized by whether or not it includes advertising.

summary(l$`Advertising?`)

68% of the sites listed do advertise:

# Percentage of sites that advertise (the column holds "Yes"/"No" values)
ad=summary(l$`Advertising?`)
ad["Yes"] / sum(ad) * 100

Plotting subsets of the data is probably the best way to go - but to start, I wanted to see which categories of sites appear most frequently on the list.

# Horizontal bar chart of category frequencies; note that summary() on a
# factor collapses the least frequent levels into "(Other)" past its default maxsum
par(las=2, mar=c(4, 12, 1, 2) + 0.1)
barplot(sort(summary(l$Category), decreasing=FALSE), horiz=TRUE, cex.names=0.6)



I realize that this is impossible to read here; you will need to run it yourself. There are 216 unique categories.

length(unique(l$Category))

The Top 10 are:
  1. Other
  2. Web Portals
  3. Social Networks
  4. Online Games
  5. File Sharing & Hosting
  6. Newspapers
  7. Blogging Resources & Services
  8. News & Current Events
  9. Computers & Electronics
  10. Search Engines
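
This list can be reproduced directly from the scraped table (a quick sketch, assuming the l data frame built above):

# Count the sites in each category and show the ten most frequent
head(sort(table(l$Category), decreasing=TRUE), 10)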
I'll leave subsequent analysis to you. To limit yourself to specific columns:

l[,c('Site','Users','Views','Reach')]

You will probably want to begin munging the numeric data; the string values contain commas and percent signs that must be stripped before they can be interpreted as numbers.

l$Views=as.numeric(gsub(',','',l$Views))
l$Reach=as.numeric(gsub('%','',l$Reach))
l$Users=as.numeric(gsub(',','',l$Users))

Having done so, you will be able to plot them...

plot(l$Reach)
plot(l$Users)
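
Going one step further (just a sketch of one possible direction, not from the original post), the two numeric columns can be plotted against each other:

# Scatter plot of audience size against reach for the top 1000 sites
plot(l$Users, l$Reach, xlab="Unique Users", ylab="Reach (%)",
     main="Reach vs. Unique Users")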

It might be more interesting to focus on sites in a particular category. For example, if your niche is Cooking and Recipes:

l[l$Category=='Cooking & Recipes',c('Site','Users','Views','Reach')]

And for sites dedicated to the Java programming language:

l[l$Category=='Java', c('Site','Users','Views','Reach')]
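
And as one more sketch (assuming the numeric conversions above have been applied), a subset can be ordered by page views:

# Java-related sites sorted by page views, highest first
java=l[l$Category=='Java', c('Site','Users','Views','Reach')]
java[order(java$Views, decreasing=TRUE),]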

I'd love to see your ideas for analyzing this data in the comments...it's a great opportunity to show off your analytical and R skills!
