The 1000 most-visited sites analyzed using R

[This article was first published on R-Chart, and kindly contributed to R-bloggers.]

Ever wondered which Computers & Electronics web sites get the most page views? Based on data recently published by Google:

[Figure: bar chart titled "Top Computers & Electronics Sites", showing page views per site]

The R program to create this graph is as follows:


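library(XML)  # provides readHTMLTable()

# URL for the Google Data
# (u must be set to the address of Google's top-1000 page before this line runs)
tables = readHTMLTable(u)

# Name the columns
# (l must be the data frame taken from tables, with columns that include Site, Category and Views)

# Extract Computers and Electronics Subset
CandE = l[l$Category == 'Computers & Electronics', c('Site', 'Views')]
# Views arrive as text; this assumes digits with comma separators
views = as.numeric(gsub(',', '', as.character(CandE$Views)))
par(las=2, mar=c(12, 10, 1, 2) + 0.1)
barplot(views, names.arg = as.character(CandE$Site), yaxt = "n", ylab = "", main = "Top Computers & Electronics Sites", col = "orange")
# axis() has no scientific or big.mark arguments, so the tick labels are formatted explicitly
ticks = axTicks(2)
axis(side = 2, at = ticks, labels = format(ticks, scientific = FALSE, big.mark = ","))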

Discussion and Interactive Commands…

Google recently posted the 1000 most-visited sites on the web. The data is displayed in tabular format, but is a bit unwieldy to view interactively. Sounds like a candidate for some analysis using R and the XML package! I was very impressed by how easy it is to scrape the relevant data and produce meaningful summaries with only a few lines of code.

tables = readHTMLTable(u)

These few lines of code read in the data available at the site and assign column names. To see a sample of the data we now have available:
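As a minimal sketch of this read-and-name step, assuming the XML package is installed: the snippet parses a tiny inline HTML table (invented sites and numbers, not Google's data) rather than the live page. The table index `[[1]]` and the column names are assumptions chosen to match the names used elsewhere in this post.

```r
library(XML)  # provides htmlParse() and readHTMLTable()

# A tiny stand-in page; the sites and numbers are invented for illustration.
html <- '<html><body><table>
<tr><th>c1</th><th>c2</th><th>c3</th></tr>
<tr><td>example-one.com</td><td>Search Engines</td><td>1,200,000</td></tr>
<tr><td>example-two.com</td><td>Java</td><td>950,000</td></tr>
</table></body></html>'

doc <- htmlParse(html, asText = TRUE)

# readHTMLTable() returns a list with one data frame per <table> element.
tables <- readHTMLTable(doc, header = TRUE, stringsAsFactors = FALSE)
l <- tables[[1]]  # assumption: which list element holds the data depends on the page

# Name the columns (assumed names, matching those used in this post).
names(l) <- c("Site", "Category", "Views")

# Look at the first few rows.
head(l)
```

On a real page, `length(tables)` and `names(tables)` are a quick way to find which element holds the data before picking one.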


Both numerical and categorical data are available in the table. For example, each site is flagged by whether or not it includes advertising.


68% of the sites listed do advertise:

(ad[1] / ad[2]) * 100
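One way to arrive at such a percentage, sketched on a made-up vector: the flag values "Yes" and "No" are assumptions, and `ad` here is simply a table of counts, which may differ from how `ad` was built for the expression above.

```r
# Made-up flags standing in for the advertising column of l.
advertising <- c("Yes", "Yes", "No", "Yes", "No")

# Count each value, then turn the counts into percentages.
ad <- table(advertising)
pct <- prop.table(ad) * 100

# Share of entries flagged "Yes".
pct_yes <- as.numeric(pct["Yes"])

# The mean of a logical vector gives the same share directly.
pct_yes_direct <- mean(advertising == "Yes") * 100
```

Dividing the "Yes" count by the total, rather than by the "No" count, is what makes the result a share of all sites.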

Plotting subsets of the data is probably the best way to go – but to start, I wanted to see which categories of sites appear most frequently on the list.

par(las=2, mar=c(4, 12, 1, 2) + 0.1)
barplot(sort(summary(l$Category), decreasing=FALSE), horiz=TRUE, cex.names=0.6)

I realize that this is impossible to read; you will need to run it yourself. There are 216 unique categories.


The Top 10 are:
  1. Other
  2. Web Portals
  3. Social Networks
  4. Online Games
  5. File Sharing & Hosting
  6. Newspapers
  7. Blogging Resources & Services
  8. News & Current Events
  9. Computers & Electronics
  10. Search Engines
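A ranking like this can be produced with `table()` and `sort()`. The sketch below uses a handful of invented labels in place of `l$Category`. One caveat worth knowing: `summary()` on a factor folds everything beyond its `maxsum` limit (100 by default) into a single "(Other)" entry, which matters with 216 categories, whereas `table()` counts every distinct value.

```r
# Made-up category labels standing in for l$Category.
category <- c("Web Portals", "Other", "Java", "Other", "Web Portals", "Other")

# table() counts every distinct value, for character vectors and factors alike.
counts <- sort(table(category), decreasing = TRUE)

# Number of unique categories.
n_categories <- length(counts)

# Names of the most frequent categories (use 1:10 on the full data).
top_names <- names(counts)[1:2]
top_names
```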
I’ll leave subsequent analysis to you. To limit yourself to specific columns:
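A minimal sketch of column selection, using a toy data frame in place of `l` (all values are invented; the column names follow those used in this post):

```r
# Toy data frame standing in for l; all values are invented.
l <- data.frame(
  Site = c("example-one.com", "example-two.com"),
  Category = c("Java", "Search Engines"),
  Users = c("1,000", "2,500"),
  Views = c("10,000", "40,000"),
  stringsAsFactors = FALSE
)

# Keep every row but only the named columns.
site_views <- l[, c("Site", "Views")]
site_views
```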


You will probably want to start munging the numeric data, which first means converting the string values to numbers.
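A sketch of one such conversion, assuming the values are digit strings with commas as thousands separators (the values below are invented, and the real columns may need different cleaning):

```r
# Invented values in the assumed format.
views_text <- c("1,200,000", "950,000", "87")

# as.character() guards against the column having been read as a factor;
# gsub() strips the separators before the numeric conversion.
views <- as.numeric(gsub(",", "", as.character(views_text), fixed = TRUE))
views
```

Any entry that still contains non-numeric characters after the cleanup becomes `NA` with a warning, which is a handy signal that the assumed format does not hold.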


Having done so, you will be able to plot them…
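For instance, a sorted horizontal bar chart with readable axis labels could look like the sketch below. The site names and view counts are invented, and the margins are just a starting point.

```r
# Invented values standing in for columns of l after numeric conversion.
sites <- c("example-one.com", "example-two.com", "example-three.com")
views <- c(1200000, 950000, 400000)

# Sort ascending so the largest bar ends up at the top of a horizontal chart.
ord <- order(views)

par(las = 1, mar = c(4, 10, 1, 2) + 0.1)  # horizontal labels, wide left margin
mids <- barplot(views[ord], names.arg = sites[ord], horiz = TRUE,
                xaxt = "n", col = "orange")

# Draw the value axis with comma-formatted labels instead of scientific notation.
ticks <- axTicks(1)
axis(side = 1, at = ticks,
     labels = format(ticks, big.mark = ",", scientific = FALSE))
```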


It might be more interesting to focus on sites in a particular category. For example, if your niche is Cooking & Recipes:

l[l$Category=='Cooking & Recipes', c('Site','Users','Views','Reach')]

And for sites dedicated to the Java programming language:

l[l$Category=='Java', c('Site','Users','Views','Reach')]

I’d love to see your ideas for analyzing this data in the comments…it’s a great opportunity to show off your analytical and R skills!
