IEOR Tools Tutorial: Learning XML with R

August 16, 2010

(This article was first published on Maximize Productivity with Industrial Engineer and Operations Research Tools, and kindly contributed to R-bloggers)

I have been using R a lot in my work lately.  R is an open source statistical computing platform, but saying R is only used for statistics does not do it justice.  I am finding it to be a really powerful platform for both statistics and optimization, and I have yet to find a task it cannot handle.  Lately I’ve been curious about how my blog performs compared to other blogs, so I thought I would use this opportunity to show how I measured that in R.  I want to rank Operations Research blogs using the Alexa ranking system.  Unfortunately Alexa does not have a search function for Operations Research blogs, so I am going to have to gather the information myself using R.

This R tutorial uses the XML package.  Packages extend R with functionality that the base platform does not provide on its own, and there are many of them covering a wide variety of problems.

The first step is to load the XML package into the current R workspace.  If you do not have the XML package installed on your computer, you will have to install it from the CRAN repositories first.
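A minimal sketch of that install-then-load step follows. The helper name ensure_package is hypothetical (it is not part of R or of the XML package); it only wraps the standard requireNamespace, install.packages, and library calls.

```r
# Hypothetical helper: install a package from CRAN only when it is missing,
# then attach it to the current session.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # character.only = TRUE lets library() accept the name as a string.
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}
```

Calling ensure_package("XML") once at the top of a script would then make htmlTreeParse and the other XML functions available.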

Once the XML package is loaded, the actual programming begins.  I need to save the URL of the Alexa information into the workspace, along with the list of blogs to look up.  Once I have these variables, I can move on to using the XML package to gather the information.
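As a small illustration of how the page addresses can be assembled, here is a sketch with placeholder values. The base address and the two site names are made up for the example and are not the values used in this post.

```r
# Placeholder base address and site names (assumptions, not the real values).
base_url <- "https://example.com/siteinfo/"
sites <- c("blog-one.example", "blog-two.example")

# paste0() is vectorized, so it builds one full address per site.
urls <- paste0(base_url, sites)
```

The loop in this post builds the same kind of string one site at a time with paste(..., sep="").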

The main functions used from the XML package are htmlTreeParse, getNodeSet, and readHTMLTable.  htmlTreeParse grabs the HTML from the URL and stores it as a parsed document tree.  getNodeSet is a retrieval function that grabs only the nodes you specify.  In this instance it is looking for table nodes with an id value equal to siteStats that sit directly inside a div node.  readHTMLTable takes the siteStats node and turns it into a table of data values.
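To see the three functions working together without touching the network, here is a minimal sketch that parses a small HTML string instead of a URL. The HTML fragment is made up for illustration and is only assumed to resemble the kind of table being scraped; it is not real Alexa markup.

```r
library(XML)

# Made-up HTML fragment standing in for a downloaded page.
html <- '<html><body><div>
<table id="siteStats">
<tr><td>Traffic Rank</td><td>154,736</td></tr>
<tr><td>Sites Linking In</td><td>120</td></tr>
</table></div></body></html>'

# asText = TRUE says that `html` is the document itself, not a file name or URL.
doc <- htmlTreeParse(html, asText = TRUE, useInternalNodes = TRUE)

# XPath: every <table> with id "siteStats" that is a direct child of a <div>.
nset <- getNodeSet(doc, "//div/table[@id='siteStats']")

# Convert each matched table node into a data frame (no header row here).
tables <- lapply(nset, readHTMLTable, header = FALSE)
stats <- tables[[1]]

# The second column holds the values; the rank is still text with a comma.
rank_text <- as.character(stats[[2]])[1]
```

The value in rank_text still contains a thousands separator, which is why a text clean-up step is needed before it can be converted to a number.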

While gathering the Alexa information with XML I’m also going to have to format the data into a readable structure.  This requires converting the HTML into tables and manipulating text strings.  Notice the use of the functions readHTMLTable, strsplit, and gsub to format the data.  All of this happens inside a for loop that performs the XML parsing and text formatting one URL at a time.  I’ve also created a data frame to collect all of the relevant information in a readable table.
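The text clean-up can be tried on its own with base R. This is a sketch on a made-up cell value; the exact text that Alexa returned is an assumption here.

```r
# Hypothetical cell text: a rank with thousands separators, then a newline
# followed by extra text that is not wanted.
raw_cell <- " 1,444,318 \nGlobal rank"

# Split on the newline and keep the first piece, which holds the number.
first_line <- strsplit(raw_cell, "\n")[[1]][1]

# Remove spaces, then commas, so that as.numeric() can parse what is left.
rank <- gsub(" ", "", first_line)
rank <- as.numeric(gsub(",", "", rank))
```

If the text cannot be parsed as a number, as.numeric() returns NA with a warning instead of stopping the script.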

The following is the R code.


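library(XML)

urlbeg <- ""      # base URL of the Alexa site information page (elided)

urllist <- c()    # blog domains to look up (elided)

ORrank <- data.frame()

for (i in seq_along(urllist)) {
    # Build the page address for this blog and parse the HTML.
    url <- paste(urlbeg, urllist[i], sep="")
    doc <- htmlTreeParse(url, useInternalNodes=TRUE)

    # Select the siteStats table and convert it to a data frame.
    nset <- getNodeSet(doc, "//div/table[@id='siteStats']")

    tables <- lapply(nset, readHTMLTable)

    # Keep the text before the first newline, then strip spaces and commas.
    rankstr <- tables[[1]][2]
    rankstrdf <- strsplit(as.character(rankstr$V2), "\n")
    rank <- gsub(" ", "", rankstrdf[[1]][1])
    rank <- as.numeric(gsub(",", "", rank))
    tmpdf <- data.frame(ORblog=urllist[i], AlexaRank=rank)

    ORrank <- rbind(ORrank, tmpdf)
}

# Sort by rank (NA values are placed last) and renumber the rows.
ORrank <- ORrank[order(ORrank$AlexaRank),]
rownames(ORrank) <- seq_len(nrow(ORrank))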

Here is a final output from the ORrank data frame.

   ORblog AlexaRank
1            154736
2            308410
3           1444318
4           1484646
5           1504658
6           1631529
7           1711672
8           1955830
9           2550459
10          2625563
11          3002085
12          3303052
13          3811636
14          4068033
15          4281627
16          5047922
17          6052089
18          6134442
19          6674061
20          7373428
21          8516473
22          8666209
23          9437585
24         12225347
25         12571553
26         13784064
27         15236071
28         19401625
29         20064295
30         21294575
31         22329286
32         24431355
33         25165358
34         25304653
35         27537074
36               NA
37               NA
38               NA
39               NA
40               NA
41               NA
42               NA
43               NA
44               NA
45               NA
46               NA
47               NA
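The NA rows end up at the bottom because order() places missing values last by default. A minimal sketch with made-up blog names and ranks shows the effect:

```r
# Hypothetical data; NA marks a site for which no rank could be parsed.
demo <- data.frame(ORblog = c("blog-a", "blog-b", "blog-c"),
                   AlexaRank = c(308410, NA, 154736))

# order() sorts ascending and, with its default na.last = TRUE, puts NA at the end.
sorted <- demo[order(demo$AlexaRank), ]

# Reset the row names so they run 1, 2, 3 in the sorted order.
rownames(sorted) <- NULL
```

Here blog-c comes first, blog-a second, and the unranked blog-b last.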

Not exactly in the friendliest of formats but it does the trick.  I hope that this will help others who wish to use the powerful XML package with R.  I know I have definitely learned a lot about XML in the process.  I also found out that I have a lot more work to do with my blog.

Note:  If you are wondering why Michael Trick’s blog is missing, there is a reason.  Unfortunately his blog and some others are hosted on a sub-domain of another site not affiliated with the blog.  This means Alexa cannot rank them against blogs that have their own primary domain.  Yet everyone in the Operations Research community knows where Michael’s blog ranks anyway.
