Scraping pages and downloading files using R

October 1, 2012

(This article was first published on Quantum Forest » rblogs, and kindly contributed to R-bloggers)

I have written a few posts discussing descriptive analyses of the evaluation of National Standards for New Zealand primary schools. The data for roughly half of the schools was made available by the media, but the full version of the dataset is provided on a school-by-school basis. On the page for a given school there may be a link to a PDF file with the information on standards sent by the school to the Ministry of Education.

I’d like to keep a copy of the PDF reports for all the schools for which I do not have performance information, so I decided to write an R script to download just over 1,000 PDF files. Once I have identified all the schools with missing information I just loop over the list, using the fact that all URLs for the school pages start with the same prefix. I download the page, look for the link to the PDF file and then download that file, saving it as school_schoolnumber.pdf. And that’s it.
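
The three small steps described above (build the page address, pick out the PDF link, name the local file) can be sketched in isolation using base R only. This is a rough illustration rather than the script itself: the helper names school.page.url, first.pdf.link and pdf.destination are hypothetical, and the example links are made up to stand in for the href attributes of a real school page.

```r
# Values taken from the script in this post
base.url <- 'http://www.educationcounts.govt.nz/find-a-school/school/national?school='
download.folder <- '~/Downloads/schools/'

# Hypothetical helper: address of the page for one school
school.page.url <- function(school.id) {
  paste(base.url, school.id, sep = '')
}

# Hypothetical helper: first link that mentions 'pdf', or NA if there is none
first.pdf.link <- function(links) {
  hits <- grep('pdf', links, value = TRUE)
  if (length(hits) > 0) hits[1] else NA_character_
}

# Hypothetical helper: local file name for the downloaded report
pdf.destination <- function(school.id) {
  paste(download.folder, 'school_', school.id, '.pdf', sep = '')
}

# Made-up links, standing in for the links found on a school page
example.links <- c('/find-a-school', '/data/report-1234.pdf',
                   'mailto:someone@example.org')

school.page.url(1234)
first.pdf.link(example.links)
pdf.destination(1234)
```

Returning NA when no link matches makes it easy to skip schools that have not submitted a report.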

Of course life would be a lot simpler if the Ministry of Education made the information available in a usable form for analysis.

library(XML) # HTML processing
options(stringsAsFactors = FALSE)

# Base URL
base.url = 'http://www.educationcounts.govt.nz/find-a-school/school/national?school='
download.folder = '~/Downloads/schools/'

# Schools directory
directory <- read.csv('Directory-Schools-Current.csv')
directory <- subset(directory, 
                    !(school.type %in% c("Secondary (Year 9-15)", "Secondary (Year 11-15)")))

# Reading file from stuff.co.nz, obtained from here:
# http://schoolreport.stuff.co.nz/index.html
fairfax <- read.csv('SchoolReport_data_distributable.csv')
fairfax <- subset(fairfax, !is.na(reading.WB)) 

# Defining schools with missing information
to.get <- merge(directory, fairfax, by = 'school.id', all.x = TRUE)
to.get <- subset(to.get, is.na(reading.WB))

# Looping over schools, to find name of PDF file
# with information and download it

for(school in to.get$school.id){
  
  # Read HTML file, extract PDF link name
  cat('Processing school ', school, '\n')
  doc.html <- htmlParse(paste(base.url, school, sep = ''))
  doc.links <- xpathSApply(doc.html, "//a/@href")
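  pdf.url <- as.character(doc.links[grep('pdf', doc.links)])
  if(length(pdf.url) > 0) {
    pdf.name <- paste(download.folder, 'school_', school, '.pdf', sep = '')
    # Use only the first matching link (download.file expects a single URL here)
    # and write in binary mode so that the PDF is not corrupted on Windows
    download.file(pdf.url[1], pdf.name, method = 'auto', quiet = FALSE, mode = "wb",
                  cacheOK = TRUE, extra = getOption("download.file.extra"))
  }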
}
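
The step that defines the schools with missing information is a left join followed by a filter on NA. Here is a toy version that can be run without the real files. The two data frames are made up for illustration (the school types other than the secondary ones are assumptions); only the column names and the secondary school labels come from the script.

```r
# Made-up directory: five schools, one of them secondary
directory <- data.frame(
  school.id   = c(1, 2, 3, 4, 5),
  school.type = c('Full Primary', 'Contributing', 'Secondary (Year 9-15)',
                  'Full Primary', 'Intermediate'),
  stringsAsFactors = FALSE
)

# Made-up media file: school 2 is listed but has no reading data
fairfax <- data.frame(
  school.id  = c(1, 2, 4),
  reading.WB = c(12, NA, 7)
)

# Same steps as the script: drop secondary schools, keep schools with data
directory <- subset(directory,
                    !(school.type %in% c('Secondary (Year 9-15)', 'Secondary (Year 11-15)')))
fairfax <- subset(fairfax, !is.na(reading.WB))

# Left join keeps every remaining directory school; unmatched ones get NA
to.get <- merge(directory, fairfax, by = 'school.id', all.x = TRUE)
to.get <- subset(to.get, is.na(reading.WB))

to.get$school.id  # schools 2 and 5 are left to download
```

School 3 drops out because it is a secondary school, schools 1 and 4 already have data, and schools 2 and 5 end up in to.get.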

Can you help?

It would be great if you could help me get the information out of the reports. The following link randomly chooses a school; click on the “National Standards” tab and open the PDF file.

Then type the achievement numbers for reading, writing and mathematics in this Google Spreadsheet. No need to worry about different values per sex or ethnicity; the total values will do.

Gratuitous picture: a simple summer lunch (Photo: Luis).
