# How can you do a smart job of getting data from the internet?

October 9, 2011
By Daniel Marcelino

(This article was first published on Daniel Marcelino » R, and kindly contributed to R-bloggers)

I’d like to explore the capabilities of my statistical packages to fetch data online and hold it in memory, instead of downloading each dataset by hand. The task turned out to be pretty easy, but it still kept me out of bed for a night trying to find the most efficient way to loop over the files and store them the right way. So, let’s start. You can find a file here with a list of web addresses where each file we are about to download lives. These files contain all the registered details about the revenues and expenditures of each candidate in the last election in Brazil. That means more than 22 thousand .csv2 files, where each file represents one candidate (i). For this task I’ll use only the revenues data. Finally, I’m going to show the same steps using R.

```r
require(xlsx)
web <- read.xlsx(file.choose(), 1)  # spreadsheet with one URL per row
mysites <- web$web
rm(web)  # remove it because I need a lot of memory
# run this code and relax for three or four hours
big.data <- NULL
for (i in mysites) {
  base <- NULL  # reset, so a failed download doesn't bind the previous file twice
  try(base <- read.table(i, sep = ";", header = TRUE, as.is = TRUE,
                         fileEncoding = "windows-1252"), silent = TRUE)
  if (!is.null(base)) big.data <- rbind(big.data, base)
}
# ... half a day later
names(big.data)
head(big.data, 10)
tail(big.data, 10)
fix(base)          # interactive look at the last file read
str(big.data)      # structure of the combined data (was a typo: srt)
```
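One way to cut that half day down: calling `rbind()` inside the loop re-copies every accumulated row on each pass, which gets quadratically slower over 22 thousand files. A minimal sketch of the alternative, collecting the pieces in a list and binding once at the end (the helper name `read_all` is mine, not from the original post):

```r
# Sketch: fetch every URL, skip the ones that fail, bind the rest once.
# Assumes the same file format as above (semicolon-separated, windows-1252).
read_all <- function(urls) {
  pieces <- lapply(urls, function(u) {
    tryCatch(
      read.table(u, sep = ";", header = TRUE, as.is = TRUE,
                 fileEncoding = "windows-1252"),
      error = function(e) NULL)  # NULL entries vanish in the rbind below
  })
  do.call(rbind, pieces)  # one bind instead of thousands
}
```

Used as `big.data <- read_all(mysites)`, this gives the same combined data frame, and the `tryCatch()` also replaces the stateful `try()` pattern, so a failed download can never duplicate the previous candidate's rows.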