How can you do a smart job getting data from the internet?

October 9, 2011
By Daniel Marcelino

(This article was first published on Daniel Marcelino » R, and kindly contributed to R-bloggers)

I’d like to explore the capabilities of my statistical packages for getting data online and holding it in memory, instead of downloading each dataset by hand. The task turned out to be pretty easy, but it kept me out of bed for one night trying to find the most efficient way to loop across the files and store them the right way. So, let’s start. You can find one file here with a list of web addresses where each file we are about to download is located. These files contain all the registered details about the revenues and expenditures of each candidate in the last election in Brazil. That means more than 22 thousand semicolon-delimited CSV files, one per candidate (i). For this task, I’ll use just the revenue data. Finally, I’m going to show the steps using R.

require(xlsx)
web <- read.xlsx(file.choose(), 1)  # spreadsheet holding the list of URLs
mysites <- web$web                  # one web address per candidate file
rm(web)  # remove it because I need a lot of memory
# run this code and relax for three or four hours
big.data <- NULL
for (i in mysites) {
  base <- NULL  # reset each pass, so a failed download is not appended twice
  try(base <- read.table(i, sep = ";", header = TRUE, as.is = TRUE,
                         fileEncoding = "windows-1252"), TRUE)
  if (!is.null(base)) big.data <- rbind(big.data, base)
}
# ... half a day later
names(big.data)
head(big.data, 10)
tail(big.data, 10)
fix(base)      # inspect (and hand-edit) the last file read
str(big.data)  # show the structure of the combined data
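
One caveat worth noting: growing big.data with rbind() inside the loop forces R to copy the accumulated data frame on every iteration, which is a big part of why the run takes half a day. Here is a minimal sketch of a faster variant, assuming the same mysites vector as above: read each file into a list element, then bind everything in a single call at the end.

# Faster sketch (assumes 'mysites' as defined above): collect the pieces
# in a list, then bind them once instead of growing big.data in the loop.
pieces <- lapply(mysites, function(i)
  tryCatch(read.table(i, sep = ";", header = TRUE, as.is = TRUE,
                      fileEncoding = "windows-1252"),
           error = function(e) NULL))  # failed downloads become NULL
big.data <- do.call(rbind, pieces)     # NULL entries are ignored by rbind

Since rbind() skips the NULL entries left by failed downloads, the result matches the loop version without the repeated copying.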
