How to import a subset of a (too huge) csv file

[This article was first published on R – Inside data, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

01_sqldf

Sometimes people create csv files that are just too huge to upload them into your R session while most of the times you just need a subset of this data set. Recently I tapped into this problem and first I tried to import the whole file with functions like fread or the classic read.csv but this did not help much as the file was just too big and my computer failed to import it. With the awesome read.csv.sql function from the sqldf package I found a good way to solve my problem. This function enables you to use SQL statements within the import function which make it possible to select only a subset of the file to reduce the import size.

mydata = read.csv.sql("mydata.csv", sql = "select * from file where City = '\"Hamburg\"' ", sep = ";")

The code above loads only those lines of the file in where the city is Hamburg. I still had trouble to with the encoding that is why I used this ugly string with backlashes in the SQL statement. I will leave it like this as you might having the same problem.

Make sure you check out my Github for other data driven projects.

To leave a comment for the author, please follow the link and comment on their blog: R – Inside data.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)