Know Your Dataset: Specifying colClasses to load up an ffdf

[This article was first published on Data and Analysis with R, at Work, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

When I finally figured out how to successfully use the ff package to load data into R, I was apparently working with relatively pain free data to load up through read.csv.ffdf (see my previous post).  Just this past Sunday, I naively followed my own post to load a completely new dataset (over 400,000 rows and about 180 columns) for analysis.  Unfortunately for me, the data file was a bit messier, and so read.csv.ffdf wasn’t able to finalize the column classes by itself.  It would chug along until certain columns in my dataset, which it at first took to be one data type, proved to be a different data type, and then it would give me an error message basically telling me it didn’t want to adapt to the changing assumptions of which data type each column represented.

So, I set out to learn how I could use the colClasses argument in the read.csv.ffdf command to manually set the data types for each column.  I adapted the following solution from a stackoverflow thread about specifying colClasses in the regular read.csv function.

First, load up a sample of the big dataset using the read.csv command (The following is obviously non-random. If you can figure out how to read the sample in randomly, I think it would work much better):

headset = read.csv(fname, header = TRUE, nrows = 5000)

The next command generates a list of all the variable names in your dataset, and the classes R was able to derive based on the number of rows you imported:

headclasses = sapply(headset, class)

Now comes the fairly manual part. Look at the list of variables and classes (data types) that you generated, and look for obvious mismatches. Examples could be a numeric variable that got coded as a factor or logical, or a factor that got coded as a numeric. When you find such a mismatch, the following syntax suffices for changing a class one at a time:

headclasses["variable.name"] = "numeric"

Obviously, the “variable.name” should be replaced by the actual variable name you’re reclassifying, and the “numeric” string can also be “factor”, “ordered”, “Date”, “POSIXct” (the last two being date/time data types). Finally, let’s say you want to change every variable that got coded as “logical” into “numeric”. Here’s some syntax you can use:

headclasses[grep("logical", headclasses)] = "numeric"

Once you are certain that all the classes represented in the list you just generated and modified are accurate to the dataset, you can load up the data with confidence, using the headclasses list:

bigdataset = read.csv.ffdf(file="C:/big/data/loc.csv", first.rows=5000, colClasses=headclasses)

This was certainly not easy, but I must say that I seem to be willing to jump through many hoops for R!!


To leave a comment for the author, please follow the link and comment on their blog: Data and Analysis with R, at Work.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)