R⁶ — Idiomatic (for the People)

May 23, 2017

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

NOTE: I’ll do my best to ensure the next post will have nothing to do with Twitter, and this post might not completely meet my R⁶ criteria.

A single, altruistic, nigh exuberant R tweet about slurping up a directory of CSVs devolved quickly — at least in my opinion, and partly (sadly) with my aid — into a thread that ultimately strayed from a crucial point: idiomatic is in the eye of the beholder.

I’m not linking to the Twitter thread, but there are enough folks with sufficient Klout scores on it (is Klout even still a thing?) that you can easily find it if you feel so compelled.

I’ll take a page out of the U.S. High School “write an essay” playbook and start with a definition of idiomatic:

using, containing, or denoting expressions that are natural to a native speaker

That comes from idiom:

a form of expression natural to a language, person, or group of people

I usually joke with my students that a strength (and weakness) of R is that there are ~twelve ways to do any given task. While the statement is deliberately hyperbolic, the core message is accurate: there’s more than one way to do most things in R. A cascading truth is: what makes one way more “correct” than another often comes down to idiom.

My rstudio::conf 2017 presentation included an example of my version of using purrr for idiomatic CSV/JSON directory slurping. There are lots of ways to do this in R (the point of the post is not really to show you how to do the directory slurping and it is unlikely that I’ll approve comments with code snippets about that task). Here are three: one from the base R tribe, one from the data.table tribe, and one from the tidyverse tribe:

# We need some files and we'll use base R to make some
dir.create("readings")
for (i in 1970:2010) write.csv(mtcars, file.path("readings", sprintf("%s.csv", i)), row.names=FALSE)

# Escape the dot so the pattern matches a literal ".csv" extension
fils <- list.files("readings", pattern = "\\.csv$", full.names=TRUE)

do.call(rbind, lapply(fils, read.csv, stringsAsFactors=FALSE))

data.table::rbindlist(lapply(fils, data.table::fread))

purrr::map_df(fils, readr::read_csv)

You get data for all the “years” into a data.frame, data.table and tibble (respectively) with those three “phrases”.
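
As a quick sanity check on the base R phrase, here is a self-contained sketch that uses only base R. It assumes a throwaway directory under `tempdir()` (a choice made for this example, not part of the setup above) so it can run without touching your working directory.

```r
# Assumption: a scratch directory under tempdir() stands in for "readings"
readings_dir <- file.path(tempdir(), "readings_check")
dir.create(readings_dir, showWarnings = FALSE)

# One copy of mtcars per "year", mirroring the setup above
for (i in 1970:2010) {
  write.csv(mtcars, file.path(readings_dir, sprintf("%s.csv", i)), row.names = FALSE)
}

fils <- list.files(readings_dir, pattern = "\\.csv$", full.names = TRUE)
combined <- do.call(rbind, lapply(fils, read.csv, stringsAsFactors = FALSE))

length(fils)   # 41 files, one per year from 1970 through 2010
dim(combined)  # 41 * nrow(mtcars) rows and ncol(mtcars) columns
```

The same check should hold for the data.table and tidyverse phrases, since only the class of the result differs.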

However, what if you want the year as a column? Many of these “datalogger” CSV data sets do not have a temporal “grouping” variable as they let the directory structure & naming conventions embed that bit of metadata. That information would be nice, though:

do.call(rbind, lapply(fils, function(x) {
  f <- read.csv(x, stringsAsFactors=FALSE)
  f$year <- gsub("^readings/|\\.csv$", "", x)
  f
}))

dt <- data.table::rbindlist(lapply(fils, data.table::fread), idcol="year")
dt[, year := gsub("^readings/|\\.csv$", "", fils[year])]

# %>% comes from magrittr (dplyr re-exports it); attach one of them first
library(magrittr)

purrr::map_df(fils, readr::read_csv, .id = "year") %>%
  dplyr::mutate(year = stringr::str_replace_all(fils[as.numeric(year)],
                                                "^readings/|\\.csv$", ""))

All three versions do the same thing, and each tribe understands each idiom.
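
If the file name is the only place the year lives, one variation is to derive it from the file name alone rather than from the full path, so the code keeps working if the directory is renamed. What follows is a minimal base R sketch, assuming file names of the form `<year>.csv`. The helper name `read_with_year` and the scratch directory are hypothetical, introduced only for illustration.

```r
# Assumption: a scratch directory holding two tiny "<year>.csv" files
d <- file.path(tempdir(), "readings_year")
dir.create(d, showWarnings = FALSE)
write.csv(head(mtcars, 3), file.path(d, "1970.csv"), row.names = FALSE)
write.csv(head(mtcars, 3), file.path(d, "1971.csv"), row.names = FALSE)

fils <- list.files(d, pattern = "\\.csv$", full.names = TRUE)

# Hypothetical helper: read one file and tag it with the year from its name
read_with_year <- function(path) {
  f <- read.csv(path, stringsAsFactors = FALSE)
  # basename() drops the directory; file_path_sans_ext() drops ".csv"
  f$year <- as.integer(tools::file_path_sans_ext(basename(path)))
  f
}

with_year <- do.call(rbind, lapply(fils, read_with_year))
table(with_year$year)  # three rows for each of 1970 and 1971
```

Converting to integer is a choice, not a requirement. Drop `as.integer()` to keep the year as character, which is what the snippets above produce.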

The data.table and tidyverse versions get you much faster file reading and the ability to “fill” missing columns, which is another common slurping task. You can hack something together in base R to do column fills (you’ll find a few StackOverflow answers that accomplish such a task) but you will likely choose one of the other idioms for that and become equally comfortable in that new idiom.
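
To make the “hack something together” option concrete, here is one possible base R sketch of a column fill. The helper name `rbind_fill` and the two small data frames are made up for this example; this is one of several ways to do it, not a canonical recipe.

```r
# Two data frames with partially overlapping columns (made-up example data)
a <- data.frame(id = 1:2, temp = c(20.5, 21.0))
b <- data.frame(id = 3:4, humidity = c(40L, 42L))

# Hypothetical helper: rbind a list of data frames, filling gaps with NA
rbind_fill <- function(dfs) {
  all_cols <- unique(unlist(lapply(dfs, names)))
  padded <- lapply(dfs, function(df) {
    # Add each absent column as an all-NA column
    for (col in setdiff(all_cols, names(df))) {
      df[[col]] <- rep(NA, nrow(df))
    }
    # Put the columns in one common order so rbind() lines them up
    df[all_cols]
  })
  do.call(rbind, padded)
}

filled <- rbind_fill(list(a, b))
filled  # columns id, temp, humidity, with NA where a source lacked a column
```

By comparison, `data.table::rbindlist()` has a `fill = TRUE` argument and `dplyr::bind_rows()` fills missing columns with `NA` by default, which is a large part of why those idioms are attractive for this task.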

There are multiple ways to further extend the slurping example, but that’s not the point of the post.

Each set of snippets contains 100% valid R code. They accomplish the task and are idiomatic for each tribe. Despite what any “mil gun feos turrach na latsa” experts’ exchange would try to tell you, the best idiom is the one that works for you/you & your collaborators and the one that gets you to the real work — data analysis — in the most straightforward & reproducible way possible (for you).

Idiomatic does not mean there’s only a singular One, True Way™, and I think a whole host of us forget that at times.

Write good, clean, error-free, reproducible code.

Choose idioms that work best for you and your collaborators.

Adapt when necessary.
