Cutting down code in dplyr and data.table

Posted on August 11, 2021 by HighlandR in R bloggers | 0 Comments

[This article was first published on HighlandR, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

After a very long , R free hiatus (I don’t recommend it) I’m back on the R train, destination unknown.

I had a bit of spare time last night, and remembered I had not done a PreppinData challenge for a while.

In fact, the last one I did was week 8, and they are now on week 31 – the exact details of the challenge can been seen below.

PreppinData Week 31

In summary, we had to

import the data
filter to only include rows where products were sold
calculate sales per item
pivot the results
return these, and totals per Store

Now I’ve only used R a handful of times since the pandemic started (work has mainly been SQL, and lots of it, plus some BI tools thrown in to the mix) and this looked quite easy, so I quickly hopped on my laptop and fired up dplyr and chums (readr and tidyr)

Here is my first solution.

library(here)
library(dplyr)
library(readr)
library(tidyr)

setwd(here("2021-31"))

sales <- read_csv("Input.csv", 
                 col_types = cols(Date = col_date(format = "%d/%m/%Y")))
                                                                                  
sales <- sales %>% 
    filter( Status != "Return to Manufacturer") # we only want 'Sold'

totals <- sales %>% 
    group_by(Store) %>% 
    summarise(`Items Sold per Store` = sum(`Number of Items`)) # summary table  for joining total sales to pivot

sales <- sales %>% 
    select(-c('Status','Date')) %>% 
    group_by( Store, Item) %>% 
    summarise(N = sum(`Number of Items`)) %>% 
    spread(key = 'Item', value = 'N') %>% 
    left_join(totals, by = 'Store') %>% 
    select(`Items Sold per Store`, Wheels, Tyres, Saddles, Brakes, Store) # do your own Ross Geller impression here

data.table::fwrite(sales,"sales.csv")

Well, there’s nothing earth shattering in there, apart from me cheating right at the end and using data.table’s fwrite function to write out the final results.

So then I thought I’d give data.table a bash, to see how much I’d remembered.

library(here)
library(data.table)
setwd(here("2021-31"))

DT <- fread("Input.csv")
DT[, Date := as.IDate(DT$Date, format = "%d/%m/%Y")]  # specify date format
DT <- DT[Status %chin% 'Sold',][]  # filter for 'Sold'

Totals <- copy(DT)
Totals[,.(Store,`Number of Items`)][]
Totals <- Totals[, .('Total' = sum(`Number of Items`)), by = Store][]  # Summary table for joining

DT <- DT[,-c('Status', 'Date')
         ][,.(Sales = sum(`Number of Items`)), by = .(Store,Item)][]
DT <- DT[,.(Store, Item,Sales)][]  # drop columns, calculate totals, select what we need

DT <- dcast(DT, Store ~ Item, value.var = 'Sales', fun.aggregate = sum) # pivot wider
DT <- Totals[DT,  on = 'Store'][]  # join the totals to the pivot table
setnames(DT, old = 'Total', new = 'Items Sold per Store') # rename
DT[,.(`Items Sold per Store`, Wheels, Tyres, Saddles, Brakes, Store)][] # reorder once more

fwrite(DT,"DT.csv")

This was a bit unusual in that I was doing lot’s of assigning of DT to a new DT - normally, I chain commands. In both versions, the approach is the same, I’m reading in the data, removing the returns, creating a summary table to join the results to, and then removing columns I don’t need prior to pivotting. Then I join the summary table back up and reorder the columns, so that I have the total Number of Items sold, each Item sales as a column, and the Store name.

So, that was that, and as usual, I tweeted the results, tagged the PreppinData crew and that was the end of it.

Or so I thought.

My brain must have been doing some thinking, because throughout today, I started getting ideas about better (well, shorter) ways to do this. I realised I didn’t need the summary table - why didn’t I just pivot and sum the columns at the end?

And then I thought about fread(), and was certain there was a way to filter rows as the data was being read in. And I thought it would be worth checking to see if I could specify what columns were imported as well. And it turns out you can do all these things.

Here’s my improved, reduced data.table code, (leaving off the working directory stuff).

DT <- fread(cmd = paste("grep", " Sold ", "Input.csv"), drop = c(1,4), col.names = c("Store","Item","Sales")) 
DT <- dcast(DT, Store ~ Item, value.var = 'Sales', fun.aggregate = sum) 
DT[,'Items Sold per Store' := rowSums(.SD), by = Store] # Totals by Store
DT[,.(`Items Sold per Store`, Wheels, Tyres, Saddles, Brakes, Store)][]
fwrite(DT,"DT2.csv")

Wow, this is much more concise, just 4 lines doing the heavy lifting.

Let’s go through it:

DT <- fread(cmd = paste("grep", " Sold ", "Input.csv"), drop = c(1,4), col.names = c("Store","Item","Sales"))

This is a killer line, the key being the cmd argument. It uses a shell command to filter the data prior to it being read in - it searches each row for the word ‘Sold’, and discards any that don’t contain it. fread also allows us to provide vectors to select, or in this case drop columns of interest. Here, I drop the first and fourth columns ( ‘Date’ and ‘Status’ respectively - the dates are irrelevant, and we know that everything is sold now, so we don’t need to keep the ‘Status’). Finally, we provide names for our remaining three columns.

I didn’t want to use column numbers to drop the columns, but I couldn’t get it to work by providing the actual names - it kept saying they couldn’t be found. If I left off the grep command, and just used drop or select, then I could provide the relevant names and things worked. I haven’t done anything like this before in R, so it’s really quite exciting to see it workng

DT <- dcast(DT, Store ~ Item, value.var = 'Sales', fun.aggregate = sum)

Now we pivot our data wider, by Store and Item, and sum the values of the Sales column

DT[,'Items Sold per Store' := rowSums(.SD), by = Store] # Totals by Store

Another cool line, summing over all the columns in .SD, which, because we are grouping by Store, effectively means all the remaining Item columns. We don’t need to specify which columns, or that they have to be numeric, it just works, conveniently, in this case.

DT[,.(`Items Sold per Store`, Wheels, Tyres, Saddles, Brakes, Store)][]

Finally, we just grab the columns in our desired final order. I could also have used setcolorder for this, but it’s slightly more typing!

So that was it for data.table. But of course, at this point, I’d realised I could cut some steps out of the tidyverse approach too:

sales <- read_csv("Input.csv") %>% 
    filter(Status == "Sold") %>% 
    select(-c("Date", "Status")) %>% 
    group_by( Store, Item) %>% 
    summarise(N = sum(`Number of Items`)) %>% 
    spread(key = 'Item', value = 'N') %>% 
    mutate(`Items Sold per Store` = sum(c_across(where(is.numeric)))) %>% 
    select(`Items Sold per Store`, Wheels, Tyres, Saddles, Brakes, Store) %>% 
    data.table::fwrite("tidy2.csv")

The full code, with outputs, is available here: https://github.com/johnmackintosh/PreppinData/tree/main/2021-31

If I had to pick my favourite thing from these 4 approaches, it’s definitely the ability to filter rows when reading in the .csv file AND to specify which columns to keep or drop. It’s not a new feature in data.table, but it is, like a lot of features in the package, not very well known. I’m not sure if the reason I was able to use it on Windows was because I had RTools installed, so if you are a Windows user, and it doesn’t work for you, that might be a reason why. Or it might have nothing to do it with it - I haven’t a clue. I know that Linux and (probably?) Mac have these sort of tools built in.

So, quite a nice way to ease myself back into the world of R - more of which in a future post. Until then, maybe you could try fread-ing in some large datasets of your own, and experimenting with different ways of filtering them up front.

‘Bye for now!

To leave a comment for the author, please follow the link and comment on their blog: HighlandR.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

Cutting down code in dplyr and data.table

Related

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)