Window Functions in SQL vs. R

[This article was first published on Pete Talbert, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Both MS SQL Server and R’s dplyr package offer window functions for different operations on a dataset. What are window functions? Window functions operate on a set of rows or “window” and return a value for each row, whether that’s something like a row number or percentile rank, or an aggregate result from a min, max, average, or sum.

The most common problem I have that window functions solve is when I have a set of records that I need to unduplicate. Let’s say I have student enrollment records where an individual student may have more than one enrollment; and let’s say I want to grab only the latest enrollment record from each student. In this case, each student would be a partition or window in which I could create a function to label the latest record. Once the latest record is identified, I could then pull just that single row.

SQL

How would I do this in SQL? Below I connect to my MS SQL Server instance and print a very simple table of student enrollment records. You can see there are more than one per studentID. How would I grab the latest from each per the date column?

library(tidyverse)
library(DBI)
library(odbc)
library(knitr)

con <- dbConnect(odbc::odbc(), "nicodemus", database = "Demo")

enroll <- as_tibble(dbGetQuery(con, "select * from Demo.dbo.tbl_Enroll"))

enroll
## # A tibble: 11 x 3
##    studentID enrollmentType date               
##                                
##  1 1000      D              2019-02-01 00:00:00
##  2 1000      P              2019-01-20 00:00:00
##  3 1001      D              2019-02-02 00:00:00
##  4 1001      P              2018-12-01 00:00:00
##  5 1001      P              2018-12-30 00:00:00
##  6 1002      D              2019-01-15 00:00:00
##  7 1002      D              2018-11-30 00:00:00
##  8 1003      P              2018-10-31 00:00:00
##  9 1003      P              2019-02-28 00:00:00
## 10 1003      D              2018-09-30 00:00:00
## 11 1003      P              2019-01-01 00:00:00

In the SQL below, I am using ROW_NUMBER() to make this happen. I define the window by using the partition by clause and then I use order by to tell SQL in what I want the rows numbered. Here, I want desc because I want the rows labeled 1 to be the latest enrollment record per student.

select *
  , ROW_NUMBER() over (partition by studentID order by [date] desc) as rowNum --SQL window function syntax
from Demo.dbo.tbl_Enroll
## # A tibble: 11 x 4
##    studentID enrollmentType date                rowNum         
##                                 
##  1 1000      D              2019-02-01 00:00:00 1              
##  2 1000      P              2019-01-20 00:00:00 2              
##  3 1001      D              2019-02-02 00:00:00 1              
##  4 1001      P              2018-12-30 00:00:00 2              
##  5 1001      P              2018-12-01 00:00:00 3              
##  6 1002      D              2019-01-15 00:00:00 1              
##  7 1002      D              2018-11-30 00:00:00 2              
##  8 1003      P              2019-02-28 00:00:00 1              
##  9 1003      P              2019-01-01 00:00:00 2              
## 10 1003      P              2018-10-31 00:00:00 3              
## 11 1003      D              2018-09-30 00:00:00 4

How would I then grab the latest record? Create an outer query that just selects where rowNum = 1.

select * from (
  select *
    , ROW_NUMBER() over (partition by studentID order by [date] desc) as rowNum
  from Demo.dbo.tbl_Enroll
) t
where rowNum = 1 --outer query to get just the latest record per student.
## # A tibble: 4 x 4
##   studentID enrollmentType date                rowNum         
##                                
## 1 1000      D              2019-02-01 00:00:00 1              
## 2 1001      D              2019-02-02 00:00:00 1              
## 3 1002      D              2019-01-15 00:00:00 1              
## 4 1003      P              2019-02-28 00:00:00 1

R

R’s dplyr package also has window functions to do the same operation. What happens when I just apply rownumber() to the data?

enroll %>% 
  mutate(rowNum = row_number())
## # A tibble: 11 x 4
##    studentID enrollmentType date                rowNum
##                                  
##  1 1000      D              2019-02-01 00:00:00      1
##  2 1000      P              2019-01-20 00:00:00      2
##  3 1001      D              2019-02-02 00:00:00      3
##  4 1001      P              2018-12-01 00:00:00      4
##  5 1001      P              2018-12-30 00:00:00      5
##  6 1002      D              2019-01-15 00:00:00      6
##  7 1002      D              2018-11-30 00:00:00      7
##  8 1003      P              2018-10-31 00:00:00      8
##  9 1003      P              2019-02-28 00:00:00      9
## 10 1003      D              2018-09-30 00:00:00     10
## 11 1003      P              2019-01-01 00:00:00     11

Great, it gave the row number of each row in the dataset. What happens when I put the column I want partitioned inside row_number()?

enroll %>% 
  mutate(rowNum = row_number(studentID))
## # A tibble: 11 x 4
##    studentID enrollmentType date                rowNum
##                                  
##  1 1000      D              2019-02-01 00:00:00      1
##  2 1000      P              2019-01-20 00:00:00      2
##  3 1001      D              2019-02-02 00:00:00      3
##  4 1001      P              2018-12-01 00:00:00      4
##  5 1001      P              2018-12-30 00:00:00      5
##  6 1002      D              2019-01-15 00:00:00      6
##  7 1002      D              2018-11-30 00:00:00      7
##  8 1003      P              2018-10-31 00:00:00      8
##  9 1003      P              2019-02-28 00:00:00      9
## 10 1003      D              2018-09-30 00:00:00     10
## 11 1003      P              2019-01-01 00:00:00     11

That did nothing. This is without reading the documentation obviously. What that did was just give a row number for the studentID vector regardless of duplication; I could have put any of the columns inside the function and it would return the same result. Not what we want. What we want to do is put studentID in a group_by clause before mutating.

enroll %>% 
  group_by(studentID) %>%  #adding the group_by clause to partition by studentID.
  mutate(rowNum = row_number())
## # A tibble: 11 x 4
## # Groups:   studentID [4]
##    studentID enrollmentType date                rowNum
##                                  
##  1 1000      D              2019-02-01 00:00:00      1
##  2 1000      P              2019-01-20 00:00:00      2
##  3 1001      D              2019-02-02 00:00:00      1
##  4 1001      P              2018-12-01 00:00:00      2
##  5 1001      P              2018-12-30 00:00:00      3
##  6 1002      D              2019-01-15 00:00:00      1
##  7 1002      D              2018-11-30 00:00:00      2
##  8 1003      P              2018-10-31 00:00:00      1
##  9 1003      P              2019-02-28 00:00:00      2
## 10 1003      D              2018-09-30 00:00:00      3
## 11 1003      P              2019-01-01 00:00:00      4

Sweet! But wait, this just listed the row number for each studentID in whatever order it came in. We want it to list 1 for the lastest record per the date column.

enroll %>% 
  group_by(studentID) %>%  
  mutate(rowNum = row_number(desc(date))) %>% #specify how to list row numbers here; desc for latest first.
  filter(rowNum == 1) #lastly, filter to just the latest record.
## # A tibble: 4 x 4
## # Groups:   studentID [4]
##   studentID enrollmentType date                rowNum
##                                 
## 1 1000      D              2019-02-01 00:00:00      1
## 2 1001      D              2019-02-02 00:00:00      1
## 3 1002      D              2019-01-15 00:00:00      1
## 4 1003      P              2019-02-28 00:00:00      1

Nice! I really wrote this to learn the nuances for how to do the same operation I do in SQL in R for my particular use case. The vignette provided by the tidyverse team did not really suit my needs in this instance, but others my find it helpful as further reading. Cheers!

To leave a comment for the author, please follow the link and comment on their blog: Pete Talbert.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)