Hands-on dplyr tutorial for faster data manipulation in R
[This article was first published on R - Data School, and kindly contributed to R-bloggers.]
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I love dplyr. It’s my “go-to” package in R for data exploration, data manipulation, and feature engineering. I use dplyr because it saves me time: it performs very fast on data frames, but even more importantly, I can write dplyr code faster than base R code. Its syntax is intuitive and its functions are well named, so dplyr code is easy to read even if you didn’t write it.
dplyr is the “next iteration” of the plyr package (focusing on data frames, hence the “d”), and version 0.1 was released in January 2014. It’s being developed by Hadley Wickham (author of plyr, ggplot2, devtools, stringr, and many other R packages), so you know it’s a well-written, well-documented package.
Teaching dplyr using an R Markdown document
As one of the instructors for General Assembly’s 11-week Data Science course in Washington, DC, I had 30 minutes in class last week to talk about data manipulation in R, and chose to focus exclusively on dplyr. When putting together my presentation, I had a lot of great material to draw from:
- Official dplyr reference manual and vignettes on CRAN: Six substantial vignettes, and counting!
- Hadley’s July 2014 webinar about dplyr: Primarily a high-level overview
- Hadley’s dplyr tutorial at the useR! 2014 conference: In-depth tutorial with lots of example code
- dplyr GitHub repo and list of releases: Good for keeping up with the latest features and known issues
... (glimpse and summarise_each), and how to query a database using dplyr. I also compare many of the dplyr commands to the equivalent commands in base R. (Thanks to Hadley, because many of the examples I use are ones he wrote!)
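To give a flavor of the dplyr versus base R comparison, here is a minimal sketch of the same row filter written both ways. It is not taken from the tutorial: it assumes the built-in mtcars data frame as a stand-in dataset, and the variable names are illustrative only.

```r
library(dplyr)

# mtcars ships with R; it is used here only as a stand-in dataset (an assumption)
cars <- mtcars

# Base R: keep rows with 6 cylinders and mpg above 20
base_result <- cars[cars$cyl == 6 & cars$mpg > 20, ]

# dplyr: the same filter; comma-separated conditions are combined with AND
dplyr_result <- filter(cars, cyl == 6, mpg > 20)

# Both approaches select the same number of rows
print(nrow(base_result) == nrow(dplyr_result))
```

The dplyr version refers to columns by bare name, so the data frame name does not need to be repeated inside the condition.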
Watch the dplyr tutorial on YouTube
After presenting, I recorded the entire presentation as a YouTube video (embedded below), since I know it can be helpful to hear someone explaining code that is unfamiliar to you. It runs 39 minutes, but if you only want to watch a particular section, simply click the topic below and it will skip to that point in the video.
- Introduction to dplyr (starts at 0:00)
- Loading dplyr and the example dataset (starts at 2:29)
- Understanding “local data frames” (starts at 3:23)
- Verb #1: filter (starts at 5:17)
- Verb #2: select, plus contains, starts_with, ends_with, matches (starts at 7:54)
- Using chaining syntax for more readable code (starts at 9:34)
- Verb #3: arrange (starts at 12:53)
- Verb #4: mutate (starts at 13:55)
- Verb #5: summarise, plus group_by, summarise_each, n, n_distinct, tally (starts at 15:31)
- Window functions: min_rank, top_n, lag (starts at 26:47)
- Convenience functions: sample_n, sample_frac, glimpse (starts at 32:44)
- Connecting to databases (starts at 34:21)
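As a rough illustration of how the five verbs and the chaining syntax fit together, the sketch below strings them into one pipeline. It is not code from the video: it assumes the built-in mtcars data frame, and the column names hp_per_wt, n_cars, mean_mpg, and mean_hp_per_wt are made up for this example.

```r
library(dplyr)

# mtcars is a stand-in dataset (an assumption); the video uses its own example data
result <- mtcars %>%
  filter(hp > 100) %>%              # Verb 1: keep rows matching a condition
  select(mpg, cyl, hp, wt) %>%      # Verb 2: keep columns by name
  mutate(hp_per_wt = hp / wt) %>%   # Verb 4: add a derived column
  group_by(cyl) %>%                 # group rows before summarising
  summarise(                        # Verb 5: collapse each group to one row
    n_cars = n(),
    mean_mpg = mean(mpg),
    mean_hp_per_wt = mean(hp_per_wt)
  ) %>%
  arrange(desc(mean_mpg))           # Verb 3: sort rows, highest mean mpg first

print(result)
```

Each step takes the output of the previous one as its first argument, so the pipeline reads top to bottom in the order the operations happen, rather than inside out as nested function calls would.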