Top 10 dplyr Functions – Data Analysis Made Easy

[This article was first published on r – Better Data Science, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I’ve been a Python fanboy for the last couple of years. The language is excellent for data science, and being a general-purpose language makes it that much easier to work on non-data-science parts of data science. But can R’s dplyr beat Python’s pandas in the data exploration and preparation domain? Yes, it can.

Today we’ll see how to use 10 of the most common dplyr functions. That’s not all the package has to offer, so refer to this link for a complete list. We’ll perform the entire analysis with the Gapminder dataset, available directly in R.

Here’s how to import all the packages we’ll need today:

library(dplyr)
library(gapminder)

Let’s begin the exploration from the next section.

select

The select command is used when we want to, well, select one or multiple columns. Think of it as a SELECT statement in SQL, where we can pick only columns of interest, and leave others untouched.

This is the first way of using select – by listing all the columns we want to keep:

gapminder %>% 
  select(country, year, lifeExp)

select

The other method is by specifying columns we want to remove – done with the - prefix:

gapminder %>% 
  select(-continent)

select 2

filter

The filter command is equivalent to the WHERE keyword in SQL, as we can fetch only the subset we’re interested in. There are multiple ways to use this command, and all are shown below.

Let’s keep only those rows where the year is 1952:

gapminder %>% 
  filter(year == 1952)

filter 1

Let’s add another filter – the year is still 1952 but we want only the data for Croatia:

gapminder %>%
  filter(year == 1952, country == 'Croatia')

filter 2

That’s great, but how can we filter by multiple years. The %in% operator comes in handy here:

gapminder %>%
  filter(year %in% c(1952, 1957), country == 'Croatia')

filter 3

arrange

The arrange operator is used to sort the data by individual columns, either in ascending or descending order. The ascending ordering is the default. It is equivalent to the ORDER BY keyword in SQL.

Let’s sort the Gapminder dataset by population in ascending order:

gapminder %>% 
  arrange(pop)

arrange 1

Let’s do the same in descending order:

gapminder %>% 
  arrange(desc(pop))

arrange 2

distinct

This operator is used to find unique elements of a given attribute. It doesn’t need any further clarification, so let’s get straight to the code.

Here’s how we can see the unique continents in the dataset:

gapminder %>% 
  distinct(continent)

distinct

mutate

The mutate operator is used to create new columns. All of the other columns stay intact during this operation.

Let’s use it to calculate the GDP of a country by multiplying population by the GDP per capita:

gapminder %>%
  mutate(totalGdp = pop * gdpPercap)

mutate

transmute

This operator is nearly identical to mutate, the only difference being that transmute doesn’t keep the other attributes – only the new one is kept.

We’ll use it once again to calculate total GDP and then compare results with the mutate operator:

gapminder %>%
  transmute(totalGdp = pop * gdpPercap)

transmute

As promised, only the new totalGdp column is shown.

rename

As the name suggests, this operator is used for renaming. Be careful about the syntax, as we must specify the new name first.

Let’s rename a couple of columns – lifeExp to lifeExpectancy and pop to population:

gapminder %>% 
  rename(lifeExpectancy = lifeExp, population = pop)

rename

summarize

The summarize or summarise operators are useful for creating summary statistics on a dataset. If no grouping operations are performed, the result is a single value. 

For example, let’s say we want to find overall average life expectancy:

gapminder %>% 
  summarize(meanLifeExp = mean(lifeExp))

summarise 1

We can chain other operators if we want to, let’s say find average life expectancy in 2007:

gapminder %>% 
  filter(year == 2007) %>% 
  summarize(meanLifeExp = mean(lifeExp))

summarise 2

That’s great, but summarize is the most useful when used with grouping operations. Let’s discuss those next.

group_by

We can use grouping operators to utilize the full potential of previously discussed operators, such as summarize. This one is equivalent to the GROUP_BY keyword in SQL, so you’re probably already familiar with it.

Suppose we want to find the average life expectancy per continent – impossible to do without a grouping operator:

gapminder %>% 
  group_by(continent) %>% 
  summarize(meanLifeExp = mean(lifeExp))

group by

We can chain multiple operators together, and perform groupings on multiple columns. For example, let’s see how to find an average life expectancy per continent, grouped further by years (only 2002 and 2007):

gapminder %>% 
  filter(year %in% c(2002, 2007)) %>% 
  group_by(continent, year) %>% 
  summarize(meanLifeExp = mean(lifeExp))

group by 2

if_else

This is not your traditional operator, as we use it inside other operators, such as mutate. It lets us perform a vectorized if-else statement without writing multiple lines of code.

Let’s see how we can group by continent and year (only 2002 and 2007) and calculate the average life expectancy, and then create another column which has the value of ‘Y’ if the average life expectancy is greater than 70, and ‘N’ otherwise:

gapminder %>% 
  filter(year %in% c(2002, 2007)) %>% 
  group_by(continent, year) %>% 
  summarize(meanLifeExp = mean(lifeExp)) %>% 
  mutate(meanLifeExpOver70 = if_else(meanLifeExp > 70, 'Y', 'N'))

if else

Parting words

At the end of the day, anything done in dplyr can be done in pandas. But you have to admit – dplyr does it so effortlessly and in a much simpler and cleaner way. Comparison between the two is a potential topic for another time, so let me know if you’re interested in that.

Also, let me know which one you prefer and why – and try to keep your opinion unbiased.

Join my private email list for more helpful insights.

The post Top 10 dplyr Functions – Data Analysis Made Easy appeared first on Better Data Science.

To leave a comment for the author, please follow the link and comment on their blog: r – Better Data Science.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)