I’ve been a Python fanboy for the last couple of years. The language is excellent for data science, and being a general-purpose language makes it that much easier to work on non-data-science parts of data science. But can R’s
dplyr beat Python’s
pandas in the data exploration and preparation domain? Yes, it can.
Today we’ll see how to use 10 of the most common
dplyr functions. That’s not all the package has to offer, so refer to this link for a complete list. We’ll perform the entire analysis with the Gapminder dataset, available directly in R.
Here’s how to import all the packages we’ll need today:
Let’s begin the exploration from the next section.
The select command is used when we want to, well, select one or multiple columns. Think of it as a
SELECT statement in SQL, where we can pick only columns of interest, and leave others untouched.
This is the first way of using
select – by listing all the columns we want to keep:
gapminder %>% select(country, year, lifeExp)
The other method is by specifying columns we want to remove – done with the
gapminder %>% select(-continent)
The filter command is equivalent to the
WHERE keyword in SQL, as we can fetch only the subset we’re interested in. There are multiple ways to use this command, and all are shown below.
Let’s keep only those rows where the year is 1952:
gapminder %>% filter(year == 1952)
Let’s add another filter – the year is still 1952 but we want only the data for Croatia:
gapminder %>% filter(year == 1952, country == 'Croatia')
That’s great, but how can we filter by multiple years. The
%in% operator comes in handy here:
gapminder %>% filter(year %in% c(1952, 1957), country == 'Croatia')
arrange operator is used to sort the data by individual columns, either in ascending or descending order. The ascending ordering is the default. It is equivalent to the
ORDER BY keyword in SQL.
Let’s sort the Gapminder dataset by population in ascending order:
gapminder %>% arrange(pop)
Let’s do the same in descending order:
gapminder %>% arrange(desc(pop))
This operator is used to find unique elements of a given attribute. It doesn’t need any further clarification, so let’s get straight to the code.
Here’s how we can see the unique continents in the dataset:
gapminder %>% distinct(continent)
mutate operator is used to create new columns. All of the other columns stay intact during this operation.
Let’s use it to calculate the GDP of a country by multiplying population by the GDP per capita:
gapminder %>% mutate(totalGdp = pop * gdpPercap)
This operator is nearly identical to
mutate, the only difference being that
transmute doesn’t keep the other attributes – only the new one is kept.
We’ll use it once again to calculate total GDP and then compare results with the
gapminder %>% transmute(totalGdp = pop * gdpPercap)
As promised, only the new
totalGdp column is shown.
As the name suggests, this operator is used for renaming. Be careful about the syntax, as we must specify the new name first.
Let’s rename a couple of columns – lifeExp to lifeExpectancy and pop to population:
gapminder %>% rename(lifeExpectancy = lifeExp, population = pop)
summarise operators are useful for creating summary statistics on a dataset. If no grouping operations are performed, the result is a single value.
For example, let’s say we want to find overall average life expectancy:
gapminder %>% summarize(meanLifeExp = mean(lifeExp))
We can chain other operators if we want to, let’s say find average life expectancy in 2007:
gapminder %>% filter(year == 2007) %>% summarize(meanLifeExp = mean(lifeExp))
That’s great, but
summarize is the most useful when used with grouping operations. Let’s discuss those next.
We can use grouping operators to utilize the full potential of previously discussed operators, such as
summarize. This one is equivalent to the
GROUP_BY keyword in SQL, so you’re probably already familiar with it.
Suppose we want to find the average life expectancy per continent – impossible to do without a grouping operator:
gapminder %>% group_by(continent) %>% summarize(meanLifeExp = mean(lifeExp))
We can chain multiple operators together, and perform groupings on multiple columns. For example, let’s see how to find an average life expectancy per continent, grouped further by years (only 2002 and 2007):
gapminder %>% filter(year %in% c(2002, 2007)) %>% group_by(continent, year) %>% summarize(meanLifeExp = mean(lifeExp))
This is not your traditional operator, as we use it inside other operators, such as
mutate. It lets us perform a vectorized if-else statement without writing multiple lines of code.
Let’s see how we can group by continent and year (only 2002 and 2007) and calculate the average life expectancy, and then create another column which has the value of ‘Y’ if the average life expectancy is greater than 70, and ‘N’ otherwise:
gapminder %>% filter(year %in% c(2002, 2007)) %>% group_by(continent, year) %>% summarize(meanLifeExp = mean(lifeExp)) %>% mutate(meanLifeExpOver70 = if_else(meanLifeExp > 70, 'Y', 'N'))
At the end of the day, anything done in
dplyr can be done in
pandas. But you have to admit –
dplyr does it so effortlessly and in a much simpler and cleaner way. Comparison between the two is a potential topic for another time, so let me know if you’re interested in that.
Also, let me know which one you prefer and why – and try to keep your opinion unbiased.