Filter data frame rows

[This article was first published on Quantargo Blog, and kindly contributed to R-bloggers.]

We often want to operate only on a specific subset of rows of a data frame. The dplyr filter() function provides a flexible way to extract the rows of interest based on one or more conditions.

  • Use the filter() function to keep only the rows of a data frame that fulfill a specified condition
  • Filter a data frame by multiple conditions
filter(my_data_frame, condition)
filter(my_data_frame, condition_one, condition_two, ...)

The filter() function

The filter() function takes a data frame and one or more filtering expressions as input. It keeps only the rows that satisfy these expressions, which act as rules deciding whether a row is kept or dropped. In most cases, the expressions are built from relational operators. As an example, we could filter the pres_results data frame and keep only the rows where the state variable is equal to "CA" (California):

filter(pres_results, state == "CA")
# A tibble: 11 x 6
    year state total_votes   dem   rep  other
 1  1976 CA        7803770 0.480 0.497 0.0230
 2  1980 CA        8582938 0.359 0.527 0.114 
 3  1984 CA        9505041 0.413 0.575 0.0122
 4  1988 CA        9887065 0.476 0.511 0.0131
 5  1992 CA       11131721 0.460 0.326 0.213 
 6  1996 CA       10019469 0.511 0.382 0.107 
 7  2000 CA       10965822 0.534 0.417 0.0490
 8  2004 CA       12421353 0.543 0.444 0.0117
 9  2008 CA       13561900 0.610 0.370 0.0188
10  2012 CA       13038547 0.602 0.371 0.0246
11  2016 CA       14181595 0.617 0.316 0.0581

In the output, we can compare the election results in California for different years.
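To experiment with this kind of filter without the full data set, a minimal sketch can rebuild a few of the rows printed in this article by hand. The name pres_small and the choice of rows are assumptions made for illustration; they are not part of the original data package. The sketch also shows that, for this data, base R logical indexing picks the same rows.

```r
library(dplyr)

# Hypothetical mini version of pres_results, built from a few rows
# printed in this article (assumption: only three columns are kept)
pres_small <- tibble(
  year  = c(1976, 2016, 1984, 2016),
  state = c("CA", "CA", "DC", "DC"),
  dem   = c(0.480, 0.617, 0.854, 0.905)
)

# Keep only the rows where state equals "CA"
ca_rows <- filter(pres_small, state == "CA")

# The same selection written with base R logical indexing
ca_rows_base <- pres_small[pres_small$state == "CA", ]

print(ca_rows)
```

One difference is worth keeping in mind: filter() drops rows for which the condition evaluates to NA, whereas base R logical indexing returns a row of NA values for them.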

As another example, we could filter the pres_results data frame and keep only those rows where the dem variable (the share of votes for the Democratic Party, expressed as a proportion between 0 and 1) is greater than 0.85:

filter(pres_results, dem > 0.85)
# A tibble: 7 x 6
   year state total_votes   dem    rep   other
1  1984 DC         211288 0.854 0.137  0.00886
2  1996 DC         185726 0.852 0.0934 0.0513 
3  2000 DC         201894 0.852 0.0895 0.0563 
4  2004 DC         227586 0.892 0.0934 0.0125 
5  2008 DC         265853 0.925 0.0653 0.00582
6  2012 DC         293764 0.909 0.0728 0.0155 
7  2016 DC         312575 0.905 0.0407 0.0335 

The output lists every state and election year in which the Democratic Party received more than 85% of the votes. All seven rows belong to the District of Columbia (Washington, D.C.), which suggests that the Democratic Party has a solid voter base there.
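The greater-than comparison is only one of several relational operators that can appear in a filtering expression. The following sketch tries >=, != and %in% on a hand-built data frame. The name pres_small is a hypothetical stand-in for pres_results and holds only a few rows copied from the outputs above.

```r
library(dplyr)

# Hypothetical mini version of pres_results (assumption: values copied
# from rows printed in this article)
pres_small <- tibble(
  year  = c(1976, 2016, 1984, 2016),
  state = c("CA", "CA", "DC", "DC"),
  dem   = c(0.480, 0.617, 0.854, 0.905)
)

# Rows where the Democratic vote share is at least 0.6
high_dem <- filter(pres_small, dem >= 0.6)

# Rows for every state except California
not_ca <- filter(pres_small, state != "CA")

# Rows whose year matches one of several values
selected_years <- filter(pres_small, year %in% c(1976, 1984))
```

Character variables such as state can be compared in the same way as numeric ones, as the != example shows.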

Exercise: Use filter() with a single expression

The gapminder dataset contains economic and demographic data about various countries since 1952.

Inspect the data for a single year by using the filter() function.

  1. Apply the filter() function on the gapminder dataset
  2. Keep only the rows where the year is equal to 2007

Note that the dplyr and gapminder packages are already loaded.

Quiz: filter() Function

Which of the following statements about the filter() function are correct?

  • Relational operators, such as == or >, are frequently part of the filtering expressions.
  • The filter() function is part of the dplyr package.
  • Only numeric variables can be filtered.
  • The filter() function works only on data frames, not on tibbles.

Multiple filter expressions

The filter() function can also take multiple filtering rules as input. The rules are combined as if they were joined with the & operator: a row is included in the output only if it fulfills all of them. In the following example, we filter the pres_results data frame for all rows where the state variable is equal to "CA" and the year variable is equal to 2016:

filter(pres_results, state == "CA", year == 2016)
# A tibble: 1 x 6
   year state total_votes   dem   rep  other
1  2016 CA       14181595 0.617 0.316 0.0581

We get a single row as output, containing the 2016 US presidential election results for California.
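To make the link between comma-separated rules and the & operator concrete, the sketch below runs both forms on a small hand-built data frame and, for contrast, shows the | operator, which keeps rows that satisfy at least one rule. The name pres_small is a hypothetical stand-in for pres_results, with values copied from a few rows printed in this article.

```r
library(dplyr)

# Hypothetical mini version of pres_results (assumption: values copied
# from rows printed in this article)
pres_small <- tibble(
  year  = c(1976, 2016, 1984, 2016),
  state = c("CA", "CA", "DC", "DC"),
  dem   = c(0.480, 0.617, 0.854, 0.905)
)

# Comma-separated rules: a row is kept only if both rules hold
both_comma <- filter(pres_small, state == "CA", year == 2016)

# The same filter written with an explicit &
both_and <- filter(pres_small, state == "CA" & year == 2016)

# For contrast: | keeps rows that satisfy at least one of the rules
either <- filter(pres_small, state == "CA" | year == 2016)
```

Here both_comma and both_and contain the same single row, while either also keeps the remaining California row and the 2016 row for DC.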

Exercise: Use filter() with multiple rules

The gapminder dataset contains economic and demographic data about various countries since 1952. Filter the tibble and inspect which countries had a life expectancy of over 80 years in the year 2007. The required packages are already loaded.

  1. Use the filter() function on the gapminder tibble.
  2. Keep all rows where the year variable is equal to 2007 and the life expectancy lifeExp is greater than 80.

Exercise

The gapminder dataset contains economic and demographic data about various countries since 1952. Filter the gapminder tibble and inspect which countries had a population of over 1,000,000,000 (one billion) in the year 2007. The required packages are already loaded.

  1. Use the filter() function on the gapminder tibble.
  2. Keep all rows where the year variable is equal to 2007 and the population pop is greater than 1000000000.

Filter data frame rows is an excerpt from the course Introduction to R, which is available for free at quantargo.com
