Select columns from a data frame

[This article was first published on Quantargo Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

To select only a specific set of interesting data frame columns dplyr offers the select() function to extract columns by names, indices and ranges. You can even rename extracted columns with select().

  • Learn to use the select() function
  • Select columns from a data frame by name or index
  • Rename columns from a data frame
select(my_data_frame, column_one, column_two, ...)
select(my_data_frame, new_column_name = current_column, ...)
select(my_data_frame, column_start:column_end)
select(my_data_frame, index_one, index_two, ...)
select(my_data_frame, index_start:index_end)

Selecting by name

select(my_data_frame, column_one, column_two, ...)
select(my_data_frame, new_column_name = current_column, ...)
select(my_data_frame, column_start:column_end)
select(my_data_frame, index_one, index_two, ...)
select(my_data_frame, index_start:index_end)

In this chapter we will have a look at the pres_results dataset from the politicaldata package. It contains data about US presidential elections since 1976, converted to a Tibble for nicer printing.

# A tibble: 561 x 6
   year state total_votes   dem   rep   other
               
1  1976 AK         123574 0.357 0.579 0.0549 
2  1976 AL        1182850 0.557 0.426 0.0163 
3  1976 AR         767535 0.650 0.349 0.00134
# … with 558 more rows

For this example, we will have a look at the number of total votes in different states at different elections. Since we are only interested in the number of people who voted we would like to create a custom version of the pres_results data frame that only contains the columns year, state and total_votes. For such filtering, we can use the select() fiction from the dplyr package.

The select() function takes a data frame as an input parameter and lets us decide which of the columns we want to keep from it. The output of the function is a data frame with all rows, but containing only the columns we explicitly select.

We can reduce our dataset to only year, state and total_votes in the following way:

select(pres_results, year, state, total_votes)
# A tibble: 561 x 3
   year state total_votes
          
1  1976 AK         123574
2  1976 AL        1182850
3  1976 AR         767535
# … with 558 more rows

As the first parameter we passed the pres_results data frame, as the remaining parameters we passed the columns we want to keep to select().

Apart from keeping the columns we want, the select() function also keeps them in the same order as we specified in the function parameters.

If we change the order of the parameters when we call the function, the columns of the output change accordingly:

select(pres_results, total_votes, year, state)
# A tibble: 561 x 3
  total_votes  year state
          
1      123574  1976 AK   
2     1182850  1976 AL   
3      767535  1976 AR   
# … with 558 more rows

Exercise: Life expectancy in Austria

The gapminder_austria dataset contains information about the economic and demographic change in Austria over the last decades. To inspect how the life expectancy in Austria changed over time, create a subset of the tibble that contains only the necessary columns for this task:

  1. Use the dplyr select() function and define gapminder_austria as the input tibble.
  2. Keep only the columns year and lifeExp in the output dataset.

Note that the dplyr package is already loaded.

Start Exercise

Renaming columns

select(my_data_frame, column_one, column_two, ...)
select(my_data_frame, new_column_name = current_column, ...)
select(my_data_frame, column_start:column_end)
select(my_data_frame, index_one, index_two, ...)
select(my_data_frame, index_start:index_end)

In addition to defining the columns we want keep, we can also rename them. To do this, we need to set the new column name inside the select() function using the command

new_column_name = current_column

In the following example, we select the columns year, state and total_votes but rename the year column to Election in the output:

select(pres_results, Election = year, state, total_votes)
# A tibble: 561 x 3
  Election state total_votes
             
1     1976 AK         123574
2     1976 AL        1182850
3     1976 AR         767535
# … with 558 more rows

Exercise: Rename columns

The gapminder_india dataset contains information about the economic and demographic change in India over the last decades. Inspect how the population in India changed over time:

  1. Use the select() function and define gapminder_india as the input tibble.
  2. Keep only the columns year and pop.
  3. Rename the pop column to population in the output tibble.

Note that the dplyr package is already loaded.

Start Exercise

Selecting by name range

select(my_data_frame, column_one, column_two, ...)
select(my_data_frame, new_column_name = current_column, ...)
select(my_data_frame, column_start:column_end)
select(my_data_frame, index_one, index_two, ...)
select(my_data_frame, index_start:index_end)

When we use the select() function and define the columns we want to keep, dplyr does not actually use the name of the columns but the index of the columns in the data frame. This means, when we define the first three columns of the pres_results data frame, year, state and total_votes, dplyr converts these names to the index values 1, 2 and 3. We can therefore also use the name of the columns, apply the : operator and define ranges of columns, that we want to keep:

select(pres_results, year:total_votes)
# A tibble: 561 x 3
   year state total_votes
          
1  1976 AK         123574
2  1976 AL        1182850
3  1976 AR         767535
# … with 558 more rows

What the year:total_votes does, can be translated to 1:3, which is simply creating a vector of numerical values from 1 to 3. Then, the select() function takes the pres_results data frame and outputs a subset of it, keeping only the first three columns.

Exercise: Select a name range

The gapminder_europe_2007 dataset contains economic and demographic information about European countries for the year 2007:

# A tibble: 30 x 6
  country continent  year lifeExp      pop gdpPercap
                      
1 Albania Europe     2007    76.4  3600523     5937.
2 Austria Europe     2007    79.8  8199783    36126.
3 Belgium Europe     2007    79.4 10392226    33693.
# … with 27 more rows

Create a subset of the tibble and compare the life expectancy in different European countries for the year 2007:

  1. Apply the select() function on the gapminder_europe_2007 tibble.
  2. Use the : operator and select the columns from country to lifeExp.

Note that the dplyr package is already loaded.

Start Exercise

Select() by indices

select(my_data_frame, column_one, column_two, ...)
select(my_data_frame, new_column_name = current_column, ...)
select(my_data_frame, column_start:column_end)
select(my_data_frame, index_one, index_two, ...)
select(my_data_frame, index_start:index_end)

The select() function can be used with column indices as well. Instead of using names we need to specify the columns we want to select by their indices. Compared to other programming languages the indexing in R starts with one instead of zero. To select the first, fourth and fifth column from the pres_results dataset we can write

select(pres_results, 1,4,5)
# A tibble: 561 x 3
   year   dem   rep
    
1  1976 0.357 0.579
2  1976 0.557 0.426
3  1976 0.650 0.349
# … with 558 more rows

Similarly to defining ranges of columns using their names, we can define ranges (or vectors) of index values instead:

select(pres_results, 1:3)
# A tibble: 561 x 3
   year state total_votes
          
1  1976 AK         123574
2  1976 AL        1182850
3  1976 AR         767535
# … with 558 more rows

Exercise: Select by indices

The gapminder_europe_2007 dataset contains economic and demographic information about European countries for the year 2007.

# A tibble: 30 x 6
  country continent  year lifeExp      pop gdpPercap
                      
1 Albania Europe     2007    76.4  3600523     5937.
2 Austria Europe     2007    79.8  8199783    36126.
3 Belgium Europe     2007    79.4 10392226    33693.
# … with 27 more rows

Create a subset of the dataset and compare the GDP per capita of the European countries for the year 2007:

  1. Apply the select() function on the gapminder_europe_2007 tibble.
  2. Keep the columns country and gdpPercap, but use only the index of the columns (1and 6) for this step.

Note that the dplyr package is already loaded.

Start Exercise

Select columns from a data frame is an excerpt from the course Introduction to R, which is available for free at quantargo.com

VIEW FULL COURSE

To leave a comment for the author, please follow the link and comment on their blog: Quantargo Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)