Select columns from a data frame
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
To select only a specific set of interesting data frame columns dplyr offers the select()
function to extract columns by names, indices and ranges. You can even rename extracted columns with select()
.
- Learn to use the
select()
function - Select columns from a data frame by name or index
- Rename columns from a data frame
select(my_data_frame, column_one, column_two, ...) select(my_data_frame, new_column_name = current_column, ...) select(my_data_frame, column_start:column_end) select(my_data_frame, index_one, index_two, ...) select(my_data_frame, index_start:index_end)
Selecting by name
select(my_data_frame, column_one, column_two, ...) select(my_data_frame, new_column_name = current_column, ...) select(my_data_frame, column_start:column_end) select(my_data_frame, index_one, index_two, ...) select(my_data_frame, index_start:index_end)
In this chapter we will have a look at the pres_results
dataset from the politicaldata package. It contains data about US presidential elections since 1976, converted to a Tibble for nicer printing.
# A tibble: 561 x 6 year state total_votes dem rep other <dbl> <chr> <dbl> <dbl> <dbl> <dbl> 1 1976 AK 123574 0.357 0.579 0.0549 2 1976 AL 1182850 0.557 0.426 0.0163 3 1976 AR 767535 0.650 0.349 0.00134 # … with 558 more rows
For this example, we will have a look at the number of total votes in different states at different elections. Since we are only interested in the number of people who voted we would like to create a custom version of the pres_results
data frame that only contains the columns year
, state
and total_votes
. For such filtering, we can use the select()
fiction from the dplyr package.
The select()
function takes a data frame as an input parameter and lets us decide which of the columns we want to keep from it. The output of the function is a data frame with all rows, but containing only the columns we explicitly select.
We can reduce our dataset to only year
, state
and total_votes
in the following way:
select(pres_results, year, state, total_votes) # A tibble: 561 x 3 year state total_votes <dbl> <chr> <dbl> 1 1976 AK 123574 2 1976 AL 1182850 3 1976 AR 767535 # … with 558 more rows
As the first parameter we passed the pres_results
data frame, as the remaining parameters we passed the columns we want to keep to select()
.
Apart from keeping the columns we want, the select()
function also keeps them in the same order as we specified in the function parameters.
If we change the order of the parameters when we call the function, the columns of the output change accordingly:
select(pres_results, total_votes, year, state) # A tibble: 561 x 3 total_votes year state <dbl> <dbl> <chr> 1 123574 1976 AK 2 1182850 1976 AL 3 767535 1976 AR # … with 558 more rows
Exercise: Life expectancy in Austria
The gapminder_austria
dataset contains information about the economic and demographic change in Austria over the last decades. To inspect how the life expectancy in Austria changed over time, create a subset of the tibble that contains only the necessary columns for this task:
- Use the dplyr
select()
function and definegapminder_austria
as the input tibble. - Keep only the columns
year
andlifeExp
in the output dataset.
Note that the dplyr package is already loaded.
Start ExerciseRenaming columns
select(my_data_frame, column_one, column_two, ...) select(my_data_frame, new_column_name = current_column, ...) select(my_data_frame, column_start:column_end) select(my_data_frame, index_one, index_two, ...) select(my_data_frame, index_start:index_end)
In addition to defining the columns we want keep, we can also rename them. To do this, we need to set the new column name inside the select()
function using the command
new_column_name = current_column
In the following example, we select the columns year
, state
and total_votes
but rename the year
column to Election
in the output:
select(pres_results, Election = year, state, total_votes) # A tibble: 561 x 3 Election state total_votes <dbl> <chr> <dbl> 1 1976 AK 123574 2 1976 AL 1182850 3 1976 AR 767535 # … with 558 more rows
Exercise: Rename columns
The gapminder_india
dataset contains information about the economic and demographic change in India over the last decades. Inspect how the population in India changed over time:
- Use the
select()
function and definegapminder_india
as the input tibble. - Keep only the columns
year
andpop
. - Rename the
pop
column topopulation
in the output tibble.
Note that the dplyr package is already loaded.
Start ExerciseSelecting by name range
select(my_data_frame, column_one, column_two, ...) select(my_data_frame, new_column_name = current_column, ...) select(my_data_frame, column_start:column_end) select(my_data_frame, index_one, index_two, ...) select(my_data_frame, index_start:index_end)
When we use the select()
function and define the columns we want to keep, dplyr does not actually use the name of the columns but the index of the columns in the data frame. This means, when we define the first three columns of the pres_results
data frame, year
, state
and total_votes
, dplyr converts these names to the index values 1
, 2
and 3
. We can therefore also use the name of the columns, apply the :
operator and define ranges of columns, that we want to keep:
select(pres_results, year:total_votes) # A tibble: 561 x 3 year state total_votes <dbl> <chr> <dbl> 1 1976 AK 123574 2 1976 AL 1182850 3 1976 AR 767535 # … with 558 more rows
What the year:total_votes
does, can be translated to 1:3
, which is simply creating a vector of numerical values from 1 to 3. Then, the select()
function takes the pres_results
data frame and outputs a subset of it, keeping only the first three columns.
Exercise: Select a name range
The gapminder_europe_2007
dataset contains economic and demographic information about European countries for the year 2007:
# A tibble: 30 x 6 country continent year lifeExp pop gdpPercap <fct> <fct> <int> <dbl> <int> <dbl> 1 Albania Europe 2007 76.4 3600523 5937. 2 Austria Europe 2007 79.8 8199783 36126. 3 Belgium Europe 2007 79.4 10392226 33693. # … with 27 more rows
Create a subset of the tibble and compare the life expectancy in different European countries for the year 2007:
- Apply the
select()
function on thegapminder_europe_2007
tibble. - Use the
:
operator and select the columns fromcountry
tolifeExp
.
Note that the dplyr package is already loaded.
Start ExerciseSelect() by indices
select(my_data_frame, column_one, column_two, ...) select(my_data_frame, new_column_name = current_column, ...) select(my_data_frame, column_start:column_end) select(my_data_frame, index_one, index_two, ...) select(my_data_frame, index_start:index_end)
The select()
function can be used with column indices as well. Instead of using names we need to specify the columns we want to select by their indices. Compared to other programming languages the indexing in R starts with one instead of zero. To select the first, fourth and fifth column from the pres_results
dataset we can write
select(pres_results, 1,4,5) # A tibble: 561 x 3 year dem rep <dbl> <dbl> <dbl> 1 1976 0.357 0.579 2 1976 0.557 0.426 3 1976 0.650 0.349 # … with 558 more rows
Similarly to defining ranges of columns using their names, we can define ranges (or vectors) of index values instead:
select(pres_results, 1:3) # A tibble: 561 x 3 year state total_votes <dbl> <chr> <dbl> 1 1976 AK 123574 2 1976 AL 1182850 3 1976 AR 767535 # … with 558 more rows
Exercise: Select by indices
The gapminder_europe_2007
dataset contains economic and demographic information about European countries for the year 2007.
# A tibble: 30 x 6 country continent year lifeExp pop gdpPercap <fct> <fct> <int> <dbl> <int> <dbl> 1 Albania Europe 2007 76.4 3600523 5937. 2 Austria Europe 2007 79.8 8199783 36126. 3 Belgium Europe 2007 79.4 10392226 33693. # … with 27 more rows
Create a subset of the dataset and compare the GDP per capita of the European countries for the year 2007:
- Apply the
select()
function on thegapminder_europe_2007
tibble. - Keep the columns
country
andgdpPercap
, but use only the index of the columns (1
and6
) for this step.
Note that the dplyr package is already loaded.
Start ExerciseSelect columns from a data frame is an excerpt from the course Introduction to R, which is available for free at quantargo.com
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.