How to subset a data frame column data in R

[This article was first published on dataENQ, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In this article, we present the audience with different ways of subsetting data from a data frame column using base R and dplyr. Let’s check out how to subset a data frame column data in R. The summary of the content of this article is as follows:


  1. Data
  2. Reading Data
  3. Subset a data frame column data
  4. Subset all data from a data frame
  5. Subset column from a data frame
  6. Subset multiple columns from a data frame
  7. Subset all columns data but one from a data frame
  8. Subset columns which share same character or string at the start of their name

Prerequisites:

  1. R
  2. R Studio (for ease)

Assumption: Working directory is set and datasets are stored in the working directory. setwd() command is used to set the working directory. Supply the path of directory enclosed in double quotes to set it as a working directory.

Data

We will use s and p 500 companies financials data to demonstrate row data subsetting. Interestingly, this data is available under the PDDL licence. 

We have a great post explaining how to prepare data for analysis in R in 5 steps using multiple CSV files where we have split the original file into multiple files and combined them to produce an original result. The CSV file we are using in this article is a result of how to prepare data for analysis in R in 5 steps article.

Reading data

Data can come from any source, it can be a flat file, database system, or handwritten notes. Usually, flat files are the most common source of the data. In this section, we will see how to load data from a CSV file. Most importantly, if we are working with a large dataset then we must check the capacity of our computer as R keep the data into memory. Let’s read the CSV file into R.

financials <- read.csv("constituents-financials_csv.csv")   

The command above will import the content of the constituents-financials_csv.csv file into an object called the financials. Object financials is a data frame that contains all the data from the constituents-financials_csv.csv file. If you have a relation database experience then we can loosely compare this to a relational database object “table”. Similar to tables, data frames also have rows and columns, and data is presented in rows and columns form.

To clarify, function read.csv above take multiple other arguments other than just the name of the file. Information on additional arguments can be found at read.csv. Let’s continue learning how to subset a data frame column data in R.

Subset a data frame column data

Before we learn how to subset columns data in R from a data frame "financials", I would recommend learning the following three functions using "financials" data frame:

names(financials)
##  [1] "Symbol"         "Name"           "Sector"         "Price"         
##  [5] "Price.Earnings" "Dividend.Yield" "Earnings.Share" "X52.Week.Low"  
##  [9] "X52.Week.High"  "Market.Cap"     "EBITDA"         "Price.Sales"   
## [13] "Price.Book"     "SEC.Filings"

Command names(financials) above would return all the column names of the data frame. Checking column names just after loading the data is useful as this will make you familiar with the data frame. Do not worry about the numbers in the square brackets just yet, we will look at them in a future article. The names of the columns are listed next to the numbers in the brackets and there are a total of 14 columns in the financials data frame.

str(financials)
## 'data.frame':    505 obs. of  14 variables:
##  $ Symbol        : chr  "MMM" "AOS" "ABT" "ABBV" ...
##  $ Name          : chr  "3M Company" "A.O. Smith Corp" "Abbott Laboratories" "AbbVie Inc." ...
##  $ Sector        : chr  "Industrials" "Industrials" "Health Care" "Health Care" ...
##  $ Price         : num  222.9 60.2 56.3 108.5 150.5 ...
##  $ Price.Earnings: num  24.3 27.8 22.5 19.4 25.5 ...
##  $ Dividend.Yield: num  2.33 1.15 1.91 2.5 1.71 ...
##  $ Earnings.Share: num  7.92 1.7 0.26 3.29 5.44 1.28 7.43 3.39 6.19 0.03 ...
##  $ X52.Week.Low  : num  259.8 68.4 64.6 125.9 162.6 ...
##  $ X52.Week.High : num  175.5 48.9 42.3 60 114.8 ...
##  $ Market.Cap    : num  1.39e+11 1.08e+10 1.02e+11 1.81e+11 9.88e+10 ...
##  $ EBITDA        : num  9.05e+09 6.01e+08 5.74e+09 1.03e+10 5.64e+09 ...
##  $ Price.Sales   : num  4.39 3.58 3.74 6.29 2.6 ...
##  $ Price.Book    : num  11.34 6.35 3.19 26.14 10.62 ...
##  $ SEC.Filings   : chr  "http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=MMM" "http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=AOS" "http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=ABT" "http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=ABBV" ...

Command str(financials) would return the structure of the data frame. Function str() compactly displays the internal structure of the object, be it data frame or any other. Above is the structure of the financials data frame. The result from str() function above shows the data type of the columns financials data frame has, as well as sample data from the individual columns.

dim(financials)
## [1] 505  14

Command dim(financials) mentioned above will result in dimensions of the financials data frame or in other words total number of rows and columns this data frame has. In statistics terms, a column is a variable and row is an observation. Data frame financials has 505 observations and 14 variables.

Subset all data from a data frame

In base R, just putting the name of the data frame financials on the prompt will display all of the data for that data frame. Commands head(financials) or head(financials, 10), 10 is just to show the parameter that head function can take which limit the number of lines. Similarly, tail(financials) or tail(financials, 10) will be helpful to quickly check the data from the end.

Subset column from a data frame

In base R, you can specify the name of the column that you would like to select with $ sign (indexing tagged lists) along with the data frame. The command head(financials$Population, 10) would show the first 10 observations from column Population from data frame financials:

head(financials$Name, 10)
##  [1] "3M Company"                 "A.O. Smith Corp"           
##  [3] "Abbott Laboratories"        "AbbVie Inc."               
##  [5] "Accenture plc"              "Activision Blizzard"       
##  [7] "Acuity Brands Inc"          "Adobe Systems Inc"         
##  [9] "Advance Auto Parts"         "Advanced Micro Devices Inc"

What we have done above can also be done using dplyr package. As per rdocumentation.org “dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges.” Here is a command using dplyr package which selects Population column from the financials data frame:

You can see the presentation of the result between subsetting using $ sign (element names operator) and using dplyr package.

Subsetting multiple columns from a data frame

Using base R

The following command will help subset multiple columns. In the command below first two columns are selected from the data frame financials. If you see the result for command names(financials) above, you would find that "Symbol" and "Name" are the first two columns.

head(financials[,c(1,2)],10)
##    Symbol                       Name
## 1     MMM                 3M Company
## 2     AOS            A.O. Smith Corp
## 3     ABT        Abbott Laboratories
## 4    ABBV                AbbVie Inc.
## 5     ACN              Accenture plc
## 6    ATVI        Activision Blizzard
## 7     AYI          Acuity Brands Inc
## 8    ADBE          Adobe Systems Inc
## 9     AAP         Advance Auto Parts
## 10    AMD Advanced Micro Devices Inc

Or we can supply the name of the columns and select them. Here is an example:

head(financials[,c("Symbol","Name")],10)
##    Symbol                       Name
## 1     MMM                 3M Company
## 2     AOS            A.O. Smith Corp
## 3     ABT        Abbott Laboratories
## 4    ABBV                AbbVie Inc.
## 5     ACN              Accenture plc
## 6    ATVI        Activision Blizzard
## 7     AYI          Acuity Brands Inc
## 8    ADBE          Adobe Systems Inc
## 9     AAP         Advance Auto Parts
## 10    AMD Advanced Micro Devices Inc

Using dplyr package select clause

result <- financials %>% select(Symbol, Name)
head(result, 10)
##    Symbol                       Name
## 1     MMM                 3M Company
## 2     AOS            A.O. Smith Corp
## 3     ABT        Abbott Laboratories
## 4    ABBV                AbbVie Inc.
## 5     ACN              Accenture plc
## 6    ATVI        Activision Blizzard
## 7     AYI          Acuity Brands Inc
## 8    ADBE          Adobe Systems Inc
## 9     AAP         Advance Auto Parts
## 10    AMD Advanced Micro Devices Inc

Using base R

Any number of columns can be selected this way by giving the number or the name of the column within a vector. Let’s find out the first, fourth, and eleventh column from the financials data frame. Remember, instead of the number you can give the name of the column enclosed in double-quotes:

head(financials[,c(1,4,11)],10)
##    Symbol  Price      EBITDA
## 1     MMM 222.89  9048000000
## 2     AOS  60.24   601000000
## 3     ABT  56.27  5744000000
## 4    ABBV 108.48 10310000000
## 5     ACN 150.51  5643228000
## 6    ATVI  65.83  2704000000
## 7     AYI 145.41   587800000
## 8    ADBE 185.16  2538040000
## 9     AAP 109.63   853941000
## 10    AMD  11.22   339000000

Using dplyr package select clause

result <- financials %>% select(Symbol, Price, EBITDA)
head(result, 10)
##    Symbol  Price      EBITDA
## 1     MMM 222.89  9048000000
## 2     AOS  60.24   601000000
## 3     ABT  56.27  5744000000
## 4    ABBV 108.48 10310000000
## 5     ACN 150.51  5643228000
## 6    ATVI  65.83  2704000000
## 7     AYI 145.41   587800000
## 8    ADBE 185.16  2538040000
## 9     AAP 109.63   853941000
## 10    AMD  11.22   339000000

Subset all columns but one from a data frame

This approach is called subsetting by the deletion of entries. In base R you can specify which column you would like to exclude from the selection by putting a minus sign in from of it. Let’s try:

result <- head(financials[,-6],10)
dim(result)
## [1] 10 13

Now if we analyse the result of the above command, we can see the dimension of the result variable is showing 10 observations (rows) and 13 variables (columns). If you check the result of command dim(financials) above, you can see there were total 14 variables in the financials data frame but as we have excluded the sixth column using -6 in column section in command result <- head(financials[,-6],10) which returned a result for all columns except sixth. A similar operation can be performed using dplyr package and instead of using the minus sign on the number of a column, you can use it directly on the name of the column. Here is the example where we would exclude column “EBITDA” form the result set:

result <- select(financials, -EBITDA)
dim(head(result,10))
## [1] 10 13

Subsetting all columns which start with a particular character or string

If you go back to the result of names(financials) command you would see that few column names start with the same string. Columns we particularly interested in here start with word “Price”. Following R command using dplyr package will help us subset these two columns by writing as little code as possible. Imagine a scenario when you have several columns which start with the same character or string and in such scenario following command will be helpful:

result <- select(financials, starts_with("Price"))
head(result, 10)
##     Price Price.Earnings Price.Sales Price.Book
## 1  222.89          24.31    4.390271      11.34
## 2   60.24          27.76    3.575483       6.35
## 3   56.27          22.51    3.740480       3.19
## 4  108.48          19.41    6.291571      26.14
## 5  150.51          25.47    2.604117      10.62
## 6   65.83          31.80   10.595120       5.16
## 7  145.41          18.22    1.795347       3.55
## 8  185.16          52.31   13.092818      11.06
## 9  109.63          19.54    1.130106       2.51
## 10  11.22         187.00    2.109195      21.47

I hope you enjoyed this post and learned how to subset a data frame column data in R. If it helped you in any way then please do not forget to share this post.

Image Credit unsplash.com

To leave a comment for the author, please follow the link and comment on their blog: dataENQ.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)