Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

After understanding “how to subset columns data in R“; this article aims to demonstrate row subsetting using base R and the “dplyr” package. Let’s see how to subset rows from a data frame in R and the flow of this article is as follows:

Prerequisites:

1. R
2. R Studio (for ease)

Data

We will use s and p 500 companies financials data to demonstrate row data subsetting. Interestingly, this data is available under the PDDL licence. For this article, we have altered the original file to have a limited number of columns only. However, we have a great post explaining how to prepare data for analysis in R in 5 steps using multiple CSV files where we have split the original file into multiple files and combined them to produce an original result. The CSV file we are using in this article is a result of how to prepare data for analysis in R in 5 steps article.

Before importing data into R, it is important to know that R keeps the data into the memory of your computer. As a result, performance may be impacted. In other words, we should calculate memory requirements in advance.

Let’s import constituents-financials_csv.csv file into R using read.csv function.

financials <- read.csv("constituents-financials_csv.csv")

The command above has created a new object called financials. This object is a data frame which contains all the data from financials CSV file. Let’s look at the structure of the newly created data frame.

str(financials)
## 'data.frame':    505 obs. of  6 variables:
##  $Symbol : chr "MMM" "AOS" "ABT" "ABBV" ... ##$ Name          : chr  "3M Company" "A.O. Smith Corp" "Abbott Laboratories" "AbbVie Inc." ...
##  $Sector : chr "Industrials" "Industrials" "Health Care" "Health Care" ... ##$ Price         : num  222.9 60.2 56.3 108.5 150.5 ...
##  $Price.Earnings: num 24.3 27.8 22.5 19.4 25.5 ... ##$ Dividend.Yield: num  2.33 1.15 1.91 2.5 1.71 ...

Interestingly, the result of str(financials) command shows that there are 505 observations (rows) and 14 variables (columns). Now let’s look at different ways of row subsetting from a data frame.

Subset nth row from a data frame

Using base R

Interestingly, if you type financials on the console you would find that R will display several observations from the data frame financials. The result could be overwhelming. Let’s have a look at the result and observe. To save space head function is used here with argument 2 to show only two observations.

head(financials,2)
##   Symbol            Name      Sector  Price Price.Earnings Dividend.Yield
## 1    MMM      3M Company Industrials 222.89          24.31       2.332862
## 2    AOS A.O. Smith Corp Industrials  60.24          27.76       1.147959

The number next to the two # symbols identifies the row uniquely. This number is known as the index. To select an nth row we have to supply the number of the row in bracket notation. Here is the example where we are selecting the 7th row of financials data frame:

financials[7,]
##   Symbol              Name      Sector  Price Price.Earnings Dividend.Yield
## 7    AYI Acuity Brands Inc Industrials 145.41          18.22      0.3511853

Square bracket notation is one way of subsetting data from a data frame. We should be careful when selecting rows and columns; especially the position of the expression concerning a comma. Expression left side of the comma is operated upon observations (rows) and on the other hand, expression at the right side of the comma is operated upon variables (columns).

Using the dplyr package

Following is the nth row selection using the dplyr package. You can see the result is identical. Function "row" used below returns a matrix of an integer as the row number or index.

library(dplyr)
financials %>% filter(row(financials) == 7)
##   Symbol              Name      Sector  Price Price.Earnings Dividend.Yield
## 1    AYI Acuity Brands Inc Industrials 145.41          18.22      0.3511853

Subset range of rows from a data frame

Using base R

It is interesting to know that we can select any row by just supplying the number or the index of that row with square brackets to get the result. Similarly, we can retrieve the range of rows as well. This can be done by simply providing the range in square brackets notations. Let’s look at the example by selecting 3 rows starting from 2nd row.

financials[2:4,]
##   Symbol                Name      Sector  Price Price.Earnings Dividend.Yield
## 2    AOS     A.O. Smith Corp Industrials  60.24          27.76       1.147959
## 3    ABT Abbott Laboratories Health Care  56.27          22.51       1.908982
## 4   ABBV         AbbVie Inc. Health Care 108.48          19.41       2.499560

But what is the use of this functionality?

Well, in complex scenario you may want to derive the range based on different conditions and observe the result. Interestingly, a range can also be defined within square brackets using concatenation function to select a range of rows from a data frame. Here is an example. Result for both the commands is identical.

financials[c(2:4),]
##   Symbol                Name      Sector  Price Price.Earnings Dividend.Yield
## 2    AOS     A.O. Smith Corp Industrials  60.24          27.76       1.147959
## 3    ABT Abbott Laboratories Health Care  56.27          22.51       1.908982
## 4   ABBV         AbbVie Inc. Health Care 108.48          19.41       2.499560

Using the dplyr function

financials %>% filter(row(financials) >= 2 & row(financials) <= 4)
##   Symbol                Name      Sector  Price Price.Earnings Dividend.Yield
## 1    AOS     A.O. Smith Corp Industrials  60.24          27.76       1.147959
## 2    ABT Abbott Laboratories Health Care  56.27          22.51       1.908982
## 3   ABBV         AbbVie Inc. Health Care 108.48          19.41       2.499560

Though using the dplyr package for this purpose is an overkill. We must note that the same operation can be performed in many ways.

Conditional subsetting of rows from a data frame

The subsetting we have seen in two sections above is simple but in real-world, the requirement may be based on complex conditions. In this section, our emphasis will be on conditional subsetting of the rows from a data frame.

Scenario

Let’s retrieve all the rows from the financials data frame where variable (column) symbol is “ACN” or “APTV” and the variable sector is either “Industrials” or “Consumer Discretionary”.

Using base R

financials[(financials$Symbol == "ACN" | financials$Symbol == "APTV") & (financials$Sector == "Consumer Discretionary" | financials$Sector == "Industrials"),]
##    Symbol      Name                 Sector Price Price.Earnings Dividend.Yield
## 54   APTV Aptiv Plc Consumer Discretionary 89.27          69.74  

We can see that only one row falls into the subsetting criterion. Notice the use of “&”, “|” and “==” operators to define “AND”, “OR” and equality conditions respectively. Additionally, note that full subsetting criterion is written on the left side of the comma as we are filtering only rows.

Using the dplyr package

The dplyr package offers great functionality when it comes to filtering observations (rows). Let’s look at the same scenario by using dplyr function.

financials %>% filter((Symbol == "ACN" | Symbol == "APTV") & (Sector == "Consumer Discretionary" | Sector == "Industrials"))
##   Symbol      Name                 Sector Price Price.Earnings Dividend.Yield
## 1   APTV Aptiv Plc Consumer Discretionary 89.27          69.74      0.9392678

The result is identical and the difference is clear that we do not have to use the “\$” subscript to define the column while applying the conditions. This is a normal notation when using the dplyr package.

I hope you enjoyed the read. Please do not forget to share and comment below. If you wish, you can download the R code file containing all commands listed here at our GitHub repository.

Image Credit unsplash.com