R:case4base – data filtering and ordering with base R

April 20, 2018
By

(This article was first published on R blog, and kindly contributed to R-bloggers)

Introduction

In the previous article we discussed and learned how to reshape data with base R to a form that is practical for our use. In this one, we will look at basic data manipulation techniques, namely obtaining relevant subsets of our data and sorting the data to suit our needs. The key will be safety and avoiding complication and confusion as much as possible. This is why we:

  • try to avoid using subset, as this function is implemented via non-standard evaluation.
  • also skip $ as it uses partial matching and is hardly used with variables as column names.
  • not mention more details related to list properties of data.frames here as the topic could get confusing. If you would like to go to more important detail, we strongly recommend a thorough read of the subsetting chapter of Hadley Wickham’s Advanced R

How to use this article

  1. This article is best used with an R session opened in a window next to it – you can test and play with the code yourself instantly while reading. Assuming the author did not fail miserably, the code will work as-is even with vanilla R, no packages or setup needed – it is a case4base after all!
  2. If you have no time for reading, you can click here to get just the code with commentary

First, let’s read in yearly data on gross disposable income of household in the EU countries into R (click here to download):

gdi <- read.csv(
  stringsAsFactors = FALSE
, url("https://jozefhajnala.gitlab.io/r/post/data/ESA2010_GDI.csv")
              )
head(gdi[, 1:6, drop = FALSE])
##          country   Y.1995    Y.1996    Y.1997    Y.1998    Y.1999
## 1          EU 28       NA        NA        NA        NA 5982392.8
## 2   Euro area 19       NA        NA        NA        NA 4393727.3
## 3        Belgium 140734.1  141599.4  145023.2  149705.2  153804.0
## 4       Bulgaria   1036.0    1468.1   12367.4   14921.1   16052.8
## 5 Czech Republic 894042.0 1030001.0 1153966.0 1223783.0 1280040.0
## 6        Denmark 566363.0  578102.0  591416.0  621236.0  614893.0

Selecting (subsetting) relevant data from a data.frame

In this paragraph, we will try to show how to subset with as little hassle as possible while preserving the maximum safety in your operations. We shall go into more detail later in the article. The standard approach to subsetting data.frames can be summarised:

dataframe_name[row_subset, col_subset, drop = FALSE]

Where:

  • dataframe_name is the name of the data.frame we are subsetting
  • row_subset is a vector specifying the subset of rows
  • col_subset is a vector specifying the subset of columns
  • drop = FALSE is to make sure the result does not get simplyfied when not indented. This should always be used, unless we specifically want to simplify the result (e.g. to a vector for indexing)

Constructing meaningful subsets simply and safely

In practice, we of course will most of the time not select rows and/or columns with positions known apriori, but based on more variable conditions. For this purpose, the advised way would be is to construct logical vectors:

Let us now subset the rows of our data to get the data for countries that have known (not NA) value in the year 2016 and this value is less than 1 million:

rowidx <- !is.na(gdi[, "Y.2016"]) & gdi[, "Y.2016"] < 1000000
gdi[rowidx, c(1, 23), drop = FALSE]
##        country    Y.2016
## 3      Belgium 243825.50
## 4     Bulgaria  60237.00
## 8      Estonia  12548.30
## 9      Ireland  97318.90
## 11       Spain 698701.00
## 13     Croatia      0.00
## 16      Latvia  15737.79
## 17   Lithuania  24743.49
## 18  Luxembourg  20155.80
## 21 Netherlands 357383.00
## 22     Austria 214980.60
## 24    Portugal 128789.39
## 26    Slovenia  24756.63
## 27    Slovakia  48882.91
## 28     Finland 126590.00
## 33 Switzerland 458641.00

Note that when creating the rowidx we omitted the drop = FALSE despite the aforementioned best practice. This is because in this particular case we consciously welcome the result being simplified to a vector, as its use is only as an index for subsetting.

More ways to provide subset indices

Subsetting can be done in a few ways. We will now use them to show a subset the first two and the 27th row and the first, 22nd and 23rd column, giving us the GDI for EU28, Euro Area 19 and Slovakia in the years 2015 and 2016:

  1. Logical vectors TRUE for rows/columns to subset, FALSE for those to omit
st1 <- gdi[c(TRUE, TRUE, rep(FALSE, 24), TRUE, rep(FALSE, 8))
         , c(TRUE, rep(FALSE, 20), rep(TRUE, 2))
         , drop = FALSE
         ]
  1. Numeric vectors of row/column numbers to subset
st2 <- gdi[c(1:2, 27) 
         , c(1, 22:23)
         , drop = FALSE
         ]
  1. Negative numeric vectors of row/column numbers to omit
st3 <- gdi[c(-3:-26, -28:-35)
         , c(-2:-21)
         , drop = FALSE
         ]
  1. Character vectors of row/column names to subset
st4 <- gdi[c("1", "2", "27") # we do not have very meaningful rownames
         , c("country", "Y.2015", "Y.2016")
         , drop = FALSE
         ]
st4
##         country     Y.2015     Y.2016
## 1         EU 28 9439578.39 9454683.60
## 2  Euro area 19 6598231.27 6736686.43
## 27     Slovakia   47464.71   48882.91
  1. All of the above give identical results
identical(st1, st2) && identical(st2, st3) && identical(st3, st4)
## [1] TRUE

Tips

  1. The above methods are also working and safe for matrices
  2. Negative and positive numeric vectors cannot be combined

Sorting data

The concept of subsetting can easily be generalized for sorting purposes. We can easily see this by the following command giving us identical data.frame to GDI:

identical(gdi[0:nrow(gdi), 0:ncol(gdi), drop = FALSE], gdi)
## [1] TRUE

To get the rows and columns in reverse order:

gdi_reversed <- gdi[nrow(gdi):0, ncol(gdi):0, drop = FALSE]

To get the order by GDI in 2016, we simply find the rowidx with order and subset with it:

rowidx <- order(gdi[, "Y.2016"])
gdi[rowidx, c(1, 23), drop = FALSE]
##           country     Y.2016
## 13        Croatia       0.00
## 8         Estonia   12548.30
## 16         Latvia   15737.79
## 18     Luxembourg   20155.80
## 17      Lithuania   24743.49
## 26       Slovenia   24756.63
## 27       Slovakia   48882.91
## 4        Bulgaria   60237.00
## 9         Ireland   97318.90
## 28        Finland  126590.00
## 24       Portugal  128789.39
## 22        Austria  214980.60
## 3         Belgium  243825.50
## 21    Netherlands  357383.00
## 33    Switzerland  458641.00
## 11          Spain  698701.00
## 6         Denmark 1091542.00
## 23         Poland 1136916.00
## 14          Italy 1142273.40
## 30 United Kingdom 1330854.00
## 12         France 1425435.00
## 32         Norway 1627136.00
## 7         Germany 2019917.00
## 29         Sweden 2402587.00
## 5  Czech Republic 2523229.00
## 2    Euro area 19 6736686.43
## 1           EU 28 9454683.60
## 10         Greece         NA
## 15         Cyprus         NA
## 19        Hungary         NA
## 20          Malta         NA
## 25        Romania         NA
## 31        Iceland         NA
## 34         Turkey         NA
## 35         Serbia         NA

That looks good, but we may want to order the rows that have NA as GDI in 2016 alphabetically by country (or generalize even further). To use multiple vectors for ordering is also very simple:

rowidx <- order(gdi[, "Y.2016"], gdi[, "country"])
gdi[rowidx, c(1, 23), drop = FALSE]
##           country     Y.2016
## 13        Croatia       0.00
## 8         Estonia   12548.30
## 16         Latvia   15737.79
## 18     Luxembourg   20155.80
## 17      Lithuania   24743.49
## 26       Slovenia   24756.63
## 27       Slovakia   48882.91
## 4        Bulgaria   60237.00
## 9         Ireland   97318.90
## 28        Finland  126590.00
## 24       Portugal  128789.39
## 22        Austria  214980.60
## 3         Belgium  243825.50
## 21    Netherlands  357383.00
## 33    Switzerland  458641.00
## 11          Spain  698701.00
## 6         Denmark 1091542.00
## 23         Poland 1136916.00
## 14          Italy 1142273.40
## 30 United Kingdom 1330854.00
## 12         France 1425435.00
## 32         Norway 1627136.00
## 7         Germany 2019917.00
## 29         Sweden 2402587.00
## 5  Czech Republic 2523229.00
## 2    Euro area 19 6736686.43
## 1           EU 28 9454683.60
## 15         Cyprus         NA
## 10         Greece         NA
## 19        Hungary         NA
## 31        Iceland         NA
## 20          Malta         NA
## 25        Romania         NA
## 35         Serbia         NA
## 34         Turkey         NA

To order in descending order, we can use decreasing = TRUE, to see NAs first we can use na.last = FALSE

rowidx <- order(gdi[, "Y.2016"], decreasing = TRUE, na.last = FALSE)
gdi[rowidx, c(1, 23), drop = FALSE]
##           country     Y.2016
## 10         Greece         NA
## 15         Cyprus         NA
## 19        Hungary         NA
## 20          Malta         NA
## 25        Romania         NA
## 31        Iceland         NA
## 34         Turkey         NA
## 35         Serbia         NA
## 1           EU 28 9454683.60
## 2    Euro area 19 6736686.43
## 5  Czech Republic 2523229.00
## 29         Sweden 2402587.00
## 7         Germany 2019917.00
## 32         Norway 1627136.00
## 12         France 1425435.00
## 30 United Kingdom 1330854.00
## 14          Italy 1142273.40
## 23         Poland 1136916.00
## 6         Denmark 1091542.00
## 11          Spain  698701.00
## 33    Switzerland  458641.00
## 21    Netherlands  357383.00
## 3         Belgium  243825.50
## 22        Austria  214980.60
## 24       Portugal  128789.39
## 28        Finland  126590.00
## 9         Ireland   97318.90
## 4        Bulgaria   60237.00
## 27       Slovakia   48882.91
## 26       Slovenia   24756.63
## 17      Lithuania   24743.49
## 18     Luxembourg   20155.80
## 16         Latvia   15737.79
## 8         Estonia   12548.30
## 13        Croatia       0.00

Sorting by multiple vectors with different order

To order by multiple columns differently, for numeric vectors we can use a simple -, since negated numeric vector will order in reverse order. To order our GDI dataset by GDI in 2016 descending and then by country alphabetically:

rowidx <- order(-gdi[, "Y.2016"], gdi[, "country"])
gdi[rowidx, c(1, 23), drop = FALSE]
##           country     Y.2016
## 1           EU 28 9454683.60
## 2    Euro area 19 6736686.43
## 5  Czech Republic 2523229.00
## 29         Sweden 2402587.00
## 7         Germany 2019917.00
## 32         Norway 1627136.00
## 12         France 1425435.00
## 30 United Kingdom 1330854.00
## 14          Italy 1142273.40
## 23         Poland 1136916.00
## 6         Denmark 1091542.00
## 11          Spain  698701.00
## 33    Switzerland  458641.00
## 21    Netherlands  357383.00
## 3         Belgium  243825.50
## 22        Austria  214980.60
## 24       Portugal  128789.39
## 28        Finland  126590.00
## 9         Ireland   97318.90
## 4        Bulgaria   60237.00
## 27       Slovakia   48882.91
## 26       Slovenia   24756.63
## 17      Lithuania   24743.49
## 18     Luxembourg   20155.80
## 16         Latvia   15737.79
## 8         Estonia   12548.30
## 13        Croatia       0.00
## 15         Cyprus         NA
## 10         Greece         NA
## 19        Hungary         NA
## 31        Iceland         NA
## 20          Malta         NA
## 25        Romania         NA
## 35         Serbia         NA
## 34         Turkey         NA

For non-numeric vectors, we can take advantage of the xtfrm function, which returns a numeric vector which will sort in the same order as the one provided to it. Then we just use - to get a vector that will order in reverse order. To order our GDI dataset by GDI ascending in 2016 and then by country reverse-alphabetically:

rowidx <- order(gdi[, "Y.2016"], -xtfrm(gdi[, "country"]))
gdi[rowidx, c(1, 23), drop = FALSE]
##           country     Y.2016
## 13        Croatia       0.00
## 8         Estonia   12548.30
## 16         Latvia   15737.79
## 18     Luxembourg   20155.80
## 17      Lithuania   24743.49
## 26       Slovenia   24756.63
## 27       Slovakia   48882.91
## 4        Bulgaria   60237.00
## 9         Ireland   97318.90
## 28        Finland  126590.00
## 24       Portugal  128789.39
## 22        Austria  214980.60
## 3         Belgium  243825.50
## 21    Netherlands  357383.00
## 33    Switzerland  458641.00
## 11          Spain  698701.00
## 6         Denmark 1091542.00
## 23         Poland 1136916.00
## 14          Italy 1142273.40
## 30 United Kingdom 1330854.00
## 12         France 1425435.00
## 32         Norway 1627136.00
## 7         Germany 2019917.00
## 29         Sweden 2402587.00
## 5  Czech Republic 2523229.00
## 2    Euro area 19 6736686.43
## 1           EU 28 9454683.60
## 34         Turkey         NA
## 35         Serbia         NA
## 25        Romania         NA
## 20          Malta         NA
## 31        Iceland         NA
## 19        Hungary         NA
## 10         Greece         NA
## 15         Cyprus         NA

Alternatives to base R

TL;DR – Just want the code

No time for reading? Click here to get just the code with commentary

Exercises

  1. What is the difference between gdi[3, 3] and gdi[3, 3, drop = FALSE] ?
  2. What is the difference between gdi[-3, 3] and gdi[3, -3] ? What about gdi[-3, 3, drop = FALSE] ?
  3. Why does rowidx <- order(-gdi[, "country"]) not work as expected? Can you fix?
  4. Why cannot we omit the first part of the & in rowidx <- !is.na(gdi[, "Y.2016"]) & gdi[, "Y.2016"] < 1000000. What would happen if we just did rowidx <- gdi[, "Y.2016"] < 1000000 ?
  5. Bonus question 1: Why is identical(gdi[, "Y.2016", drop = FALSE], gdi["Y.2016"])
  6. Bonus question 2: Why is identical(gdi[, "Y.2016"], gdi[["Y.2016"]])

References

  1. Advanced R’s chapter on subsetting
  2. and on data types

Discuss the article


To leave a comment for the author, please follow the link and comment on their blog: R blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)