R:case4base – Sorting data with base R

December 8, 2018
By

(This article was first published on Jozef's Rblog, and kindly contributed to R-bloggers)

Introduction

In this post in the R:case4base series we will examine sorting (ordering) data in base R. We will learn to sort our data based on one or multiple columns, with ascending or descending order and as always look at alternatives to base R, namely the tidyverse’s dplyr and data.table to show how we can achieve the same results.

It is recommended to first have a look at the post on subsetting to understand the concepts underlying the sorting process in more detail.

How to use this article

  1. This article is best used with an R session opened in a window next to it – you can test and play with the code yourself instantly while reading. Assuming the author did not fail miserably, the code will work as-is even with vanilla R, no packages or setup needed – it is a case4base after all!
  2. If you have no time for reading, you can click here to get just the code with commentary

First, let’s read in yearly data on gross disposable income of household in the EU countries into R (click here to download):

gdi <- read.csv(
  stringsAsFactors = FALSE
, url("https://jozefhajnala.gitlab.io/r/post/data/ESA2010_GDI.csv")
              )
head(gdi[, 1:6, drop = FALSE])
##          country   Y.1995    Y.1996    Y.1997    Y.1998    Y.1999
## 1          EU 28       NA        NA        NA        NA 5982392.8
## 2   Euro area 19       NA        NA        NA        NA 4393727.3
## 3        Belgium 140734.1  141599.4  145023.2  149705.2  153804.0
## 4       Bulgaria   1036.0    1468.1   12367.4   14921.1   16052.8
## 5 Czech Republic 894042.0 1030001.0 1153966.0 1223783.0 1280040.0
## 6        Denmark 566363.0  578102.0  591416.0  621236.0  614893.0

Please note that the figures in the data provided by Eurostat are presented in millions of euros for euro area countries, euro area and EU aggregates and in millions of national currency otherwise. This makes comparing the results between countries difficult, since one would need to do a proper time-dependent currency conversion and potentially inflation adjustment to get comparable data.

The goal of the article is therefore not really in presenting these concrete results, but to focus on the technical aspects and usefulness of the presented methods.

Subsetting as a mechanism for sorting data

Sorting a data frame is loosely coupled with subsetting. To get the rows of a data frame in order reverse to the current one, we can just subset the rows with an index that goes from the last row to the very first (or safer, zeroth) like so:

gdi_reversed_rows <- gdi[nrow(gdi):0, ]

We can take a very similar approach to reverse order the columns:

gdi_reversed_cols <- gdi[, ncol(gdi):0]

Or both rows and columns at the same time. We also add the drop = FALSE for safety here as we omitted it in the 2 above examples for readability:

gdi_reversed <- gdi[nrow(gdi):0, ncol(gdi):0, drop = FALSE]
head(gdi_reversed)
##     Y.2016  Y.2015    Y.2014    Y.2013    Y.2012    Y.2011    Y.2010
## 35      NA      NA        NA        NA        NA        NA        NA
## 34      NA 1631795 1438281.4 1268729.8 1081744.9  971545.3  807128.5
## 33  458641  447094  449119.3  437596.6  428131.2  420404.9  412363.1
## 32 1627136 1606745 1496128.0 1419380.0 1347970.0 1272065.0 1204442.0
## 31      NA      NA 1055733.5  980494.9  934077.3  872900.3  798916.7
## 30 1330854 1298475 1269177.0 1219699.0 1195227.0 1160813.0 1151812.0
##       Y.2009    Y.2008    Y.2007    Y.2006   Y.2005   Y.2004   Y.2003
## 35        NA        NA        NA        NA       NA       NA       NA
## 34  689431.6        NA        NA        NA       NA       NA       NA
## 33  404446.9  399834.1  389468.0  368868.0 352620.1 341709.9 337742.9
## 32 1150829.0 1105563.0 1021911.0  943515.0 975153.0 894892.0 854026.0
## 31  858678.9  909995.1  827339.5  681058.3 631210.9 536194.9 478645.8
## 30 1101109.0 1080225.0 1063178.0 1005630.0 966175.0 926670.0 893528.0
##      Y.2002   Y.2001   Y.2000   Y.1999   Y.1998   Y.1997   Y.1996   Y.1995
## 35       NA       NA       NA       NA       NA       NA       NA       NA
## 34       NA       NA       NA       NA       NA       NA       NA       NA
## 33 335845.6 336581.4 326269.3 312478.7 303239.5 296324.6 291208.4 287865.4
## 32 800130.0 727228.0 704697.0 660196.0 630865.0 582597.0 549694.0 522981.0
## 31 447572.6 400145.0 369181.0       NA       NA       NA       NA       NA
## 30 857352.0 829908.0 789615.0 737419.0 715396.0 691951.0 656455.0 618959.0
##           country
## 35         Serbia
## 34         Turkey
## 33    Switzerland
## 32         Norway
## 31        Iceland
## 30 United Kingdom

Sorting data by contents of a column

To order the rows (countries) by GDI in 2016, we use the function order, which finds the permutation that rearranges the values into ascending order and save that order into a variable called rowidx. Then we simply use rowidx to subset the rows of gdi in the order we wanted:

rowidx <- order(gdi[, "Y.2016"])
rowidx
##  [1] 13  8 16 18 17 26 27  4  9 28 24 22  3 21 33 11  6 23 14 30 12 32  7
## [24] 29  5  2  1 10 15 19 20 25 31 34 35
gdi_sorted <- gdi[rowidx, , drop = FALSE]

# We can of course do it in one go:
gdi_sorted <- gdi[order(gdi[, "Y.2016"]), , drop = FALSE]

# Look at the 2 relevant columns of the result 
gdi_sorted[, c(1, 23)]
##           country     Y.2016
## 13        Croatia       0.00
## 8         Estonia   12548.30
## 16         Latvia   15737.79
## 18     Luxembourg   20155.80
## 17      Lithuania   24743.49
## 26       Slovenia   24756.63
## 27       Slovakia   48882.91
## 4        Bulgaria   60237.00
## 9         Ireland   97318.90
## 28        Finland  126590.00
## 24       Portugal  128789.39
## 22        Austria  214980.60
## 3         Belgium  243825.50
## 21    Netherlands  357383.00
## 33    Switzerland  458641.00
## 11          Spain  698701.00
## 6         Denmark 1091542.00
## 23         Poland 1136916.00
## 14          Italy 1142273.40
## 30 United Kingdom 1330854.00
## 12         France 1425435.00
## 32         Norway 1627136.00
## 7         Germany 2019917.00
## 29         Sweden 2402587.00
## 5  Czech Republic 2523229.00
## 2    Euro area 19 6736686.43
## 1           EU 28 9454683.60
## 10         Greece         NA
## 15         Cyprus         NA
## 19        Hungary         NA
## 20          Malta         NA
## 25        Romania         NA
## 31        Iceland         NA
## 34         Turkey         NA
## 35         Serbia         NA

To order in descending order, we can use decreasing = TRUE, to see NAs first we can use na.last = FALSE

rowidx <- order(gdi[, "Y.2016"], decreasing = TRUE, na.last = FALSE)
gdi[rowidx, c(1, 23), drop = FALSE]
##           country     Y.2016
## 10         Greece         NA
## 15         Cyprus         NA
## 19        Hungary         NA
## 20          Malta         NA
## 25        Romania         NA
## 31        Iceland         NA
## 34         Turkey         NA
## 35         Serbia         NA
## 1           EU 28 9454683.60
## 2    Euro area 19 6736686.43
## 5  Czech Republic 2523229.00
## 29         Sweden 2402587.00
## 7         Germany 2019917.00
## 32         Norway 1627136.00
## 12         France 1425435.00
## 30 United Kingdom 1330854.00
## 14          Italy 1142273.40
## 23         Poland 1136916.00
## 6         Denmark 1091542.00
## 11          Spain  698701.00
## 33    Switzerland  458641.00
## 21    Netherlands  357383.00
## 3         Belgium  243825.50
## 22        Austria  214980.60
## 24       Portugal  128789.39
## 28        Finland  126590.00
## 9         Ireland   97318.90
## 4        Bulgaria   60237.00
## 27       Slovakia   48882.91
## 26       Slovenia   24756.63
## 17      Lithuania   24743.49
## 18     Luxembourg   20155.80
## 16         Latvia   15737.79
## 8         Estonia   12548.30
## 13        Croatia       0.00

Sorting by multiple vectors with different order

That looks good, but we may want to order the rows that have NA as GDI in 2016 alphabetically by country (or generalize even further). To use multiple vectors for ordering is also very simple:

rowidx <- order(gdi[, "Y.2016"], gdi[, "country"])
head(gdi[rowidx, c(1, 23), drop = FALSE])
##       country   Y.2016
## 13    Croatia     0.00
## 8     Estonia 12548.30
## 16     Latvia 15737.79
## 18 Luxembourg 20155.80
## 17  Lithuania 24743.49
## 26   Slovenia 24756.63

To order by multiple columns differently, for numeric vectors we can use a simple -, since negated numeric vector will order in reverse order. To order our GDI dataset by GDI in 2016 descending and then by country alphabetically:

rowidx <- order(-gdi[, "Y.2016"], gdi[, "country"])
gdi[rowidx, c(1, 23), drop = FALSE]
##           country     Y.2016
## 1           EU 28 9454683.60
## 2    Euro area 19 6736686.43
## 5  Czech Republic 2523229.00
## 29         Sweden 2402587.00
## 7         Germany 2019917.00
## 32         Norway 1627136.00
## 12         France 1425435.00
## 30 United Kingdom 1330854.00
## 14          Italy 1142273.40
## 23         Poland 1136916.00
## 6         Denmark 1091542.00
## 11          Spain  698701.00
## 33    Switzerland  458641.00
## 21    Netherlands  357383.00
## 3         Belgium  243825.50
## 22        Austria  214980.60
## 24       Portugal  128789.39
## 28        Finland  126590.00
## 9         Ireland   97318.90
## 4        Bulgaria   60237.00
## 27       Slovakia   48882.91
## 26       Slovenia   24756.63
## 17      Lithuania   24743.49
## 18     Luxembourg   20155.80
## 16         Latvia   15737.79
## 8         Estonia   12548.30
## 13        Croatia       0.00
## 15         Cyprus         NA
## 10         Greece         NA
## 19        Hungary         NA
## 31        Iceland         NA
## 20          Malta         NA
## 25        Romania         NA
## 35         Serbia         NA
## 34         Turkey         NA

For non-numeric vectors, we can take advantage of the xtfrm function, which returns a numeric vector which will sort in the same order as the one provided to it. Then we just use - to get a vector that will order in reverse order. To order our GDI dataset by GDI ascending in 2016 and then by country reverse-alphabetically:

rowidx <- order(gdi[, "Y.2016"], -xtfrm(gdi[, "country"]))
gdi[rowidx, c(1, 23), drop = FALSE]
##           country     Y.2016
## 13        Croatia       0.00
## 8         Estonia   12548.30
## 16         Latvia   15737.79
## 18     Luxembourg   20155.80
## 17      Lithuania   24743.49
## 26       Slovenia   24756.63
## 27       Slovakia   48882.91
## 4        Bulgaria   60237.00
## 9         Ireland   97318.90
## 28        Finland  126590.00
## 24       Portugal  128789.39
## 22        Austria  214980.60
## 3         Belgium  243825.50
## 21    Netherlands  357383.00
## 33    Switzerland  458641.00
## 11          Spain  698701.00
## 6         Denmark 1091542.00
## 23         Poland 1136916.00
## 14          Italy 1142273.40
## 30 United Kingdom 1330854.00
## 12         France 1425435.00
## 32         Norway 1627136.00
## 7         Germany 2019917.00
## 29         Sweden 2402587.00
## 5  Czech Republic 2523229.00
## 2    Euro area 19 6736686.43
## 1           EU 28 9454683.60
## 34         Turkey         NA
## 35         Serbia         NA
## 25        Romania         NA
## 20          Malta         NA
## 31        Iceland         NA
## 19        Hungary         NA
## 10         Greece         NA
## 15         Cyprus         NA

Alternatives to base R

Using the tidyverse

The dplyr package comes with a set of very user-friendly functions that are very easy to use, especially in an interactive setting where we know the column names up front, so we can take advantage of the non-standard evaluation:

library(dplyr)
gdi %>% 
  arrange(Y.2016, desc(country)) %>% 
  select(1, 23)
##           country     Y.2016
## 1         Croatia       0.00
## 2         Estonia   12548.30
## 3          Latvia   15737.79
## 4      Luxembourg   20155.80
## 5       Lithuania   24743.49
## 6        Slovenia   24756.63
## 7        Slovakia   48882.91
## 8        Bulgaria   60237.00
## 9         Ireland   97318.90
## 10        Finland  126590.00
## 11       Portugal  128789.39
## 12        Austria  214980.60
## 13        Belgium  243825.50
## 14    Netherlands  357383.00
## 15    Switzerland  458641.00
## 16          Spain  698701.00
## 17        Denmark 1091542.00
## 18         Poland 1136916.00
## 19          Italy 1142273.40
## 20 United Kingdom 1330854.00
## 21         France 1425435.00
## 22         Norway 1627136.00
## 23        Germany 2019917.00
## 24         Sweden 2402587.00
## 25 Czech Republic 2523229.00
## 26   Euro area 19 6736686.43
## 27          EU 28 9454683.60
## 28         Turkey         NA
## 29         Serbia         NA
## 30        Romania         NA
## 31          Malta         NA
## 32        Iceland         NA
## 33        Hungary         NA
## 34         Greece         NA
## 35         Cyprus         NA

If we need to provide the names of the columns instead, we can use the arrange_at:

gdi %>% 
  arrange_at("country", desc) %>%
  arrange_at("Y.2016") %>%
  select(1, 23)
##           country     Y.2016
## 1         Croatia       0.00
## 2         Estonia   12548.30
## 3          Latvia   15737.79
## 4      Luxembourg   20155.80
## 5       Lithuania   24743.49
## 6        Slovenia   24756.63
## 7        Slovakia   48882.91
## 8        Bulgaria   60237.00
## 9         Ireland   97318.90
## 10        Finland  126590.00
## 11       Portugal  128789.39
## 12        Austria  214980.60
## 13        Belgium  243825.50
## 14    Netherlands  357383.00
## 15    Switzerland  458641.00
## 16          Spain  698701.00
## 17        Denmark 1091542.00
## 18         Poland 1136916.00
## 19          Italy 1142273.40
## 20 United Kingdom 1330854.00
## 21         France 1425435.00
## 22         Norway 1627136.00
## 23        Germany 2019917.00
## 24         Sweden 2402587.00
## 25 Czech Republic 2523229.00
## 26   Euro area 19 6736686.43
## 27          EU 28 9454683.60
## 28         Turkey         NA
## 29         Serbia         NA
## 30        Romania         NA
## 31          Malta         NA
## 32        Iceland         NA
## 33        Hungary         NA
## 34         Greece         NA
## 35         Cyprus         NA

Using data.table

There are multiple ways to achieve the desired results with data.table, the one syntactically similar to base R is:

library(data.table)
gdidt <- as.data.table(gdi)
gdidt[order(Y.2016, -country), c(1, 23)]
##            country     Y.2016
##  1:        Croatia       0.00
##  2:        Estonia   12548.30
##  3:         Latvia   15737.79
##  4:     Luxembourg   20155.80
##  5:      Lithuania   24743.49
##  6:       Slovenia   24756.63
##  7:       Slovakia   48882.91
##  8:       Bulgaria   60237.00
##  9:        Ireland   97318.90
## 10:        Finland  126590.00
## 11:       Portugal  128789.39
## 12:        Austria  214980.60
## 13:        Belgium  243825.50
## 14:    Netherlands  357383.00
## 15:    Switzerland  458641.00
## 16:          Spain  698701.00
## 17:        Denmark 1091542.00
## 18:         Poland 1136916.00
## 19:          Italy 1142273.40
## 20: United Kingdom 1330854.00
## 21:         France 1425435.00
## 22:         Norway 1627136.00
## 23:        Germany 2019917.00
## 24:         Sweden 2402587.00
## 25: Czech Republic 2523229.00
## 26:   Euro area 19 6736686.43
## 27:          EU 28 9454683.60
## 28:         Turkey         NA
## 29:         Serbia         NA
## 30:        Romania         NA
## 31:          Malta         NA
## 32:        Iceland         NA
## 33:        Hungary         NA
## 34:         Greece         NA
## 35:         Cyprus         NA
##            country     Y.2016

Another option is to take advantage of the setorderv method provided by data.table. The important distinction is that this will sort the existing data.table in place, changing the source object. The other methods used above leave the source object untouched:

# This will sort the gdidt by reference - changing the input object
setorderv(gdidt, c("Y.2016", "country"), c(1, -1), na.last = TRUE)
# So we now just subset the (already sorted) gdidt
gdidt[, c(1, 23)]
##            country     Y.2016
##  1:        Croatia       0.00
##  2:        Estonia   12548.30
##  3:         Latvia   15737.79
##  4:     Luxembourg   20155.80
##  5:      Lithuania   24743.49
##  6:       Slovenia   24756.63
##  7:       Slovakia   48882.91
##  8:       Bulgaria   60237.00
##  9:        Ireland   97318.90
## 10:        Finland  126590.00
## 11:       Portugal  128789.39
## 12:        Austria  214980.60
## 13:        Belgium  243825.50
## 14:    Netherlands  357383.00
## 15:    Switzerland  458641.00
## 16:          Spain  698701.00
## 17:        Denmark 1091542.00
## 18:         Poland 1136916.00
## 19:          Italy 1142273.40
## 20: United Kingdom 1330854.00
## 21:         France 1425435.00
## 22:         Norway 1627136.00
## 23:        Germany 2019917.00
## 24:         Sweden 2402587.00
## 25: Czech Republic 2523229.00
## 26:   Euro area 19 6736686.43
## 27:          EU 28 9454683.60
## 28:         Turkey         NA
## 29:         Serbia         NA
## 30:        Romania         NA
## 31:          Malta         NA
## 32:        Iceland         NA
## 33:        Hungary         NA
## 34:         Greece         NA
## 35:         Cyprus         NA
##            country     Y.2016

Quick benchmarking

For a quick overview, lets look at a basic benchmark without package loading overhead for each of the mentioned packages. To do the benchmarking, we will use a very slightly modified flights data frame provided by Hadley Wickham’s nycflights13 package.



library(data.table)
library(dplyr)
dataurl <- "https://jozefhajnala.gitlab.io/r/post/data/"
flights <- readRDS(url(paste0(dataurl, "r006/flights.rds")))
flightsdt <- as.data.table(flights)

bench <- microbenchmark::microbenchmark(
  times = 100,
  base_order   = {flights[order(flights[, "flight"], -xtfrm(flights[, "carrier"])), ] },
  dt_oder      = {flightsdt[order(flight, -carrier), ] },
  dplyr_nse    = {flights %>% arrange(flight, desc(carrier)) },
  dplyr_scoped = {flights %>% arrange_at("carrier", desc) %>% arrange_at("flight") }
)

Under our particular circumstances, base R’s method seems to be the lowest of the options with data.table being the fastest.

TL;DR – Just want the code

No time for reading? Click here to get just the code with commentary

References

  1. post on subsetting on this blog
  2. documentation for base::order
  3. dplyr’s arrange function reference
  4. introduction to data.table
  5. original eurostat data source

Did you find the article helpful or interesting? Help others find it by sharing

To leave a comment for the author, please follow the link and comment on their blog: Jozef's Rblog.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers


Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)