**Jozef's Rblog**, and kindly contributed to R-bloggers)

# Contents

# Introduction

In the previous article we discussed and learned how to reshape data with base R to a form that is practical for our use. In this one, we will look at basic data manipulation techniques, namely obtaining relevant subsets of our data. The key will be safety and avoiding complication and confusion as much as possible. This is why we:

- try to avoid using
`subset`

, as this function is implemented via non-standard evaluation. - also skip
`$`

as it uses partial matching and is hardly used with variables as column names. - not mention more details related to
`list`

properties of`data.frames`

here as the topic could get confusing. If you would like to go to more important detail, we strongly recommend a thorough read of the subsetting chapter of Hadley Wickham’s Advanced R

# How to use this article

- This article is best used with an R session opened in a window next to it – you can test and play with the code yourself instantly while reading. Assuming the author did not fail miserably, the code will work as-is even with vanilla R, no packages or setup needed – it is a
`case4base`

after all! - If you have no time for reading, you can click here to get just the code with commentary

First, let’s read in yearly data on gross disposable income of household in the EU countries into R (click here to download):

```
gdi <- read.csv(
stringsAsFactors = FALSE
, url("https://jozef.io/post/data/ESA2010_GDI.csv")
)
head(gdi[, 1:6, drop = FALSE])
```

```
## country Y.1995 Y.1996 Y.1997 Y.1998 Y.1999
## 1 EU 28 NA NA NA NA 5982392.8
## 2 Euro area 19 NA NA NA NA 4393727.3
## 3 Belgium 140734.1 141599.4 145023.2 149705.2 153804.0
## 4 Bulgaria 1036.0 1468.1 12367.4 14921.1 16052.8
## 5 Czech Republic 894042.0 1030001.0 1153966.0 1223783.0 1280040.0
## 6 Denmark 566363.0 578102.0 591416.0 621236.0 614893.0
```

Please note that the figures in the data provided by Eurostat are presented in millions of euros for euro area countries, euro area and EU aggregates and in millions of national currency otherwise. This makes comparing the results between countries difficult, since one would need to do a proper time-dependent currency conversion and potentially inflation adjustment to get comparable data.

The goal of the article is therefore not really in presenting these concrete results, but to focus on the technical aspects and usefulness of the presented methods.

# Selecting (subsetting) relevant data from a `data.frame`

In this paragraph, we will try to show how to subset with as little hassle as possible while preserving the maximum safety in your operations. We shall go into more detail later in the article. The standard approach to subsetting `data.frames`

can be summarised:

`dataframe_name[row_subset, col_subset, drop = FALSE]`

Where:

`dataframe_name`

is the name of the`data.frame`

we are subsetting`row_subset`

is a vector specifying the subset of rows`col_subset`

is a vector specifying the subset of columns`drop = FALSE`

is to make sure the result does not get simplified when not indented. This should always be used, unless we specifically want to simplify the result (e.g. to a vector for indexing)

# Constructing meaningful subsets simply and safely

In practice, we of course will most of the time not select rows and/or columns with positions known apriori, but based on more variable conditions. For this purpose, the advised way would be is to construct logical vectors:

Let us now subset the rows of our data to get the data for countries that have known (not `NA`

) value in the year 2016 and this value is less than 1 million:

```
rowidx <- !is.na(gdi[, "Y.2016"]) & gdi[, "Y.2016"] < 1000000
gdi[rowidx, c(1, 23), drop = FALSE]
```

```
## country Y.2016
## 3 Belgium 243825.50
## 4 Bulgaria 60237.00
## 8 Estonia 12548.30
## 9 Ireland 97318.90
## 11 Spain 698701.00
## 13 Croatia 0.00
## 16 Latvia 15737.79
## 17 Lithuania 24743.49
## 18 Luxembourg 20155.80
## 21 Netherlands 357383.00
## 22 Austria 214980.60
## 24 Portugal 128789.39
## 26 Slovenia 24756.63
## 27 Slovakia 48882.91
## 28 Finland 126590.00
## 33 Switzerland 458641.00
```

Note that when creating the

`rowidx`

we omitted the`drop = FALSE`

despite the aforementioned best practice. This is because in this particular case we consciously welcome the result being simplified to a vector, as its use is only as an index for subsetting.

# More ways to provide subset indices

Subsetting can be done in a few ways. We will now use them to show a subset the first two and the 27th row and the first, 22nd and 23rd column, giving us the GDI for EU28, Euro Area 19 and Slovakia in the years 2015 and 2016:

- Logical vectors
`TRUE`

for rows/columns to subset,`FALSE`

for those to omit

```
st1 <- gdi[c(TRUE, TRUE, rep(FALSE, 24), TRUE, rep(FALSE, 8))
, c(TRUE, rep(FALSE, 20), rep(TRUE, 2))
, drop = FALSE
]
```

- Numeric vectors of row/column numbers to subset

```
st2 <- gdi[c(1:2, 27)
, c(1, 22:23)
, drop = FALSE
]
```

- Negative numeric vectors of row/column numbers to omit

```
st3 <- gdi[c(-3:-26, -28:-35)
, c(-2:-21)
, drop = FALSE
]
```

- Character vectors of row/column names to subset

```
st4 <- gdi[c("1", "2", "27") # we do not have very meaningful rownames
, c("country", "Y.2015", "Y.2016")
, drop = FALSE
]
st4
```

```
## country Y.2015 Y.2016
## 1 EU 28 9439578.39 9454683.60
## 2 Euro area 19 6598231.27 6736686.43
## 27 Slovakia 47464.71 48882.91
```

- All of the above give identical results

`identical(st1, st2) && identical(st2, st3) && identical(st3, st4)`

`## [1] TRUE`

## Tips

- The above methods are also working and safe for matrices
- Negative and positive numeric vectors cannot be combined

# Alternatives to base R

- dplyr::select and dplyr::filter
- Using data.table

# TL;DR – Just want the code

No time for reading? Click here to get just the code with commentary

# Exercises

- What is the difference between
`gdi[3, 3]`

and`gdi[3, 3, drop = FALSE]`

? - What is the difference between
`gdi[-3, 3]`

and`gdi[3, -3]`

? What about`gdi[-3, 3, drop = FALSE]`

? - Why cannot we omit the first part of the & in
`rowidx <- !is.na(gdi[, "Y.2016"]) & gdi[, "Y.2016"] < 1000000`

. What would happen if we just did`rowidx <- gdi[, "Y.2016"] < 1000000`

? - Bonus question 1: Why is
`identical(gdi[, "Y.2016", drop = FALSE], gdi["Y.2016"])`

- Bonus question 2: Why is
`identical(gdi[, "Y.2016"], gdi[["Y.2016"]])`

# References

- Advanced R’s chapter on subsetting
- and on data types
- original eurostat data source

# Exercise answers

**leave a comment**for the author, please follow the link and comment on their blog:

**Jozef's Rblog**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...