In this post, I try to show you when using a data frame is appropriate, and when it is not.

## What is a data frame?

A data frame is just a list of vectors of the same length, each vector being a column.

This may convince you:

str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
is.list(iris)
## [1] TRUE
length(iris)
## [1] 5
sapply(iris, typeof)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species
##     "double"     "double"     "double"     "double"    "integer"
sapply(iris, length)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species
##          150          150          150          150          150
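
To make the "list of vectors" point concrete, here is a minimal base R sketch that builds a data frame by hand from a plain list. The names `cols` and `df` and the toy values are made up for this illustration.

```r
# A plain list of two vectors of the same length (hypothetical example data)
cols <- list(id = 1:3, label = c("a", "b", "c"))

# Adding a class and row names is enough for R to treat the list as a data frame
df <- structure(cols, class = "data.frame", row.names = 1:3)

is.data.frame(df)   # is it recognised as a data frame?
dim(df)             # number of rows and columns
unclass(df)         # drop the class to see the underlying list again
```

In practice you would call `data.frame()` or `as.data.frame()`, which also check that all columns have compatible lengths. The sketch only shows that nothing more than a list sits underneath.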

## What is a list?

A list is just a vector of references to objects in memory.

x <- 1:1e6
pryr::object_size(x)
## 4 MB
y <- list(x, x, x)
pryr::object_size(y)
## 4 MB
## [1] "000000001E49C530"
## [1] "000000001E49C530" "000000001E49C530" "000000001E49C530"

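So, basically, here y is a vector of 3 references, each pointing to the same object x in memory: the addresses printed above (first the address of x, then the addresses of the 3 elements of y) are all identical. This is very efficient because x does not need to be copied 3 times when creating y.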
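
One possible way to check this kind of sharing yourself is sketched below. It assumes the {lobstr} package is installed, which is not necessarily the tool used for the output above. `lobstr::obj_addr()` returns the address of an object and `lobstr::obj_addrs()` returns the addresses of the elements of a list.

```r
# Assumption: {lobstr} is installed; the toy vector is made up for illustration
x <- c(1.5, 2.5, 3.5)
y <- list(x, x, x)

addr_x <- lobstr::obj_addr(x)    # address of the vector bound to x
addr_y <- lobstr::obj_addrs(y)   # addresses of the three elements of y
all(addr_y == addr_x)            # do all elements point to x?

# Modifying one element makes R copy that element only (copy-on-modify)
y[[1]][1] <- 0
addr_y_after <- lobstr::obj_addrs(y)
addr_y_after == addr_x           # element-wise comparison with the address of x
x                                # x itself is left untouched
```

After the modification, only the first element of y should live at a new address; the other two still refer to x.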

## Using package {dplyr}

Using {dplyr} operations such as mutate or select is very efficient.

• select:

library(dplyr)
mydf <- iris
mydf2 <- select(mydf, -Species)
##       Sepal.Length        Sepal.Width       Petal.Length        Petal.Width            Species
## "000000001CE852F8" "000000001BC64EB8" "000000000B965428" "000000000B39A758" "000000000B356168"
##       Sepal.Length        Sepal.Width       Petal.Length        Petal.Width
## "000000001CE852F8" "000000001BC64EB8" "000000000B965428" "000000000B39A758"

So, when you use select, you get a new object: a new data frame (a new list). Yet, remember that a list is nothing but a vector of references. Comparing the addresses above, the 4 columns kept in mydf2 sit at exactly the same addresses as in mydf. So, this is extremely efficient because select only creates a new vector of 4 references pointing to objects already in memory.

• mutate:

mydf3 <- mutate(iris, Species = as.character(Species))
##       Sepal.Length        Sepal.Width       Petal.Length        Petal.Width            Species
## "000000001CE852F8" "000000001BC64EB8" "000000000B965428" "000000000B39A758" "000000000B356168"
##       Sepal.Length        Sepal.Width       Petal.Length        Petal.Width            Species
## "000000001CE852F8" "000000001BC64EB8" "000000000B965428" "000000000B39A758" "0000000020451AB0"

The same holds when using mutate. You get a new object, yet you modified only the 5th variable. So, the first 4 variables don’t have to be copied: your new data frame (list) can just point to the same 4 vectors in memory. R only creates a new character vector and points to it in the new object.

So, adding/removing/modifying one variable of a data frame is efficient because R doesn’t have to copy the other variables.
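
Here is a hedged sketch of how such address comparisons could be reproduced, assuming the {lobstr} package is installed alongside {dplyr}. The variable names `addr_orig`, `addr_select` and `addr_mutate` are introduced for this illustration only.

```r
# Assumption: {dplyr} and {lobstr} are installed
library(dplyr)

mydf  <- iris
mydf2 <- select(mydf, -Species)
mydf3 <- mutate(mydf, Species = as.character(Species))

# A data frame is a list, so obj_addrs() gives one address per column
addr_orig   <- lobstr::obj_addrs(mydf)
addr_select <- lobstr::obj_addrs(mydf2)
addr_mutate <- lobstr::obj_addrs(mydf3)

# Columns kept by select(): same addresses if they are shared, not copied
addr_select == addr_orig[1:4]

# Columns after mutate(): only the modified 5th column should differ
addr_mutate == addr_orig
```

The actual addresses change from one R session to another, so only the comparisons are meaningful.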

## What about modifying one row of a data frame?

If you modify the first row of a data frame, then you modify the first element of each variable. If there are multiple references to these vectors, R has to copy them all before modifying them, which gives you a full copy of the data frame.
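
As a small illustration of this copy, the sketch below uses a made-up two-column data frame and, as an assumption, the {lobstr} package to look at column addresses before and after overwriting one row.

```r
# Assumption: {lobstr} is installed; df1 and df2 are hypothetical toy objects
df1 <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))
df2 <- df1                            # no copy yet: both names share the same columns

before <- lobstr::obj_addrs(df2)      # column addresses before the modification
df2[1, ] <- df2[2, ]                  # overwrite the first row, touching every column
after <- lobstr::obj_addrs(df2)       # column addresses after the modification

before == lobstr::obj_addrs(df1)      # df1 still points to the original columns
after == before                       # every column of df2 had to be copied
```

With only two short columns the cost is negligible, but the same mechanism applies to a data frame with many long columns, where each row update can trigger a copy of all of them.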

mydf4 <- mydf3
## "0000000029BAB238" "0000000029BAB718" "000000002841AB70" "000000002841B050" "000000002841B530"