For better or worse I spend some time each day at Stack Overflow [r], reading and answering questions. If you do the same, you probably notice certain features in questions that recur frequently. It’s as though everyone is copying from one source – perhaps the one at the top of the search results. And it seems highest-ranked is not always best.
Nowhere is this more apparent to me than in the way many users create data frames. So here is my introductory guide “how not to create data frames”, aimed at beginners writing their first questions.
1. No need for vectors
There is no need to create vectors first and then add them as columns:
x <- 1:2
y <- 3:4
df <- data.frame(x, y)
# just do this!
df <- data.frame(x = 1:2, y = 3:4)
If you really need the columns as vectors, they can always be obtained using the $ operator: df$x and df$y.
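As a minimal sketch of pulling columns back out as vectors (the data frame is the one from the example above; the double-bracket form is shown only as an equivalent alternative):

```r
df <- data.frame(x = 1:2, y = 3:4)

# extract a column as a plain vector, by name
x <- df$x

# double brackets do the same job and also accept a name held in a variable
y <- df[["y"]]
```

Either way you get an ordinary vector, so nothing is lost by building the data frame in one step.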
While we’re here, that df thing…
2. …df is not a great variable name
Sure, you can call a variable df and R will know when you mean that variable and when you mean the function, df(). But why risk the confusion, when you could just call it something else? Like df1.
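To see that R really does keep the two apart, here is a small sketch. It relies on df() being the density function of the F distribution from the stats package; the argument values are arbitrary and the names before and after are made up for illustration.

```r
# df() from the stats package: density of the F distribution
before <- df(1, df1 = 2, df2 = 3)

# now mask the name with a data frame
df <- data.frame(x = 1:2, y = 3:4)

# R looks for a function when it sees df(...), so the call still works
after <- df(1, df1 = 2, df2 = 3)
identical(before, after)
```

It works, but anyone reading the code now has to work out which df is which.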
3. No need to convert from a matrix
Here’s another rather bizarre way to make a data frame that I often see:
df1 <- matrix(1:4, ncol = 2, nrow = 2)
df1 <- as.data.frame(df1)
# or perhaps to name columns
df1 <- matrix(1:4, 2, 2, dimnames = list(c(1, 2), c("x", "y")))
df1 <- as.data.frame(df1)
Which would again be better achieved simply using
df1 <- data.frame(x = 1:2, y = 3:4)
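Using a matrix is especially problematic when you want to mix variable types, which is possible in data frames but not in matrices. Here, our numbers become characters in the matrix and hence factors in the data frame (at least in versions of R before 4.0.0, where strings were converted to factors by default; from 4.0.0 onwards they stay as characters, which is still not what you want for numbers):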
df1 <- matrix(c(1:2, letters[1:2]), 2, 2, dimnames = list(c(1, 2), c("x", "y")))
df1 <- as.data.frame(df1)
# oh look, your numbers are now factors, that's not what you want
str(df1)
'data.frame': 2 obs. of 2 variables:
 $ x: Factor w/ 2 levels "1","2": 1 2
  ..- attr(*, "names")= chr "1" "2"
 $ y: Factor w/ 2 levels "a","b": 1 2
  ..- attr(*, "names")= chr "1" "2"
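The damage is done in the matrix itself, before the data frame is ever involved. A quick check, offered as a sketch (the names m and direct are made up for illustration):

```r
# a matrix holds a single type, so mixing numbers and letters
# coerces everything to character
m <- matrix(c(1:2, letters[1:2]), 2, 2)
typeof(m)

# building the data frame directly keeps the numbers as numbers
direct <- data.frame(x = 1:2, y = letters[1:2])
class(direct$x)
```

The first call reports a character matrix; the second reports an integer column.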
Which brings us to…
4. …No strings as factors
df1 <- data.frame(x = 1:2, y = letters[1:2], stringsAsFactors = FALSE)
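A short sketch of what the argument changes, with the caveat that from R 4.0.0 onwards stringsAsFactors = FALSE is already the default, so setting it explicitly matters mostly on older installations. The names with_factors and without_factors are made up for illustration.

```r
# strings converted to factors (the default before R 4.0.0)
with_factors <- data.frame(x = 1:2, y = letters[1:2], stringsAsFactors = TRUE)
class(with_factors$y)

# strings left alone
without_factors <- data.frame(x = 1:2, y = letters[1:2], stringsAsFactors = FALSE)
class(without_factors$y)
```

The first column y is a factor, the second a character vector. Being explicit costs little and makes the code behave the same on any version of R.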
5. Consider the alternatives and use the inbuilt help
You might consider the newer tibble, in which strings are never factors, amongst other advantages such as pretty printing with information about variables. The syntax is just the same:
library(tibble)
df1 <- tibble(x = 1:2, y = 3:4)
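To check the claim about strings, here is a minimal sketch. It assumes the tibble package is installed; the name tb is made up for illustration.

```r
library(tibble)

# a tibble with a character column
tb <- tibble(x = 1:2, y = letters[1:2])

# strings stay as strings
class(tb$y)

# and a tibble is still a data frame underneath
is.data.frame(tb)
```

Because a tibble is still a data frame, most code written for data frames should continue to work with it.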
And when you know the command name, data.frame for example, help is only “?” + command_name away. It isn’t always the best documentation, but it does generally tell you all you need to know.
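For example, any of the following will tell you about data.frame. This is a sketch of the usual routes in an interactive session; args() is included only as a quick way to see the argument list without leaving the console.

```r
# open the help page for data.frame
?data.frame

# the same thing, written as a function call
help("data.frame")

# just print the function's arguments and their defaults
args(data.frame)
```

The argument list alone shows stringsAsFactors, along with its default for the version of R you are running.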