Tagged NA values and labelled data #rstats
sjmisc-package: Working with labelled data
A major update of my sjmisc-package was just released on CRAN. A major change (see changelog for all changes) is the support of the latest release of the haven-package, a package to import and export SPSS, SAS or Stata files.
The sjmisc-package mainly addresses three domains:
- reading and writing data between other statistical packages and R
- functions to make working with labelled data easier
- frequently applied recoding and variable transformation tasks, also with support for labelled data
In this post, I want to introduce the topic of labelled data and give some examples of what the sjmisc-package can do, with a special focus on tagged NA values.
Introduction to Labelled Data
Labelled data (or labelled vectors) is a common data structure in other statistical environments. It stores meta-information about variables, such as variable names, value labels or multiple defined missing values. Labelled data not only extends R's capabilities to deal with proper value and variable labels, but also makes it possible to represent different types of missing values, as in other statistical software packages. With regular missing values, R cannot represent multiple declared missings the way 'SPSS' or 'SAS' do. However, Hadley Wickham's haven package introduced tagged_na values, which can do this. Tagged NAs work exactly like regular R missing values, except that they store one additional byte of information: a tag, which is usually a letter ("a" to "z") but may also be a single digit character ("0" to "9"). This makes it possible to distinguish between different kinds of missing values.
library(haven)
x <- labelled(
  c(1:3, tagged_na("a", "c", "z"), 4:1),
  c("Agreement" = 1, "Disagreement" = 4, "First" = tagged_na("c"),
    "Refused" = tagged_na("a"), "Not home" = tagged_na("z"))
)
print(x)
# <Labelled double>
# [1] 1 2 3 NA(a)
# [5] NA(c) NA(z) 4 3 2 1
#
# Labels:
#  value        label
#      1    Agreement
#      4 Disagreement
#  NA(c)        First
#  NA(a)      Refused
#  NA(z)     Not home
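To see how tagged NAs relate to regular missing values, here is a minimal sketch that uses only functions from the haven package (na_tag() and is_tagged_na()). The vector is made up for illustration.

```r
library(haven)

# One regular value, two tagged NAs and one regular NA
x <- c(1, tagged_na("a"), tagged_na("z"), NA)

# Base R treats tagged NAs like any other missing value
is.na(x)
# [1] FALSE  TRUE  TRUE  TRUE

# na_tag() returns the tag, or NA if an element carries no tag
na_tag(x)
# [1] NA  "a" "z" NA

# is_tagged_na() tests for tagged NAs, optionally for one specific tag
is_tagged_na(x)
# [1] FALSE  TRUE  TRUE FALSE
is_tagged_na(x, "a")
# [1] FALSE  TRUE FALSE FALSE
```

Note that tagged NAs can only be stored in double vectors, not in integer vectors.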
Value Labels
Getting value labels
get_labels() is a generic function that returns the value labels of a vector or data frame.
library(sjmisc)
# efc is a sample data set shipped with sjmisc
data(efc)
get_labels(efc$e42dep)
# [1] "independent"          "slightly dependent"
# [3] "moderately dependent" "severely dependent"
You can prefix the value labels with the associated values or return them as a named vector with the include.values argument.
get_labels(efc$e42dep, include.values = "p")
# [1] "[1] independent"          "[2] slightly dependent"
# [3] "[3] moderately dependent" "[4] severely dependent"
get_labels() also returns the "labels" of factors, even if the factor has no label attributes. This is useful if you need a generic way to get value labels in your own functions, whether the input is labelled data or a factor.
x <- factor(c("low", "mid", "low", "hi", "mid", "low"))
get_labels(x)
# [1] "hi"  "low" "mid"
Tagged missing values can also be included in the output, using the drop.na argument.
# get labels, including tagged NA values
x <- labelled(
  c(1:3, tagged_na("a", "c", "z"), 4:1),
  c("Agreement" = 1, "Disagreement" = 4, "First" = tagged_na("c"),
    "Refused" = tagged_na("a"), "Not home" = tagged_na("z"))
)
get_labels(x, include.values = "n", drop.na = FALSE)
#              1              4
#    "Agreement" "Disagreement"
#
#          NA(c)          NA(a)          NA(z)
#        "First"      "Refused"     "Not home"
Getting labelled values
get_values() returns the values of labelled values (i.e. values that have an associated label). We still use the vector x from the above examples.
print(x)
# [1] 1 2 3 NA NA NA 4 3 2 1
# attr(,"labels")
#    Agreement Disagreement        First      Refused     Not home
#            1            4           NA           NA           NA
get_values(x)
# [1] "1"     "4"     "NA(c)" "NA(a)" "NA(z)"
With the drop.na argument you can omit values that are defined as missing from the result.
get_values(x, drop.na = TRUE)
# [1] 1 4
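The same distinction between labelled values and labelled missings can be made without sjmisc by looking at the "labels" attribute directly. The following is only a sketch of that idea using haven's na_tag(), not a description of how get_values() is implemented. The object names labs and tags are chosen for illustration.

```r
library(haven)

x <- labelled(
  c(1:3, tagged_na("a", "c", "z"), 4:1),
  c("Agreement" = 1, "Disagreement" = 4, "First" = tagged_na("c"),
    "Refused" = tagged_na("a"), "Not home" = tagged_na("z"))
)

# The value labels are stored as a named numeric vector
labs <- attr(x, "labels")

# Labels that belong to non-missing values
names(labs)[!is.na(labs)]
# [1] "Agreement"    "Disagreement"

# Tags of the labelled missing values
tags <- na_tag(labs)
unname(tags[!is.na(tags)])
# [1] "c" "a" "z"
```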
Setting value labels
With set_labels() you can add label attributes to any vector. You can either return a new labelled vector, or label an existing vector.
x <- sample(1:4, 20, replace = TRUE)
# return new labelled vector
x <- set_labels(x, c("very low", "low", "mid", "hi"))
x
# [1] 4 2 1 2 4 1 3 1 3 1 1 4 2 4 2 4 3 4 4 4
# attr(,"labels")
# very low      low      mid       hi
#        1        2        3        4
# label existing vector
set_labels(x) <- c("too low", "less low", "mid", "very hi")
x
# [1] 4 2 1 2 4 1 3 1 3 1 1 4 2 4 2 4 3 4 4 4
# attr(,"labels")
#  too low less low      mid  very hi
#        1        2        3        4
To label specific values explicitly, pass a named vector, where the names are the labels and the elements are the values they belong to.
x <- c(1, 2, 3, 2, 4, 5)
x <- set_labels(x, c("strongly agree" = 1, "totally disagree" = 4,
                     "refused" = 5, "missing" = 9))
x
# [1] 1 2 3 2 4 5
# attr(,"labels")
#   strongly agree totally disagree          refused          missing
#                1                4                5                9
Missing Values
Defining missing values
set_na() converts values of a vector or of multiple vectors in a data frame into tagged NAs, which means that these missing values get an information tag and a value label (which is, by default, the former value that was converted to NA). You can either return a new vector/data frame, or set NAs into an existing vector/data frame.
x <- sample(1:8, 100, replace = TRUE)
table(x)
# x
#  1  2  3  4  5  6  7  8
# 10 12  6 13 12 17 18 12
set_na(x) <- c(1, 8)
x
# [1] 2 6 6 NA 7 4 NA 3 6 NA 4 NA 5 4 2 5 2 2 3 2 5 6 NA
# [24] 7 6 4 6 3 4 NA NA 5 NA 6 NA 7 7 7 6 6 NA 7 2 2 NA 6
# [47] 4 6 5 7 5 NA NA 7 4 7 4 3 7 2 6 5 5 7 2 NA 6 6 NA
# [70] 2 5 7 4 7 NA 2 7 7 7 4 6 3 NA 5 5 NA 7 4 3 4 NA 6
# [93] 4 2 NA NA 6 7 5 NA
# attr(,"labels")
#  1  8
# NA NA
table(x, useNA = "always")
# x
#    2    3    4    5    6    7 <NA>
#   12    6   13   12   17   18   22
print_tagged_na(x)
# [1] 2 6 6 NA(8) 7 4 NA(1) 3 6 NA(1) 4
# [12] NA(8) 5 4 2 5 2 2 3 2 5 6
# [23] NA(8) 7 6 4 6 3 4 NA(1) NA(1) 5 NA(1)
# [34] 6 NA(8) 7 7 7 6 6 NA(1) 7 2 2
# [45] NA(1) 6 4 6 5 7 5 NA(8) NA(8) 7 4
# [56] 7 4 3 7 2 6 5 5 7 2 NA(8)
# [67] 6 6 NA(8) 2 5 7 4 7 NA(1) 2 7
# [78] 7 7 4 6 3 NA(1) 5 5 NA(8) 7 4
# [89] 3 4 NA(8) 6 4 2 NA(8) NA(8) 6 7 5
# [100] NA(1)
x <- factor(c("a", "b", "c"))
x
# [1] a b c
# Levels: a b c
set_na(x) <- "b"
x
# [1] a <NA> c
# attr(,"labels")
#  b
# NA
# Levels: a b c
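For readers who want to see what converting values into tagged NAs involves, the following sketch does a similar conversion by hand with haven's tagged_na(). It assumes that the value itself (as a one-character string) serves as the tag, as the NA(1) and NA(8) output above suggests. It is not the actual implementation of set_na(), and it does not add any value labels.

```r
library(haven)

# Tagged NAs need a double vector, so the values are written as doubles
x <- c(2, 8, 6, 1, 7, 8, 3)

# Replace 1 and 8 by NAs tagged with "1" and "8"
# (%in% is used because, unlike ==, it never returns NA)
x[x %in% 1] <- tagged_na("1")
x[x %in% 8] <- tagged_na("8")

# Both values now count as missing ...
sum(is.na(x))
# [1] 3

# ... but the tags keep them apart
na_tag(x)
# [1] NA  "8" NA  "1" NA  "8" NA
```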
Getting missing values
The get_na() function returns all tagged NA values.
set_na(efc$c87cop6) <- 3
get_na(efc$c87cop6)
# Often
#    NA
get_na(efc$c87cop6, as.tag = TRUE)
#   Often
# "NA(3)"
Replacing specific NA with values
While set_na() allows you to replace values with specific tagged NAs, replace_na() allows you to replace either all NA values of a vector or specific tagged NA values with a non-NA value.
str(efc$c84cop3)
# atomic [1:908] 2 3 1 3 1 3 4 2 3 1 ...
# - attr(*, "label")= chr "does caregiving cause difficulties in your relationship with your friends?"
# - attr(*, "labels")= Named num [1:4] 1 2 3 4
# ..- attr(*, "names")= chr [1:4] "Never" "Sometimes" "Often" "Always"
set_na(efc$c84cop3) <- c(2, 3)
str(efc$c84cop3)
# atomic [1:908] NA NA 1 NA 1 NA 4 NA NA 1 ...
# - attr(*, "label")= chr "does caregiving cause difficulties in your relationship with your friends?"
# - attr(*, "labels")= Named num [1:4] 1 NA NA 4
# ..- attr(*, "names")= chr [1:4] "Never" "Sometimes" "Often" "Always"
get_na(efc$c84cop3)
# Sometimes     Often
#        NA        NA
replace_na(efc$c84cop3, na.label = "restored NA", tagged.na = "2") <- 2
str(efc$c84cop3)
# atomic [1:908] 2 NA 1 NA 1 NA 4 2 NA 1 ...
# - attr(*, "label")= chr "does caregiving cause difficulties in your relationship with your friends?"
# - attr(*, "labels")= Named num [1:4] 1 2 4 NA
# ..- attr(*, "names")= chr [1:4] "Never" "restored NA" "Always" "Often"
get_na(efc$c84cop3)
# Often
#    NA
get_labels(efc$c84cop3, include.values = "p")
# [1] "[1] Never"       "[2] restored NA" "[4] Always"
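The idea of restoring one specific tagged NA can also be sketched with haven alone. The example below uses a made-up vector and is_tagged_na() to pick out only the NAs with tag "2". Unlike replace_na(), it does not touch any value labels.

```r
library(haven)

x <- c(1, tagged_na("2"), 4, tagged_na("3"), tagged_na("2"))

# Replace only the NAs tagged "2" and leave NA(3) untouched
x[is_tagged_na(x, "2")] <- 2

x
# [1]  1  2  4 NA  2

na_tag(x)
# [1] NA  NA  NA  "3" NA
```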
Conclusions
Labelled data considerably extends R's capabilities to deal with value and variable labels. The sjmisc-package offers a collection of convenient functions to work with labelled data, which might be of particular interest to users coming from other statistical packages like SPSS who want to switch to R. Packages like sjPlot make use of these features of labelled data, making it easy to produce well-annotated plots (see these vignettes for various examples). A slightly more comprehensive introduction to the sjmisc-package can be found here.