Tagged NA values and labelled data #rstats

[This article was first published on R – Strenge Jacke!, and kindly contributed to R-bloggers].

sjmisc-package: Working with labelled data

A major update of my sjmisc-package was just released on CRAN. A major change (see the changelog for all changes) is support for the latest release of the haven-package, a package to import and export SPSS, SAS or Stata files.

The sjmisc-package mainly addresses three domains:

  • reading and writing data between other statistical packages and R
  • functions to make working with labelled data easier
  • frequently applied recoding and variable transformation tasks, also with support for labelled data

In this post, I want to introduce the topic of labelled data and give some examples of what the sjmisc-package can do, with a special focus on tagged NA values.

Introduction to Labelled Data

Labelled data (or labelled vectors) is a common data structure in other statistical environments for storing meta-information about variables, like variable names, value labels or multiple defined missing values. Labelled data not only extends R’s capabilities to deal with proper value and variable labels, but also makes it possible to represent different types of missing values, as in other statistical software packages. R’s regular missing values cannot represent multiple declared missings the way ‘SPSS’ or ‘SAS’ do. However, Hadley Wickham’s haven package introduced tagged_na values, which can. Tagged NAs work exactly like regular R missing values, except that they store one additional byte of information: a tag, which is usually a letter (“a” to “z”) but may also be a digit character (“0” to “9”). This makes it possible to distinguish different kinds of missing values.

library(haven)
x <- labelled(
  c(1:3, tagged_na("a", "c", "z"), 4:1),
  c("Agreement" = 1, "Disagreement" = 4, 
    "First" = tagged_na("c"),
    "Refused" = tagged_na("a"), 
    "Not home" = tagged_na("z"))
)

print(x)
# <Labelled double>
#  [1]     1     2     3 NA(a) 
#  [5] NA(c) NA(z)     4     3     2     1
# 
# Labels:
#  value        label
#      1    Agreement
#      4 Disagreement
#  NA(c)        First
#  NA(a)      Refused
#  NA(z)     Not home
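To illustrate the point that tagged NAs behave like regular missing values while carrying a tag, here is a minimal sketch that uses only haven's own helpers (tagged_na(), na_tag() and is_tagged_na()). The vector name y is chosen just for this illustration.

```r
library(haven)

# A double vector with two tagged NAs and one regular NA
y <- c(1, 2, tagged_na("a", "z"), NA)

# Base R treats tagged and regular NAs alike
missing_flags <- is.na(y)

# haven can recover the tag; regular values and regular NAs have no tag
tags <- na_tag(y)

# Only NAs that carry a tag are reported as tagged
tagged_flags <- is_tagged_na(y)
tagged_a <- is_tagged_na(y, "a")

# Summary functions skip tagged NAs like any other missing value
y_mean <- mean(y, na.rm = TRUE)
```

Note that tags can only be stored in double vectors, which is why the values above are doubles rather than integers.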

Value Labels

Getting value labels

The get_labels()-method is a generic method to return value labels of a vector or data frame.

library(sjmisc)
# efc is a sample data set shipped with the package
data(efc)
get_labels(efc$e42dep)
# [1] "independent"          "slightly dependent"
# [3] "moderately dependent" "severely dependent"

You can prefix the value labels with the associated values, or return them as a named vector, using the include.values argument.

get_labels(efc$e42dep, include.values = "p")
# [1] "[1] independent"          "[2] slightly dependent"  
# [3] "[3] moderately dependent" "[4] severely dependent"

get_labels() also returns “labels” of factors, even if the factor has no label attributes. This is useful if you need a generic method for your functions to get value labels, either for labelled data or for factors.

x <- factor(c("low", "mid", "low", "hi", "mid", "low"))
get_labels(x)
# [1] "hi"  "low" "mid"
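The idea of one accessor that works for both labelled vectors and factors can be sketched in base R. The helper value_labels() below is hypothetical and is not how sjmisc implements get_labels(); it only assumes that value labels are stored in a "labels" attribute, as in the printed examples in this post.

```r
# Hypothetical helper: return value labels for labelled vectors or factors
value_labels <- function(x) {
  labs <- attr(x, "labels", exact = TRUE)
  if (!is.null(labs)) {
    # labelled vector: the labels are the names of the "labels" attribute
    return(names(labs))
  }
  if (is.factor(x)) {
    # factor without label attribute: fall back to the factor levels
    return(levels(x))
  }
  NULL
}

f <- factor(c("low", "mid", "low", "hi", "mid", "low"))
factor_labels <- value_labels(f)

v <- c(1, 2, 4, 1)
attr(v, "labels") <- c("Agreement" = 1, "Disagreement" = 4)
vector_labels <- value_labels(v)
```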

Tagged missing values can also be included in the output, using the drop.na argument.

# get labels, including tagged NA values
x <- labelled(
  c(1:3, tagged_na("a", "c", "z"), 4:1),
  c("Agreement" = 1, "Disagreement" = 4, 
    "First" = tagged_na("c"),
    "Refused" = tagged_na("a"), 
    "Not home" = tagged_na("z"))
)

get_labels(x, include.values = "n", drop.na = FALSE)
#              1              4          
#    "Agreement" "Disagreement"
#
#      NA(c)          NA(a)          NA(z) 
#    "First"      "Refused"     "Not home"

Getting labelled values

The get_values() method returns the labelled values of a vector, i.e. those values that have an associated label. We still use the vector x from the above examples.

print(x)
# 1 4 NA NA NA 
#  [1]  1  2  3 NA NA NA  4  3  2  1
# attr(,"labels")
#    Agreement Disagreement        First      Refused     Not home 
#            1            4           NA           NA           NA

get_values(x)
# [1] "1"     "4"     "NA(c)" "NA(a)" "NA(z)"

With the drop.na argument, you can omit values that are defined as missing from the result.

get_values(x, drop.na = TRUE)
# [1] 1 4
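What dropping missing values from the labelled values amounts to can be shown on a plain named vector of labels. This is a sketch using base R and haven's na_tag(), not the sjmisc implementation; the name labs is chosen for this illustration.

```r
library(haven)

# Value labels as a named double vector, including two tagged NAs
labs <- c("Agreement" = 1, "Disagreement" = 4,
          "First" = tagged_na("c"), "Refused" = tagged_na("a"))

# Labelled values without those defined as missing
non_missing_values <- unname(labs[!is.na(labs)])

# Tags of the labelled values (no tag for regular values)
label_tags <- unname(na_tag(labs))
```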

Setting value labels

With set_labels() you can add label attributes to any vector. You can either return a new labelled vector, or label an existing vector.

x <- sample(1:4, 20, replace = TRUE)

# return new labelled vector
x <- set_labels(x, c("very low", "low", "mid", "hi"))
x
#  [1] 4 2 1 2 4 1 3 1 3 1 1 4 2 4 2 4 3 4 4 4
# attr(,"labels")
# very low      low      mid       hi 
#        1        2        3        4

# label existing vector
set_labels(x) <- c("too low", "less low", 
                   "mid", "very hi")
x
#  [1] 4 2 1 2 4 1 3 1 3 1 1 4 2 4 2 4 3 4 4 4
# attr(,"labels")
#  too low less low      mid  very hi 
#        1        2        3        4

To add explicit labels for values, use a named vector of labels as argument.

x <- c(1, 2, 3, 2, 4, 5)
x <- set_labels(x, c("strongly agree" = 1, 
                     "totally disagree" = 4, 
                     "refused" = 5,
                     "missing" = 9))
x
# [1] 1 2 3 2 4 5
# attr(,"labels")
#   strongly agree totally disagree          refused          missing 
#                1                4                5                9
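As the printed output shows, the labels are stored as a named vector in the "labels" attribute. The following base R sketch builds the same structure by hand and looks up the label text for each observation; values without a label yield NA. The names z, z_labs and label_text are made up for this example.

```r
z <- c(1, 2, 3, 2, 4, 5)

# Attach value labels manually as a named vector
attr(z, "labels") <- c("strongly agree" = 1,
                       "totally disagree" = 4,
                       "refused" = 5,
                       "missing" = 9)

# Look up the label for each observation; unlabelled values give NA
z_labs <- attr(z, "labels")
label_text <- names(z_labs)[match(as.vector(z), z_labs)]
```

A label such as "missing" = 9 may be defined even though the value does not occur in the data.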

Missing Values

Defining missing values

set_na() converts values of a vector, or of multiple vectors in a data frame, into tagged NAs. This means that these missing values get an information tag and a value label (which is, by default, the former value that was converted to NA). You can either return a new vector/data frame, or set NAs in an existing vector/data frame.

x <- sample(1:8, 100, replace = TRUE)
table(x)
# x
#  1  2  3  4  5  6  7  8 
# 10 12  6 13 12 17 18 12

set_na(x) <- c(1, 8)
x
#   [1]  2  6  6 NA  7  4 NA  3  6 NA  4 NA  5  4  2  5  2  2  3  2  5  6 NA
#  [24]  7  6  4  6  3  4 NA NA  5 NA  6 NA  7  7  7  6  6 NA  7  2  2 NA  6
#  [47]  4  6  5  7  5 NA NA  7  4  7  4  3  7  2  6  5  5  7  2 NA  6  6 NA
#  [70]  2  5  7  4  7 NA  2  7  7  7  4  6  3 NA  5  5 NA  7  4  3  4 NA  6
#  [93]  4  2 NA NA  6  7  5 NA
# attr(,"labels")
#  1  8 
# NA NA

table(x, useNA = "always")
# x
#    2    3    4    5    6    7 <NA> 
#   12    6   13   12   17   18   22

print_tagged_na(x)
#   [1]     2     6     6 NA(8)     7     4 NA(1)     3     6 NA(1)     4
#  [12] NA(8)     5     4     2     5     2     2     3     2     5     6
#  [23] NA(8)     7     6     4     6     3     4 NA(1) NA(1)     5 NA(1)
#  [34]     6 NA(8)     7     7     7     6     6 NA(1)     7     2     2
#  [45] NA(1)     6     4     6     5     7     5 NA(8) NA(8)     7     4
#  [56]     7     4     3     7     2     6     5     5     7     2 NA(8)
#  [67]     6     6 NA(8)     2     5     7     4     7 NA(1)     2     7
#  [78]     7     7     4     6     3 NA(1)     5     5 NA(8)     7     4
#  [89]     3     4 NA(8)     6     4     2 NA(8) NA(8)     6     7     5
# [100] NA(1)

x <- factor(c("a", "b", "c"))
x
# [1] a b c
# Levels: a b c

set_na(x) <- "b" 
x
# [1] a    <NA> c   
# attr(,"labels")
#  b 
# NA 
# Levels: a b c
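The principle behind converting values into tagged NAs can be sketched with haven alone. This is not the code set_na() runs internally; it only assumes, as the print_tagged_na() output above suggests, that the former value is used as the tag. The names w and to_na are made up for this example.

```r
library(haven)

w <- c(2, 6, 8, 1, 7, 8)
to_na <- c(1, 8)

# Replace each value with a tagged NA that uses the former value as its tag
for (v in to_na) {
  w[which(w == v)] <- tagged_na(as.character(v))
}

# All converted values are missing now, but remain distinguishable by tag
n_missing <- sum(is.na(w))
w_tags <- na_tag(w)

print_tagged_na(w)
```

which() is used for indexing because comparisons against values that are already NA return NA rather than FALSE.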

Getting missing values

The get_na() function returns all tagged NA values.

set_na(efc$c87cop6) <- 3
get_na(efc$c87cop6)
# Often 
#    NA

get_na(efc$c87cop6, as.tag = TRUE)
#   Often 
# "NA(3)"

Replacing specific NA with values

While set_na() allows you to replace values with specific tagged NAs, replace_na() allows you to replace either all NA values of a vector or specific tagged NA values with a non-NA value.

str(efc$c84cop3)
#  atomic [1:908] 2 3 1 3 1 3 4 2 3 1 ...
#  - attr(*, "label")= chr "does caregiving cause difficulties in your relationship with your friends?"
#  - attr(*, "labels")= Named num [1:4] 1 2 3 4
#   ..- attr(*, "names")= chr [1:4] "Never" "Sometimes" "Often" "Always"

set_na(efc$c84cop3) <- c(2, 3)
str(efc$c84cop3)
#  atomic [1:908] NA NA 1 NA 1 NA 4 NA NA 1 ...
#  - attr(*, "label")= chr "does caregiving cause difficulties in your relationship with your friends?"
#  - attr(*, "labels")= Named num [1:4] 1 NA NA 4
#   ..- attr(*, "names")= chr [1:4] "Never" "Sometimes" "Often" "Always"

get_na(efc$c84cop3)
# Sometimes     Often 
#        NA        NA

replace_na(efc$c84cop3, na.label = "restored NA", tagged.na = "2") <- 2
str(efc$c84cop3)
#  atomic [1:908] 2 NA 1 NA 1 NA 4 2 NA 1 ...
#  - attr(*, "label")= chr "does caregiving cause difficulties in your relationship with your friends?"
#  - attr(*, "labels")= Named num [1:4] 1 2 4 NA
#   ..- attr(*, "names")= chr [1:4] "Never" "restored NA" "Always" "Often"

get_na(efc$c84cop3)
# Often 
#    NA

get_labels(efc$c84cop3, include.values = "p")
# [1] "[1] Never"       "[2] restored NA" "[4] Always"
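Restoring only one specific tagged NA comes down to selecting by tag. The sketch below uses haven's is_tagged_na() on a small made-up vector r; it illustrates the idea and does not update any value labels the way replace_na() does.

```r
library(haven)

r <- c(1, tagged_na("2"), tagged_na("3"), 4, tagged_na("2"))

# Replace only the NAs tagged "2" with the value 2
r[is_tagged_na(r, "2")] <- 2

# NAs with other tags are left untouched
remaining_tags <- na_tag(r)
restored <- r[!is.na(r)]
```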

Conclusions

Labelled data vastly extends R’s capabilities to deal with value and variable labels. The sjmisc-package offers a collection of convenient functions to work with labelled data, which might be of interest especially for users coming from other statistical packages like SPSS who want to switch to R. Packages like sjPlot make use of the features of labelled data, making it easy to produce well-annotated plots (see these vignettes for various examples). A slightly more comprehensive introduction to the sjmisc-package can be found here.


Tagged: labelled data, R, rstats, sjPlot, SPSS
