# R Vocabulary – Part 1

By **Anindya Mozumdar**, kindly contributed to R-bloggers.


To be a proficient R user, you need to read and understand the material in the book Advanced R by Hadley Wickham. The second chapter in this book is on vocabulary – a list of functions from the *base*, *stats* and *utils* packages which all R users should be familiar with. In a series of posts, we will attempt to learn most of the functions mentioned in the chapter using some examples.

We will skip the function *?* and start with *str*. According to its documentation, *str* can be used to display the internal structure of an R object. Let us look at a few simple examples first.

x <- c(1, 2, 3)
str(x)

## num [1:3] 1 2 3

x <- c(1L, 2L)
str(x)

## int [1:2] 1 2

x <- c(TRUE, FALSE, TRUE, TRUE)
str(x)

## logi [1:4] TRUE FALSE TRUE TRUE

x <- c("a", "b", "c")
str(x)

## chr [1:3] "a" "b" "c"

x <- c(1 + 2i, 3 + 0i, 1i)
str(x)

## cplx [1:3] 1+2i 3+0i 0+1i

str(charToRaw("radmuzom"))

## raw [1:8] 72 61 64 6d ...

From the above examples, we see that for atomic vectors, *str* displays the type, the number of elements in the vector and the first few elements. What happens if we apply *str* to functions?

str(c)

## function (...)

str(ls)

## function (name, pos = -1L, envir = as.environment(pos), all.names = FALSE,
## pattern, sorted = TRUE)

str(print)

## function (x, ...)

It’s interesting to see that the output is different for different functions. That is because *c* is a primitive function, *ls* is an R function while *print* is an S3 generic function. This can be verified by typing the function name in the console without any parentheses. An explanation of primitive functions or S3 generics is beyond the scope of this post.

Let us now look at lists.

l <- list(x = 1, a = "A")
str(l)

## List of 2
## $ x: num 1
## $ a: chr "A"

l2 <- list(m = matrix(1:4, nrow = 2), l = l)
str(l2)

## List of 2
## $ m: int [1:2, 1:2] 1 2 3 4
## $ l:List of 2
## ..$ x: num 1
## ..$ a: chr "A"

l3 <- list(l = l, l2 = l2, w = rnorm(10))
str(l3)

## List of 3
## $ l :List of 2
## ..$ x: num 1
## ..$ a: chr "A"
## $ l2:List of 2
## ..$ m: int [1:2, 1:2] 1 2 3 4
## ..$ l:List of 2
## .. ..$ x: num 1
## .. ..$ a: chr "A"
## $ w : num [1:10] -0.0122 0.5986 0.9694 -0.7869 -1.3261 ...

From the output, we notice that *str* displays the names of the list elements, their class and a basic structure similar to the one we saw for vectors. Use the *max.level* argument to restrict the level of nesting in the output.

str(l3, max.level = 2)

## List of 3
## $ l :List of 2
## ..$ x: num 1
## ..$ a: chr "A"
## $ l2:List of 2
## ..$ m: int [1:2, 1:2] 1 2 3 4
## ..$ l:List of 2
## $ w : num [1:10] -0.0122 0.5986 0.9694 -0.7869 -1.3261 ...

A common use of *str* is to compactly look at the structure of a dataset.

str(InsectSprays)

## 'data.frame': 72 obs. of 2 variables:
## $ count: num 10 7 20 14 14 12 10 23 17 20 ...
## $ spray: Factor w/ 6 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...

The above example shows that this dataset is a *data.frame* object comprising 72 observations and 2 variables. The first variable is *count*, which is a numeric vector while the second variable is *spray* which is a factor with 6 levels.

From the above examples, it is hopefully clear that if you are unsure of what an R object is, *str* provides information useful for understanding its structure. For datasets, it also shows the number of rows and columns.

*%in%* and *match* are most useful for finding the elements of one vector in another vector.

x <- c(6, 37)
y <- sample(1:100, 1000, replace = TRUE)
x %in% y

## [1] TRUE TRUE

match(x, y)

## [1] 26 119

which(y == 6)

## [1] 26 156 190 233 295 316 360 390 492 618 648 667 968 987

Note that the length of the result returned by *%in%* is the same as the length of the first argument. *match* returns only the index of the first occurrence of each value of *x* in *y*.
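The relationship between the two functions is easier to see with small fixed vectors instead of a random sample. The sketch below uses made-up values (including 500, which is deliberately absent from *y*) to show that *match* returns *NA* when there is no match, and that *%in%* gives the same answer as asking whether *match* found a position.

```r
# Small fixed vectors (illustrative values) so the results are reproducible
x <- c(6, 37, 500)
y <- c(10, 6, 37, 6)

in_result <- x %in% y          # TRUE TRUE FALSE
match_result <- match(x, y)    # 2 3 NA: first positions only, NA if absent

# %in% agrees with checking whether match() found a position
same <- identical(in_result, match(x, y, nomatch = 0L) > 0L)
same
```

Note that 6 occurs twice in *y*, but *match* reports only position 2.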

We won’t spend too much time on *=*, *<-* and *<<-* in this article. However, do remember that these are functions and we can use backticks to call them in the “usual” way for functions. The *->* and *->>* operators are rarely used.

`<-`(x, 3)
x

## [1] 3

1 -> x
x

## [1] 1
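The examples above do not show *<<-*. As a minimal sketch (the function name `make_counter` is our own, purely for illustration), the following closure uses *<<-* to modify a variable in the enclosing function's environment rather than creating a new local variable.

```r
# make_counter is a hypothetical helper used only for illustration
make_counter <- function() {
  count <- 0
  function() {
    # <<- looks for 'count' in the enclosing environments and updates it there
    count <<- count + 1
    count
  }
}

counter <- make_counter()
counter()
counter()
third <- counter()
third   # 3
```

With *<-* in place of *<<-*, each call would create a fresh local *count* and always return 1.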

*$*, *[* and *[[* are operators which act on vectors, matrices, arrays or lists to extract or replace parts. They are described in great detail in the chapter **Subsetting**.
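As a brief illustration of these operators, the following sketch uses a small made-up list and matrix to show the difference between extracting an element and extracting a sub-list, and how the same operators are used for replacement.

```r
l <- list(x = 1:3, a = "A")

by_dollar <- l$a      # "A"
element <- l[["x"]]   # the element itself: an integer vector
sublist <- l["x"]     # a list of length 1 containing x

l$a <- "B"            # the same operators also replace parts

m <- matrix(1:6, nrow = 2)   # filled column by column
cell <- m[2, 3]              # row 2, column 3, which is 6
```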

*head* returns the first parts of a variety of different objects, but is most useful for vectors or data frames. *tail* works similarly but returns the last parts of the object.

y <- sample(1:100, 1000, replace = TRUE)
head(y)

## [1] 87 50 25 46 2 33

head(cars)

##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10

head(cars, n = 10)

##    speed dist
## 1      4    2
## 2      4   10
## 3      7    4
## 4      7   22
## 5      8   16
## 6      9   10
## 7     10   18
## 8     10   26
## 9     10   34
## 10    11   17

head(ls)

##
## 1 function (name, pos = -1L, envir = as.environment(pos), all.names = FALSE,
## 2 pattern, sorted = TRUE)
## 3 {
## 4 if (!missing(name)) {
## 5 pos <- tryCatch(name, error = function(e) e)
## 6 if (inherits(pos, "error")) {

*subset* is used to return the parts of a vector, matrix or data frame which meet the conditions provided as an argument to the function. It is most useful for data frames.

subset(cars, speed < 10 & dist > 10)

##   speed dist
## 4     7   22
## 5     8   16

subset(cars, speed < 10 & dist > 10, select = speed)

##   speed
## 4     7
## 5     8
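For comparison, the same selection can be written with the *[* operator. This is only a sketch of one equivalent form; *drop = FALSE* is needed to keep the single-column result as a data frame.

```r
# Using subset(): column names can be used directly in the condition
a <- subset(cars, speed < 10 & dist > 10, select = speed)

# Using `[`: columns must be referenced through the data frame
b <- cars[cars$speed < 10 & cars$dist > 10, "speed", drop = FALSE]

isTRUE(all.equal(a, b))
```

The *subset* version is shorter to type interactively, while the *[* version makes explicit where each variable comes from.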

*with* is used to evaluate an R expression in an environment constructed from data. For interactive use, it usually saves some typing and is nicer to read.

For example, instead of

plot(cars$speed, cars$dist)

one can use

with(cars, plot(speed, dist))

*assign* is used to assign a value to a name in an environment.

rm(x)
f <- function() {
  assign("x", 1, pos = 1)
}
f()
x

## [1] 1

In the above example, the *pos* argument is a positive integer which denotes the position in the search list. Position 1 is the global environment, so *x* is given the value *1* there even though the assignment happens inside *f*.

*get* is in some sense the opposite of *assign*: it returns the value of the object whose name is given as a string.

x <- 3
g <- function() {
  get("x")
}
g()

## [1] 3

*all.equal* is used to compare objects and report differences.

x <- c(2, 3)
y <- c(2, 3)
all.equal(x, y)

## [1] TRUE

x <- 1
all.equal(x, y)

## [1] "Numeric: lengths (1, 2) differ"

x <- c(1, 2)
all.equal(x, y)

## [1] "Mean relative difference: 0.6666667"

l1 <- list(x = c(1, 2), y = c("A", "B"))
l2 <- list(x = c(1, 2))
all.equal(l1, l2)

## [1] "Length mismatch: comparison on first 1 components"

l2 <- list(x = c(1, 2), y = c("C", "D"))
all.equal(l1, l2)

## [1] "Component \"y\": 2 string mismatches"

l2 <- list(x = c(1, 2), y = c("A", "B"))
all.equal(l1, l2)

## [1] TRUE
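One property not shown above is that *all.equal* compares numbers up to a small tolerance, which makes it suitable for floating point results. Since it returns either *TRUE* or a character description of the difference, it is usually wrapped in *isTRUE* when a single logical value is needed. A minimal sketch:

```r
a <- 0.1 + 0.2
b <- 0.3

exact <- a == b                         # FALSE because of floating point error
approx <- all.equal(a, b)               # TRUE: equal within the default tolerance

# all.equal() returns a string when objects differ, so wrap it in isTRUE()
differs <- isTRUE(all.equal(1, 1.1))    # FALSE
```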

*identical* is used to safely and reliably test whether two objects are exactly equal. In *if* or *while* statements, and in logical expressions which use *&&* or *||*, *identical* will ensure that a single logical value is obtained.

2 == c(1, 2)

## [1] FALSE TRUE

identical(2, c(1, 2))

## [1] FALSE

1 == NULL

## logical(0)

identical(1, NULL)

## [1] FALSE

identical(1, 1.0)

## [1] TRUE

identical(1, 1L)

## [1] FALSE
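The following sketch shows why this matters inside an *if* statement. The function `check_value` is a hypothetical helper written for this example; because *identical* always returns a single *TRUE* or *FALSE*, the condition is well defined for any input, including vectors of length other than one and *NULL*.

```r
# check_value is a hypothetical helper used only for illustration
check_value <- function(x) {
  if (identical(x, 1)) "exactly the double 1" else "something else"
}

r1 <- check_value(1)         # "exactly the double 1"
r2 <- check_value(1L)        # "something else": integer, not double
r3 <- check_value(c(1, 1))   # "something else": length 2
r4 <- check_value(NULL)      # "something else"
```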

We will not look at the relational operators *!=*, *==*, *>*, *>=*, *<* and *<=* in detail here. However, it is worth remembering that these operators are vectorized and follow R's recycling rule: if one of the vectors is shorter than the other, the elements of the shorter vector are recycled.

x <- c(1, 2)
y <- c(1, 2)
x < y

## [1] FALSE FALSE

x <- 1
x < y

## [1] FALSE TRUE
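Recycling also applies when the shorter vector has more than one element. In the sketch below (with made-up values), *y* is reused as 2, 4, 2, 4 for the comparison, and *any* and *all* reduce the result to a single logical value.

```r
x <- c(1, 5, 3, 7)
y <- c(2, 4)

# y is recycled to 2 4 2 4 before the element-wise comparison
res <- x < y           # TRUE FALSE FALSE FALSE

some_less <- any(res)  # TRUE
all_less <- all(res)   # FALSE
```

If the longer length is not a multiple of the shorter one, R still recycles but issues a warning.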

*is.na* should be used to test whether elements are missing. Note that one should not use the *==* relational operator.

x <- NA
is.na(x)

## [1] TRUE

x == NA

## [1] NA

Also, note that there are separate constants for missing values of the atomic vector types.

x <- c(NA, NA)
class(x)

## [1] "logical"

x <- c(NA, 1.0)
class(x)

## [1] "numeric"

x <- c(NA_character_, NA_character_)
class(x)

## [1] "character"
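For reference, the typed missing value constants can be inspected directly with *typeof*. The plain *NA* is logical and is coerced to the other types as needed, which is why *c(NA, 1.0)* above is numeric.

```r
t_lgl <- typeof(NA)             # "logical"
t_int <- typeof(NA_integer_)    # "integer"
t_dbl <- typeof(NA_real_)       # "double"
t_chr <- typeof(NA_character_)  # "character"

# is.na() recognises all of them as missing
all_missing <- all(is.na(list(NA, NA_integer_, NA_real_, NA_character_)))
```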

*complete.cases* is used to check which cases have no missing values and is most useful with data frames. For data frames, it returns a logical vector specifying which rows have no missing values across the entire sequence.

d <- data.frame(
  x = c(1, NA, 2),
  y = c("A", "B", NA)
)
complete.cases(d)

## [1] TRUE FALSE FALSE
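A typical use of this logical vector is to keep only the complete rows of a data frame. The sketch below repeats the data frame from above so that it is self-contained, and also counts the incomplete rows.

```r
d <- data.frame(
  x = c(1, NA, 2),
  y = c("A", "B", NA)
)

# Keep only the rows with no missing values in any column
kept <- d[complete.cases(d), ]
kept

# Number of rows with at least one missing value
n_incomplete <- sum(!complete.cases(d))   # 2
```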

*is.finite* returns a logical vector specifying which elements are finite. Note that *NaN* (“not a number”) is neither finite nor infinite, so *is.finite* returns *FALSE* when evaluated with *NaN* as the argument.

x <- c(1, 3.0, Inf, NaN, 7)
is.finite(x)

## [1] TRUE TRUE FALSE FALSE TRUE
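It can help to compare *is.finite* with the related functions *is.infinite*, *is.nan* and *is.na* on the same vector. A short sketch:

```r
x <- c(1, Inf, -Inf, NaN, NA)

fin <- is.finite(x)     # TRUE  FALSE FALSE FALSE FALSE
inf <- is.infinite(x)   # FALSE TRUE  TRUE  FALSE FALSE
nan <- is.nan(x)        # FALSE FALSE FALSE TRUE  FALSE
mis <- is.na(x)         # FALSE FALSE FALSE TRUE  TRUE
```

Note that *is.na* is *TRUE* for both *NA* and *NaN*, while *is.nan* is *TRUE* only for *NaN*.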

The basic math functions are explained via the examples below. While the examples use “scalar” values in most cases, all the operations are vectorized. Examples using the trigonometric functions are not provided.

5 * 3

## [1] 15

`*`(5, 3)

## [1] 15

5.1 * 2L

## [1] 10.2

5 * (2 + 3i)

## [1] 10+15i

(2 + 3i) + 7

## [1] 9+3i

(2 + 3i) - 7

## [1] -5+3i

3 / 5

## [1] 0.6

3L / 5L

## [1] 0.6

(3 + 7i) / 6

## [1] 0.5+1.166667i

2 ^ 3

## [1] 8

2.2 ^ 7.5

## [1] 369.9731

(2 + 3i) ^ 3

## [1] -46+9i

(2 + 3i) ^ (3 + 4i)

## [1] -0.2045529+0.8966233i

7 %% 5 # remainder

## [1] 2

7 %/% 5 # integer division

## [1] 1

abs(5)

## [1] 5

abs(5 + 3i)

## [1] 5.830952

abs(-5)

## [1] 5

sign(2)

## [1] 1

sign(-2)

## [1] -1

sign(0)

## [1] 0

sign(2 + 3i)

## Error in sign(2 + (0+3i)): unimplemented complex function

ceiling(c(3.2, 3.8))

## [1] 4 4

floor(c(3.2, 3.8))

## [1] 3 3

trunc(c(3.2, 3.8))

## [1] 3 3

round(c(3.2, 3.8))

## [1] 3 4

round(c(3.275, 3.811), digits = 2)

## [1] 3.27 3.81

signif(c(3.2, 3.8))

## [1] 3.2 3.8

signif(c(3.275, 3.811), digits = 2)

## [1] 3.3 3.8

round(-2.3)

## [1] -2

round(33, digits = -1) # nearest 10

## [1] 30

round(75, digits = -2) # nearest 100

## [1] 100

exp(1)

## [1] 2.718282

exp(-1)

## [1] 0.3678794

log(3)

## [1] 1.098612

log(-2)

## Warning in log(-2): NaNs produced

## [1] NaN

log(exp(3))

## [1] 3

exp(log(3))

## [1] 3

log10(100)

## [1] 2

log2(1024)

## [1] 10

sqrt(25)

## [1] 5

sqrt(-25)

## Warning in sqrt(-25): NaNs produced

## [1] NaN

sqrt(3 + 4i)

## [1] 2+1i
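To illustrate that these operations are vectorized, the sketch below applies *%/%* and *%%* to a whole vector (with made-up values, including a negative one) and checks the identity that connects them. A few trigonometric calls are included as well; they are compared with *all.equal* because floating point results are rarely exact.

```r
x <- c(7, -7, 7.5)
y <- 5

q <- x %/% y   # 1 -2 1   (integer division rounds towards -Inf)
r <- x %% y    # 2  3 2.5 (remainder has the sign of the divisor)

# The two operators satisfy x == q * y + r
ok <- isTRUE(all.equal(x, q * y + r))

# Trigonometric functions work in radians
trig_ok <- isTRUE(all.equal(sin(pi / 6), 0.5)) &&
  isTRUE(all.equal(cos(pi), -1)) &&
  isTRUE(all.equal(atan2(1, 1), pi / 4))
```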

*max* finds the maximum element across numeric or character vectors. *pmax* does an element-by-element comparison and returns the largest of the first elements, the largest of the second elements and so on. If the vectors are not of equal length, the elements of the shorter vectors are recycled.

max(c(1, 2.3), c(2.7, 1.5), c(4, 2.2))

## [1] 4

pmax(c(1, 2.3), c(2.7, 1.5), c(4, 2.2))

## [1] 4.0 2.3
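Two details are worth a quick sketch: missing values propagate unless *na.rm = TRUE* is given, and recycling makes *pmax* a convenient way to clamp a vector at a lower bound. The values below are made up for illustration.

```r
with_na <- max(c(1, NA, 3))                   # NA
without_na <- max(c(1, NA, 3), na.rm = TRUE)  # 3

# pmin is the element-wise counterpart of min
lowest <- pmin(c(1, 2.3), c(2.7, 1.5), c(4, 2.2))   # 1.0 1.5

# The scalar 2 is recycled, so every element below 2 is raised to 2
clamped <- pmax(c(1, 5, 3), 2)                # 2 5 3
```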

*min* and *pmin* work in the same way as above. *prod* and *sum* calculate the product and sum of the values present in their arguments. *diff* is used to calculate lagged differences between values (the default lag is 1).

prod(rnorm(10) + 1)

## [1] -0.009277813

sum(rnorm(10) + 1)

## [1] 3.521682

prod(c(1i, 1 + 2i))

## [1] -2+1i

diff(1:10)

## [1] 1 1 1 1 1 1 1 1 1

diff(1:10, lag = 3)

## [1] 3 3 3 3 3 3 3
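Because the differences of *1:10* are all equal, a sequence of squares makes the behaviour of *diff* easier to see. The sketch below also uses the *differences* argument, which applies *diff* repeatedly.

```r
x <- c(1, 4, 9, 16, 25)

d1 <- diff(x)                    # 3 5 7 9
d_lag2 <- diff(x, lag = 2)       # 8 12 16 (x[i + 2] - x[i])
d2 <- diff(x, differences = 2)   # 2 2 2   (same as diff(diff(x)))
```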

The cumulative versions of *max*, *min*, *prod* and *sum* return the cumulative results as a vector. For the nth element, it will apply the function on the nth element and the result of the cumulative function till the (n-1)th element.

x <- 1:10
cumsum(x)

## [1] 1 3 6 10 15 21 28 36 45 55

cumprod(x)

## [1] 1 2 6 24 120 720 5040 40320
## [9] 362880 3628800

cummax(x)

## [1] 1 2 3 4 5 6 7 8 9 10

cummin(x)

## [1] 1 1 1 1 1 1 1 1 1 1
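Since *1:10* is increasing, *cummax* simply returns the vector itself and *cummin* returns a run of ones. A vector that is not sorted (made-up values below) shows the running behaviour more clearly.

```r
x <- c(3, 1, 4, 1, 5)

s <- cumsum(x)    # 3 4  8  9 14
p <- cumprod(x)   # 3 3 12 12 60
hi <- cummax(x)   # 3 3  4  4  5  (running maximum)
lo <- cummin(x)   # 3 1  1  1  1  (running minimum)
```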

Next we will look at some of the basic descriptive statistical functions. The mean, median, standard deviation and variance of a variable are calculated as follows.

x <- rnorm(10)
mean(x)

## [1] 0.5545163

median(x)

## [1] 0.3892076

sd(x)

## [1] 0.7962907

var(x)

## [1] 0.6340788
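The sketch below checks two relationships between these functions: *sd* is the square root of *var*, and *var* uses the sample formula with n - 1 in the denominator. The seed value is arbitrary and only makes the random sample reproducible. As with *max*, missing values need *na.rm = TRUE*.

```r
set.seed(42)   # arbitrary seed, for reproducibility only
x <- rnorm(10)

# sd() is the square root of var()
sd_ok <- isTRUE(all.equal(sd(x), sqrt(var(x))))

# var() divides by n - 1 (sample variance)
manual_var <- sum((x - mean(x))^2) / (length(x) - 1)
var_ok <- isTRUE(all.equal(var(x), manual_var))

m <- mean(c(1, NA, 3), na.rm = TRUE)   # 2
med <- median(c(1, 100, 3))            # 3: the median is robust to the outlier
```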

*cor* is used to calculate the correlation between a pair of variables. The *method* argument is used to specify which method to use – the Pearson correlation coefficient, Kendall’s rank correlation or Spearman’s rank correlation. The default method is the Pearson coefficient.

x <- rnorm(100)
y <- rnorm(100)
cor(x, y)

## [1] -0.06544386

cor(x, y, method = "kendall")

## [1] -0.02909091

cor(x, y, method = "spearman")

## [1] -0.03924392
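With independent random samples, all three coefficients are close to zero, so the difference between the methods is hard to see. In the sketch below (deterministic, made-up data), *y* is a monotone but non-linear function of *x*: the rank-based methods report a perfect association, while the Pearson coefficient, which measures linear association, is below 1.

```r
x <- 1:10
y <- x^3   # monotone increasing, but not linear in x

pearson <- cor(x, y)                        # less than 1
spearman <- cor(x, y, method = "spearman")  # 1: the ranks agree perfectly
kendall <- cor(x, y, method = "kendall")    # 1: all pairs are concordant

# A perfectly linear relationship gives a Pearson coefficient of 1 or -1
linear_up <- cor(x, 2 * x + 1)   # 1
linear_down <- cor(x, -x)        # -1
```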
