# R Vocabulary – Part 4

By **Anindya Mozumdar**, kindly contributed to R-bloggers.

This is the fourth and final part in the series of articles on R vocabulary. In this series, we explore most of the functions mentioned in Chapter 2 of the book Advanced R. The first, second and third parts of the series were published earlier.

In this article, we explore most of the functions mentioned under the heading *Statistics* in the chapter.

The *duplicated* function returns a vector of logical values to indicate which elements of a vector are duplicates. Only the second and later occurrences of a value are marked; the first occurrence is not. It can also be used to test whether a data frame has duplicate rows. In this case it returns a vector of logical values, with one value corresponding to each row of the data frame. In the examples below, the values 1 and 2 are duplicated in the vector and the row a, p, 1 is duplicated in the data frame.

    duplicated(c(1, 2, 1, 3, 2, 2))
    ## [1] FALSE FALSE  TRUE FALSE  TRUE  TRUE

    d <- data.frame(x = c("a", "b", "a"), y = c("p", "p", "p"), z = c(1, 2, 1))
    duplicated(d)
    ## [1] FALSE FALSE  TRUE

*unique* removes the duplicate elements from a vector or the duplicate rows from a data frame. *NA* and *NaN* are treated as distinct values. Also, if you provide a vector of values via the *incomparables* argument, those values will never be marked as duplicates.

    x <- sample(1:3, 5, replace = TRUE)
    x
    ## [1] 2 1 1 1 2
    unique(x)
    ## [1] 2 1
    unique(c(1, 2, NA, 3, 3, 1))
    ## [1]  1  2 NA  3
    unique(c(1, NA, NaN, NA, 3))
    ## [1]   1  NA NaN   3
    unique(c(1, NA, NaN, NA, 3, 1), incomparables = NA)
    ## [1]   1  NA NaN  NA   3
    unique(d)
    ##   x y z
    ## 1 a p 1
    ## 2 b p 2
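
A related idiom, sketched here with a made-up vector `v`, is to use the negated result of *duplicated* as a logical index. For a vector this gives the same result as *unique*. The *fromLast* argument scans from the end, so the last occurrence of each value is the one that is kept.

```r
# Hypothetical vector, chosen only for illustration
v <- c(1, 2, 1, 3, 2, 2)

# Keep the first occurrence of each value; same result as unique(v)
first_kept <- v[!duplicated(v)]

# Keep the last occurrence of each value instead
last_kept <- v[!duplicated(v, fromLast = TRUE)]

first_kept
last_kept
```

The second form is useful when later records should override earlier ones.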

*merge* is used to perform join operations between two data frames. By using the *by* and *all* arguments, different types of joins can be implemented. By default, all the column names common to the two data frames are used as join keys and an inner join is performed. In the example below, the data frames are merged using the column *x*. As there are three rows with the value 3 in the first data frame, and two rows with the value 3 in the second data frame, the result has six rows with the value 3, because every matching pair of rows is combined (a Cartesian product).

    d1 <- data.frame(
      x = c(3, 3, 3, 1, 1),
      y = rnorm(5)
    )
    d1
    ##   x          y
    ## 1 3 -0.7109941
    ## 2 3  1.3956645
    ## 3 3 -0.0801967
    ## 4 1  0.2057660
    ## 5 1  2.4069664
    d2 <- data.frame(
      x = c(1, 2, 3, 3, 2),
      z = rnorm(5)
    )
    d2
    ##   x           z
    ## 1 1  0.71737085
    ## 2 2  0.95465791
    ## 3 3 -0.34674981
    ## 4 3  0.75841649
    ## 5 2  0.02969553
    merge(d1, d2)
    ##   x          y          z
    ## 1 1  0.2057660  0.7173708
    ## 2 1  2.4069664  0.7173708
    ## 3 3 -0.7109941 -0.3467498
    ## 4 3 -0.7109941  0.7584165
    ## 5 3  1.3956645 -0.3467498
    ## 6 3  1.3956645  0.7584165
    ## 7 3 -0.0801967 -0.3467498
    ## 8 3 -0.0801967  0.7584165

The remaining examples demonstrate a few of the arguments which can be used with *merge*. In the last example, the value 2, which appears in *d2* but not in *d1*, is included, and the variables from *d1* are set to *NA* for these rows.

    merge(d1, d2, by = "x") # same as above
    ##   x          y          z
    ## 1 1  0.2057660  0.7173708
    ## 2 1  2.4069664  0.7173708
    ## 3 3 -0.7109941 -0.3467498
    ## 4 3 -0.7109941  0.7584165
    ## 5 3  1.3956645 -0.3467498
    ## 6 3  1.3956645  0.7584165
    ## 7 3 -0.0801967 -0.3467498
    ## 8 3 -0.0801967  0.7584165
    names(d2) <- c("x2", "z")
    merge(d1, d2, by.x = "x", by.y = "x2") # specify the join keys
    ##   x          y          z
    ## 1 1  0.2057660  0.7173708
    ## 2 1  2.4069664  0.7173708
    ## 3 3 -0.7109941 -0.3467498
    ## 4 3 -0.7109941  0.7584165
    ## 5 3  1.3956645 -0.3467498
    ## 6 3  1.3956645  0.7584165
    ## 7 3 -0.0801967 -0.3467498
    ## 8 3 -0.0801967  0.7584165
    merge(d1, d2, by.x = "x", by.y = "x2", all.y = TRUE) # right join
    ##    x          y           z
    ## 1  1  0.2057660  0.71737085
    ## 2  1  2.4069664  0.71737085
    ## 3  2         NA  0.95465791
    ## 4  2         NA  0.02969553
    ## 5  3 -0.7109941 -0.34674981
    ## 6  3 -0.7109941  0.75841649
    ## 7  3  1.3956645 -0.34674981
    ## 8  3  1.3956645  0.75841649
    ## 9  3 -0.0801967 -0.34674981
    ## 10 3 -0.0801967  0.75841649
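
The other join types follow the same pattern. The sketch below uses two small, made-up data frames (`left` and `right` are hypothetical names) with fixed values, so the row counts can be checked by eye: *all.x = TRUE* gives a left join and *all = TRUE* gives a full outer join.

```r
# Hypothetical data frames, chosen only for illustration
left  <- data.frame(id = c(1, 2, 3), a = c("x", "y", "z"))
right <- data.frame(id = c(2, 3, 4), b = c(10, 20, 30))

# Inner join: only ids present in both data frames (2 and 3)
inner <- merge(left, right, by = "id")

# Left join: every row of `left`, with NA in b where there is no match
left_join <- merge(left, right, by = "id", all.x = TRUE)

# Full outer join: every id from either data frame (1 to 4)
full_join <- merge(left, right, by = "id", all = TRUE)

inner
left_join
full_join
```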

*order* takes a vector of values, and returns another vector which gives the positions of the values in the original vector, listed in sorted order. For example, the value -1.27 is the lowest value in the example below, and it appears as the 8th element in the original vector. So the first element in the vector returned by *order* will be 8. Similarly, 0.13 is the 7th smallest value and appears in the 5th position in the original vector - so the 7th element in the vector returned by *order* will be 5.

    set.seed(123)
    x <- round(rnorm(10), 2)
    x
    ##  [1] -0.56 -0.23  1.56  0.07  0.13  1.72  0.46 -1.27 -0.69 -0.45
    sort(x)
    ##  [1] -1.27 -0.69 -0.56 -0.45 -0.23  0.07  0.13  0.46  1.56  1.72
    order(x)
    ##  [1]  8  9  1 10  2  4  5  7  3  6

It is easy to see that this can be used to sort a data frame by one variable. In the example below, we re-arrange the rows of *d* in the sorted order of *d$x*, thus sorting the data frame by *x*.

    d <- data.frame(x = rnorm(10), y = rnorm(10))
    d
    ##             x          y
    ## 1   1.2240818 -1.0678237
    ## 2   0.3598138 -0.2179749
    ## 3   0.4007715 -1.0260044
    ## 4   0.1106827 -0.7288912
    ## 5  -0.5558411 -0.6250393
    ## 6   1.7869131 -1.6866933
    ## 7   0.4978505  0.8377870
    ## 8  -1.9666172  0.1533731
    ## 9   0.7013559 -1.1381369
    ## 10 -0.4727914  1.2538149
    d[order(d$x), ]
    ##             x          y
    ## 8  -1.9666172  0.1533731
    ## 5  -0.5558411 -0.6250393
    ## 10 -0.4727914  1.2538149
    ## 4   0.1106827 -0.7288912
    ## 2   0.3598138 -0.2179749
    ## 3   0.4007715 -1.0260044
    ## 7   0.4978505  0.8377870
    ## 9   0.7013559 -1.1381369
    ## 1   1.2240818 -1.0678237
    ## 6   1.7869131 -1.6866933
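
*order* also accepts several vectors, using the later ones to break ties in the earlier ones, and a *decreasing* argument. As a minimal sketch with a made-up data frame `scores`, the following sorts by one column ascending and another descending; negating a numeric column is one common way to reverse its order.

```r
# Hypothetical data frame, chosen only for illustration
scores <- data.frame(
  team  = c("b", "a", "b", "a"),
  score = c(40, 30, 20, 10)
)

# Sort by team (ascending), then by score (descending) within each team
by_team <- scores[order(scores$team, -scores$score), ]

# Sort the whole data frame by score, largest first
by_score <- scores[order(scores$score, decreasing = TRUE), ]

by_team
by_score
```

The row names of the result show where each row sat in the original data frame.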

*rank* is used to calculate the sample ranks of the values in a vector. The argument *ties.method* controls how duplicate values are handled. In the example below, values 1 and 2 get the ranks 1.0 and 2.0 respectively. Since the value 3 occurs four times, the default method assigns each occurrence the average of the ranks they occupy - so the ranks 3, 4, 5 and 6 are averaged to get a value of 4.5. The value 4 then gets the rank 7.0 and so on. Using *ties.method = "min"* results in a ranking similar to sports competitions, where everyone with the same value gets the same (lowest) rank and the next rank skips ahead by the number of tied values.

    set.seed(123)
    x <- sample(1:5, 10, replace = TRUE)
    x
    ##  [1] 2 4 3 5 5 1 3 5 3 3
    rank(x)
    ##  [1] 2.0 7.0 4.5 9.0 9.0 1.0 4.5 9.0 4.5 4.5
    rank(x, ties.method = "min")
    ##  [1] 2 7 3 8 8 1 3 8 3 3
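
Two further values of *ties.method* can be compared on the same ten values, typed in directly here as `vals` so that the sketch does not depend on the random seed. With `"max"`, tied values share the highest rank they occupy; with `"first"`, ties are broken by position, which gives the same result as applying *order* twice.

```r
# The ten values from the example above, entered by hand
vals <- c(2, 4, 3, 5, 5, 1, 3, 5, 3, 3)

# Tied values all receive the largest rank in their group
rank_max <- rank(vals, ties.method = "max")

# Ties are broken by position, so every rank is distinct
rank_first <- rank(vals, ties.method = "first")

rank_max
rank_first

# Applying order twice reproduces the "first" ranking
order(order(vals))
```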

*quantile* calculates the quantiles of a vector, with the *probs* argument specifying which quantiles should be calculated. For example, to obtain the deciles of a vector, we set *probs* to a sequence from 0 to 1 in steps of 0.1. In the example below, the sample is drawn from a standard normal distribution, so the median is close to 0 but may not be exactly 0 due to randomness.

    x <- rnorm(1000)
    hist(x)

    quantile(x, probs = seq(0, 1, by = 0.1))
    ##           0%          10%          20%          30%          40%
    ## -2.809774679 -1.284947198 -0.803345118 -0.490699560 -0.216418270
    ##          50%          60%          70%          80%          90%
    ##  0.002773854  0.239657414  0.510534009  0.838337601  1.250304606
    ##         100%
    ##  3.241039935
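
With a small, fixed vector the result can be checked by hand. The sketch below uses a made-up vector `v` and relies on the default quantile algorithm (type 7); other values of the *type* argument can give slightly different answers.

```r
# Hypothetical vector, chosen only for illustration
v <- 1:9

# Quartiles: the 25%, 50% and 75% quantiles
quartiles <- quantile(v, probs = c(0.25, 0.5, 0.75))
quartiles

# The 50% quantile agrees with the median
median(v)
```

The result is a named vector, so a single quantile can be extracted by name, for example `quartiles[["50%"]]`.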

*sort*, as the name suggests, is used to sort a vector. Complex numbers are first sorted by the real part, and then the imaginary part.

    sort(rnorm(10))
    ##  [1] -1.2187118 -0.4469593 -0.2112469  0.2497257  0.4690320  0.6851982
    ##  [7]  1.0405735  2.4162074  2.7973911  2.8322260
    sort(sample(letters[1:3], 10, replace = TRUE))
    ##  [1] "a" "a" "b" "b" "b" "b" "c" "c" "c" "c"
    sort(sample(c(TRUE, FALSE), 10, replace = TRUE))
    ##  [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
    sort(rnorm(10) + rnorm(10) * 1i)
    ##  [1] -1.1969352-1.0223473i -0.9544489-0.0889306i -0.4580181-0.2608322i
    ##  [4]  0.2285570-1.1368931i  0.3001316+0.6061303i  0.4199516+0.0549120i
    ##  [7]  0.7212208+1.8221888i  0.9356037+0.4640912i  1.4152763+0.4283320i
    ## [10]  1.6535472+0.2669183i
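
A small sketch with a made-up vector shows two arguments of *sort* that are not used above: *decreasing* reverses the order, and *na.last* controls what happens to missing values, which are dropped by default.

```r
# Hypothetical vector with a missing value
v <- c(3, 1, NA, 2)

asc     <- sort(v)                     # NA is dropped by default
desc    <- sort(v, decreasing = TRUE)  # largest value first
with_na <- sort(v, na.last = TRUE)     # keep NA and place it at the end

asc
desc
with_na
```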

*table* and *ftable* are used to build contingency tables. They accept one or more objects which can be interpreted as factors and create a contingency table of the counts at each combination of levels. The key difference between the two functions is that *ftable* creates a 'flat' table, a single matrix whose rows and columns correspond to the combinations of the levels.

    table(sample(letters[1:2], 10, replace = TRUE),
          sample(letters[3:4], 10, replace = TRUE),
          sample(letters[5:6], 10, replace = TRUE))
    ## , ,  = e
    ##
    ##
    ##     c d
    ##   a 3 1
    ##   b 0 1
    ##
    ## , ,  = f
    ##
    ##
    ##     c d
    ##   a 0 3
    ##   b 1 1
    ftable(sample(letters[1:2], 10, replace = TRUE),
           sample(letters[3:4], 10, replace = TRUE),
           sample(letters[5:6], 10, replace = TRUE))
    ##      e f
    ##
    ## a c  2 0
    ##   d  1 1
    ## b c  2 1
    ##   d  2 1
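
Because the examples above sample at random, the counts change from run to run. The following sketch uses small hand-written vectors (`size`, `color` and `shape` are made-up names) so that the counts can be verified by eye. It also shows that a *table* with three factors is a three-dimensional array, while *ftable* flattens the same counts into a single matrix.

```r
# Hypothetical categorical data, chosen only for illustration
size  <- c("s", "s", "l", "l", "l", "s")
color <- c("red", "blue", "red", "red", "blue", "red")
shape <- c("sq", "sq", "ci", "ci", "sq", "ci")

# Two-way table: cells can be indexed by level name
counts <- table(size, color)
counts
counts["l", "red"]

# Three-way table is a 2 x 2 x 2 array
counts3 <- table(size, color, shape)
dim(counts3)

# ftable puts size and color in the rows and shape in the columns
flat <- ftable(counts3)
flat
dim(flat)
```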

We now look at some functions related to building statistical models. Many of these functions are generic functions - so they behave differently based on the type of statistical model being fitted. Let us fit a linear regression where we try to predict *mpg* (miles per gallon) using *disp* (displacement) and *wt* (weight of the car). This is accomplished using the *lm* function. *lm* is used to fit a variety of linear models, but in this simple example we are doing a multiple linear regression.

    car_mod <- lm(mpg ~ disp + wt, data = mtcars)
    class(car_mod)
    ## [1] "lm"
    summary(car_mod)
    ##
    ## Call:
    ## lm(formula = mpg ~ disp + wt, data = mtcars)
    ##
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max
    ## -3.4087 -2.3243 -0.7683  1.7721  6.3484
    ##
    ## Coefficients:
    ##             Estimate Std. Error t value Pr(>|t|)
    ## (Intercept) 34.96055    2.16454  16.151 4.91e-16 ***
    ## disp        -0.01773    0.00919  -1.929  0.06362 .
    ## wt          -3.35082    1.16413  -2.878  0.00743 **
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##
    ## Residual standard error: 2.917 on 29 degrees of freedom
    ## Multiple R-squared:  0.7809, Adjusted R-squared:  0.7658
    ## F-statistic: 51.69 on 2 and 29 DF,  p-value: 2.744e-10

Recall that when we run the function *summary* on a data frame, it generates summary statistics of the columns in the data frame. When we pass it an object of class *lm*, it provides a summary of the model which was built using the *lm* function. The function *fitted* returns the fitted values for the training data while *predict* can be used to apply the model to new data. *resid* is used to extract the model residuals. *rstandard* and *rstudent* calculate the standardised and Studentised residuals. The help page on *influence.measures* provides the list of diagnostic functions for regression models. We have already looked at the *lm* function above. The function *glm* is used to build generalised linear models. We are not going to cover the details of such models in this article.

    head(fitted(car_mod))
    ##         Mazda RX4     Mazda RX4 Wag        Datsun 710    Hornet 4 Drive
    ##          23.34543          22.49097          25.27237          19.61467
    ## Hornet Sportabout           Valiant
    ##          17.05281          19.37863
    predict(car_mod, newdata = data.frame(
      disp = rnorm(5, mean = 230, sd = 25),
      wt = rnorm(5, mean = 3.5, sd = 2)
    ))
    ##         1         2         3         4         5
    ##  6.588855 16.698995 17.758500 30.541021 15.057982
    head(resid(car_mod))
    ##         Mazda RX4     Mazda RX4 Wag        Datsun 710    Hornet 4 Drive
    ##         -2.345433         -1.490972         -2.472367          1.785333
    ## Hornet Sportabout           Valiant
    ##          1.647193         -1.278631
    head(rstandard(car_mod))
    ##         Mazda RX4     Mazda RX4 Wag        Datsun 710    Hornet 4 Drive
    ##        -0.8222164        -0.5232550        -0.8757799         0.6243627
    ## Hornet Sportabout           Valiant
    ##         0.6092882        -0.4483953
    head(rstudent(car_mod))
    ##         Mazda RX4     Mazda RX4 Wag        Datsun 710    Hornet 4 Drive
    ##        -0.8175008        -0.5165987        -0.8721585         0.6176689
    ## Hornet Sportabout           Valiant
    ##         0.6025603        -0.4421318
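
To see how these pieces fit together, the sketch below refits the same model under the name `fit` and checks two relationships: the residuals are the observed values minus the fitted values, and a prediction is the intercept plus each coefficient times the corresponding predictor. The new car with `disp = 200` and `wt = 3` is a made-up example.

```r
# Refit the model from the article so the block runs on its own
fit <- lm(mpg ~ disp + wt, data = mtcars)

# Residuals are observed values minus fitted values
manual_resid <- mtcars$mpg - fitted(fit)
all.equal(unname(manual_resid), unname(resid(fit)))

# Predict for one hypothetical car
new_car <- data.frame(disp = 200, wt = 3)
pred <- predict(fit, newdata = new_car)

# The same prediction from the coefficients: intercept, disp, wt
manual_pred <- sum(coef(fit) * c(1, 200, 3))
all.equal(unname(pred), manual_pred)
```

Both calls to *all.equal* return *TRUE*, up to floating point tolerance.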

The next set of functions are the ones related to probability distributions. In general, given a probability distribution, the *d* function computes the density, *p* the cumulative distribution function, *q* the quantile function (the inverse of *p*) and *r* generates random numbers from that distribution. For example, the corresponding functions for the normal distribution are *dnorm*, *pnorm*, *qnorm* and *rnorm*. If the distribution has parameters, they are accepted as arguments to these functions, often with default values. For example, for the normal distribution, the default mean is 0 and the default standard deviation is 1.

    dnorm(0)
    ## [1] 0.3989423
    pnorm(0)
    ## [1] 0.5
    qnorm(0.5)
    ## [1] 0
    rnorm(5, mean = 2, sd = 0.5)
    ## [1] 1.488188 1.554239 2.459171 1.773650 1.125814
    qnorm(pnorm(c(0.2, 0.8, 3)))
    ## [1] 0.2 0.8 3.0
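
The same naming pattern applies to other distributions. As a sketch, here are the four functions for a binomial distribution with 10 trials and success probability 0.5; these parameter values are chosen only for illustration. For a discrete distribution the *d* function returns a probability rather than a density.

```r
# P(X = 5): equals choose(10, 5) / 2^10
dens <- dbinom(5, size = 10, prob = 0.5)

# P(X <= 5): the sum of the probabilities for 0 to 5 successes
cum <- pbinom(5, size = 10, prob = 0.5)

# Smallest value whose cumulative probability is at least 0.5
med <- qbinom(0.5, size = 10, prob = 0.5)

# Five random draws; set.seed makes them reproducible
set.seed(1)
draws <- rbinom(5, size = 10, prob = 0.5)

dens
cum
med
draws
```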

The last set of functions we look at in this article relate to matrix algebra. *crossprod* and *tcrossprod* are used to calculate matrix cross-products. They are equivalent to `t(x) %*% y` and `x %*% t(y)` respectively, where *t* transposes a matrix and `%*%` is the matrix multiplication operator. *eigen* computes the eigenvalues and eigenvectors of a matrix and returns them as a list. *qr* and *svd* compute the QR and singular value decompositions of a matrix respectively.

    m1 <- matrix(c(1, 2, 3, 4), nrow = 2)
    m2 <- matrix(c(5, 6, 7, 8), nrow = 2)
    m1
    ##      [,1] [,2]
    ## [1,]    1    3
    ## [2,]    2    4
    m2
    ##      [,1] [,2]
    ## [1,]    5    7
    ## [2,]    6    8
    m1 %*% m2
    ##      [,1] [,2]
    ## [1,]   23   31
    ## [2,]   34   46
    crossprod(m1, m2)
    ##      [,1] [,2]
    ## [1,]   17   23
    ## [2,]   39   53
    tcrossprod(m1, m2)
    ##      [,1] [,2]
    ## [1,]   26   30
    ## [2,]   38   44
    eigen(m1)
    ## eigen() decomposition
    ## $values
    ## [1]  5.3722813 -0.3722813
    ##
    ## $vectors
    ##            [,1]       [,2]
    ## [1,] -0.5657675 -0.9093767
    ## [2,] -0.8245648  0.4159736
    qr(m1)
    ## $qr
    ##            [,1]       [,2]
    ## [1,] -2.2360680 -4.9193496
    ## [2,]  0.8944272 -0.8944272
    ##
    ## $rank
    ## [1] 2
    ##
    ## $qraux
    ## [1] 1.4472136 0.8944272
    ##
    ## $pivot
    ## [1] 1 2
    ##
    ## attr(,"class")
    ## [1] "qr"
    svd(m2)
    ## $d
    ## [1] 13.1900344  0.1516296
    ##
    ## $u
    ##            [,1]       [,2]
    ## [1,] -0.6521255 -0.7581111
    ## [2,] -0.7581111  0.6521255
    ##
    ## $v
    ##            [,1]       [,2]
    ## [1,] -0.5920601  0.8058938
    ## [2,] -0.8058938 -0.5920601
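
The decompositions are easier to trust once they are multiplied back together. The sketch below rebuilds the two matrices under the names `a` and `b` and checks the stated equivalences; *qr.Q* and *qr.R* are the base R helpers that extract the two factors from the object returned by *qr*.

```r
# Same values as the matrices in the article, rebuilt so the block is standalone
a <- matrix(c(1, 2, 3, 4), nrow = 2)
b <- matrix(c(5, 6, 7, 8), nrow = 2)

# Cross-products match their definitions
all.equal(crossprod(a, b), t(a) %*% b)
all.equal(tcrossprod(a, b), a %*% t(b))

# Eigen decomposition: a %*% v equals lambda * v for the first pair
e <- eigen(a)
all.equal(a %*% e$vectors[, 1],
          e$values[1] * e$vectors[, 1, drop = FALSE])

# QR decomposition: Q %*% R recovers a
q <- qr(a)
all.equal(qr.Q(q) %*% qr.R(q), a)

# SVD: u %*% diag(d) %*% t(v) recovers b
s <- svd(b)
all.equal(s$u %*% diag(s$d) %*% t(s$v), b)
```

Each comparison returns *TRUE*; *all.equal* is used instead of `==` because the results agree only up to floating point error.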

*solve* is used to solve a system of linear equations. The first argument provides the coefficients of the linear system in matrix form, while the second argument provides the right-hand side of the system.

    m <- matrix(c(1:4), nrow = 2)
    m
    ##      [,1] [,2]
    ## [1,]    1    3
    ## [2,]    2    4
    solve(m, c(10, 30))
    ## [1] 25 -5
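
The example corresponds to the equations x + 3y = 10 and 2x + 4y = 30. As a quick check, sketched below with the hypothetical names `A`, `rhs` and `sol`, multiplying the coefficient matrix by the solution gives back the right-hand side. When the second argument is omitted, *solve* returns the inverse of the matrix.

```r
# Coefficient matrix and right-hand side from the example
A   <- matrix(1:4, nrow = 2)
rhs <- c(10, 30)

# Solve A %*% sol = rhs
sol <- solve(A, rhs)
sol

# Check: multiplying back recovers the right-hand side
all.equal(as.vector(A %*% sol), rhs)

# With one argument, solve returns the inverse of A
A_inv <- solve(A)
all.equal(A_inv %*% A, diag(2))
```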
