Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

# Introduction

Functions in R allow us to automate things, instead of copy-pasting code.

• Functions can have names, so it makes code easier to read
• If your requirements change, you only have one place where to change your code
• It minimises the risks associated with copy-pasting (e.g. forgetting to change a variable name, etc.)

# When to write a function?

It is commonly recommended that start writing a function whenever you copy-paste some code for the second time (i.e. you have 3 copies of the code)

For example, if you have the relative abundance of species in 3 communities :

```com1 = c(0.5,0.3,0.2)
com2 = c(0.7,0.2,0.1)
com3 = c(0.9,0.1)
```

You could calculate Shannon diversity for the first community with :

```-sum(com1*log(com1))

[1] 1.029653
```

And then for the second and third one with :

```-sum(com2*log(com2))

[1] 0.8018186

-sum(com3*log(com3))

[1] 0.325083
```

At this point, we’ve copy-pasted our code twice, so it is time to turn it into a function…

A good first step when writing a function is to determine what the function needs, what are its inputs. In this case, the function needs the relative abundances of a community. Let’s call them `p`, like in the traditional notation.

```p <- com1
-sum(p*log(p))

[1] 1.029653

p <- com2
-sum(p*log(p))

[1] 0.8018186
```

Notice that I immediately test my code, to make sure that I’ve extracted everything that I neede to make the code work.

Then, you only need to wrap this code and tell R that you function needs one argument, `p`

If you don’t mention anything, a function in R returns the result of the last command it executes

```diversite_shannon <- function(p) {
-sum(p*log(p))
}
diversite_shannon(com1)

[1] 1.029653

diversite_shannon(com2)

[1] 0.8018186

diversite_shannon(com3)

[1] 0.325083
```

Our code is now much more readable AND easier to maintain

(yes, we’re still copy-pasting some stuff…)

# Conditional execution

You can insert conditional statements inside your functions (anywhere in fact), using the `if` keyword

The typical structure of an IF statement goes like this :

```if (condition) {
# gets run if condition is true
} else {
# gets run if condition is false
}

salutations <- function(nom) {
if (nom == "Charles") {
print("Allo Charles")
} else {
print("Qui êtes vous?")
}
}
salutations("Esteban")

[1] "Qui êtes vous?"

salutations("Charles")

[1] "Allo Charles"
```

Note that it is optional for your function to return a value

# Arguments

Functions in R can have as many arguments as you wish

Usually, in R, the first arguments contain data, whereas the last ones contain details about how to do the calculations.

These arguments about calculation details can have default values, which the user only has to change if needed.

```pile_face <- function(n,probabilite_pile = 0.5) {
sample(
c("pile","face"),
size = n,
prob = c(probabilite_pile,1 - probabilite_pile),
replace = TRUE
)
}
pile_face(25)

[1] "face" "face" "pile" "face" "face" "pile" "pile" "pile" "pile" "face"
[11] "pile" "pile" "face" "pile" "pile" "pile" "face" "pile" "face" "face"
[21] "pile" "face" "pile" "pile" "pile"

pile_face(25, 0.9)

[1] "pile" "pile" "pile" "pile" "pile" "pile" "pile" "pile" "pile" "pile"
[11] "pile" "pile" "pile" "pile" "pile" "pile" "face" "pile" "pile" "pile"
[21] "face" "pile" "pile" "pile" "pile"
```

## Defensive programming

When you become comfortable with writing functions, there will rapidly come a point where you won’t remember how you coded everything inside every function you’ve written and the constaints associated.

For example, our function to calculate Shannon’s diversity expects to receive relative frequencies. The calculation is not defined if the sum of the `p` values is not 1.

However, our function let’s us do the calculation on absolute numbers

```diversite_shannon(c(1,5,25,12))

[1] -118.338
```

To protect our future-self, we can add some checks, which will stop the calculations if some condition is not met.

```diversite_shannon <- function(p) {
stopifnot(sum(p) == 1)
- sum(p*log(p))
}
diversite_shannon(c(1,5,25,12))

Error in diversite_shannon(c(1, 5, 25, 12)): sum(p) == 1 is not TRUE
```

You can also write more user-friendly messages with just a little more code

```diversite_shannon <- function(p) {
if (sum(p) != 1) {
stop("L'argument p doit contenir des probabilités relatives, dont la somme est 1.")
}
- sum(p*log(p))
}
diversite_shannon(c(1,5,25,12))

Error in diversite_shannon(c(1, 5, 25, 12)): L'argument p doit contenir des probabilités relatives, dont la somme est 1.
```

## The special dot-dot-dot (…) argument

Function definitions in R can also contain a special argument named dot-dot-dot. This argument, if present, will catch every argument that the function received and was not explicitely named in the function definition.

It can be especially useful to send these arguments back to another function.

If anytime you need an histogram you like to have it blue instead of white, you could create your own function, which calls the original, adding you favorite or often used options:

```bleustogram <- function(...){
hist(..., col = "royalblue")
}
bleustogram(rnorm(50))
```

```bleustogram(rlnorm(100))
```

# Many exit points

An R function can contain many points where it stops to return a value. In this case, these points need to be defined explicitely :

```diversite <- function(p, indice = "shannon") {
if (indice == "shannon") {
return(-sum(p*log(p)))
} else if (indice == "simpson") {
return(sum(p^2))
} else {
stop("L'indice doit être shannon ou simpson")
}
}
diversite(c(0.5,0.3,0.2))

[1] 1.029653

diversite(c(0.5,0.3,0.2), indice = "simpson")

[1] 0.38
```

Anything that you create inside a function exist only there, and cannot be accessed from the outside. Objects are reset every the functio is called.

A special feature of R is that, if a variable is not defined inside a function, R will also search outside of it, in the global environment, for an object having the same name.

```b <- 2
f1 <- function(d) {
d * b
}
f1(3)

[1] 6

f2 <- function(d) {
b <- 8
d * b
}
f2(3)

[1] 24

b

[1] 2

d

```

This is why it is important to be very careful while extracting the arguments necessary to your function. It is important to restart your R session once in a while, to make sure that things are not “accidently working” because an object with a particular name exist in your working environment.

# Introducing interation

Beside function, another programming technique to reduce code duplication is the use of iterations (syn. repetitions)

There are two styles of iteration in R : Imperative programming and functional programming.

Imperative programming includes FOR and WHILE loops. It is often the most intuitive way to begin, because the concepts are explicit.

Consequently, imperative programming involves lots of “plumbing code”, that repeats itself from a loop to another and “drowns” the actual intent of the code. Functional programming allows us to extract to the core task of our code and produces denser, easier to read and maintain code.

# FOR loops

If one wanted to automate the writing of the first verse of the song, we could write :

```for (i in 1:3) {
print("Baby shark, doo doo doo doo doo doo")
}
print("Baby shark!")

[1] "Baby shark, doo doo doo doo doo doo"
[1] "Baby shark, doo doo doo doo doo doo"
[1] "Baby shark, doo doo doo doo doo doo"
[1] "Baby shark!"
```

A FOR loop has 2 main components :

• The first line defines how many times we wish to do the loop
• The code between {} defines the code we wish to repeat.

Each iteration doesn’t need to do an identical job, their action can be customized based on the index the loop is currently at.

```mots = c("Baby shark","Mommy shark","It's the end")
for (i in seq_along(mots)) { # vs. 1:length(mots)
print(paste0(mots[i],", doo doo doo doo doo doo"))
}

[1] "Baby shark, doo doo doo doo doo doo"
[1] "Mommy shark, doo doo doo doo doo doo"
[1] "It's the end, doo doo doo doo doo doo"
```

## Nesting

You probably saw that one coming from a mile, but you can also nest loops inside one another:

```mots = c("Baby shark","Mommy shark","It's the end")
for (i in seq_along(mots)) {
for (j in 1:3) {
print(paste0(mots[i],", doo doo doo doo doo doo"))
}
print(paste0(mots[i],"!"))
}

[1] "Baby shark, doo doo doo doo doo doo"
[1] "Baby shark, doo doo doo doo doo doo"
[1] "Baby shark, doo doo doo doo doo doo"
[1] "Baby shark!"
[1] "Mommy shark, doo doo doo doo doo doo"
[1] "Mommy shark, doo doo doo doo doo doo"
[1] "Mommy shark, doo doo doo doo doo doo"
[1] "Mommy shark!"
[1] "It's the end, doo doo doo doo doo doo"
[1] "It's the end, doo doo doo doo doo doo"
[1] "It's the end, doo doo doo doo doo doo"
[1] "It's the end!"
```

## Keeping a result for each iteration

In a loop, it is often useful to keep the result of some calculation at each iteration. In such a case, it is strongly recommended to pre-allocate our result object before the loop begins. This is the key to fast loops in R.

For example, let’s prepare a loop that would calculate the absolute value of a series of numbers

```nombres <- c(-1,0,1,-5)
valeurs_absolues <- vector("double", length(nombres))
for (i in seq_along(nombres)) {
valeurs_absolues[i] <- abs(nombres[i])
}
valeurs_absolues

[1] 1 0 1 5
```

## Unknown number of iterations

When you don’t know beforehand how many times our loop will run, there exist a second R structure which allows this kind of iteration : the `while` loop.

These loops are particularly useful in simulations.

For example, let’s run some code to determine how many coin tosses are needed to obtain a sequence of 3 heads in a row.

```tirages <- 0
piles_de_suite <- 0

while (piles_de_suite < 3) {

resultat <- pile_face(1)

if ((resultat) == "pile") {
piles_de_suite <- piles_de_suite + 1
} else {
piles_de_suite <- 0
}

tirages <- tirages + 1
}

tirages

[1] 33
```

# The Map family of functions

As said before, iteration can be tackly in a completely different way, using functionnal programming. In R, the `purrr` package contains many functions that allows us to do functional programming in a simple and intuitive way.

```library(purrr)
```

The principle behind each of these functions is all the same : instead of writing the code that makes the loop work, you give it a function and a series of elements to apply it on.

There are many `map` functions, depending on the return type you want to obtain.

• `map` returns a list
• `map_lgl` returns `TRUE`/`FALSE`
• `map_int` returns integers
• `map_dbl` returns floating point numbers
• `map_chr` returns text
• `map_df` returns a data.frame object

Let’s revisit our code snippet about absolute values. It could be converted to this simple piece of code

```nombres <- c(-1,0,1,-5)
valeurs_absolues <- map_dbl(nombres, abs)
valeurs_absolues

[1] 1 0 1 5
```

We thus save a lot of `plumbing` code about how to make that loop

You can also give `map` a custom made function, e.g. let’s say you wanted to calculate the absolute value of these numbers and then add 10.

```nombres <- c(-1,0,1,-5)
ma_fonction <- function(x) {
abs(x) + 10
}
map_dbl(nombres, ma_fonction)

[1] 11 10 11 15
```

If this custom made function is used only in this code, you can define it anonymously inside the map function :

```nombres <- c(-1,0,1,-5)
map_dbl(nombres, function(x) {abs(x) + 10})

[1] 11 10 11 15
```

Generally, this use is limited to short pieces of code.

## Shortcuts included in `map` functions

One of the advantages of the `map` functions in `purrr` package is that it enables us to cut the obvious parts of our code with shortcuts :

```  nombres <- c(-1,0,1,-5)
map_dbl(nombres, ~ abs(.) + 10)

[1] 11 10 11 15
```

You can replace the `function` part with a `~` and the name of the variable with a dot.

# Managing issues

When using `map` functions on long series of data, it can happen that our function fails for a reason or another. When a problem happens, the `map` function stops, with an error message, but you don’t retrieve all the instances that worked before the issue.

Let’s go back to our Shannon diversity function. If one of our communities does not contain relative frequencies, we loose all the remaining results…

```communautes <- list(
c(0.5,0.3,0.2),
c(0.9,0.1),
c(10,3,1)
)

map_dbl(communautes,diversite_shannon)

Error in .f(.x[[i]], ...): L'argument p doit contenir des probabilités relatives, dont la somme est 1.
```

The `purrr` package contains many adverbs to manage these kinds of situations. In each case, these adverbs wrap our function by modifying it’s behaviour in case of errors. Here we’re only seing the simplest case, `possibly`, to which you need to supply a value in case of error.

```map_dbl(communautes,possibly(diversite_shannon,NA))

[1] 1.029653 0.325083       NA
```

# There is also a map function to build data.frames

```map_df(communautes,function(x){
data.frame(
richesse = length(x),
shannon = diversite(x, indice = "shannon"),
simpson = diversite(x, indice = "simpson")
)
})

richesse    shannon simpson
1        3   1.029653    0.38
2        2   0.325083    0.82
3        3 -26.321688  110.00
```

This function can also me tremendously useful when you need to read a bunch of csv files from a folder and bind them in a single data.frame

```fichiers <- list.files(
"/Dossier/Avec/Les/Donnees",
pattern = "*.csv",
full.names = TRUE
)
```

# data.frames on input *and * output

To illustrate this case, let’s imagine a scenario where you have explored the relationship between body weight and brain weight in mammals.

```library(ggplot2)
library(dplyr)

Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

filter, lag

The following objects are masked from 'package:base':

intersect, setdiff, setequal, union

data(msleep)

msleep %>%
ggplot(aes(x = bodywt, y = brainwt)) +
geom_point(aes(color = vore)) +
scale_x_log10() +
scale_y_log10()

Warning: Removed 27 rows containing missing values (geom_point).
```

You would now like to run a regression per group to compare the parameter values.

First, let’s see how we’d do that for a single group, the herbivores :

```x <- msleep %>% filter(vore == "herbi")
m <- lm(log(brainwt)~log(bodywt),data = x)
summary(m)

Call:
lm(formula = log(brainwt) ~ log(bodywt), data = x)

Residuals:
Min       1Q   Median       3Q      Max
-0.79504 -0.28970 -0.08648  0.19723  1.12209

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.94213    0.12464  -39.65  < 2e-16 ***
log(bodywt)  0.74212    0.03007   24.68 2.48e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4864 on 18 degrees of freedom
(12 observations deleted due to missingness)
Multiple R-squared:  0.9713,	Adjusted R-squared:  0.9697
F-statistic: 609.1 on 1 and 18 DF,  p-value: 2.484e-15
```

It is not necessarily simple to extract the slope estimate from this model object to put them in a data.frame object.

```str(m)

List of 13
\$ coefficients : Named num [1:2] -4.942 0.742
..- attr(*, "names")= chr [1:2] "(Intercept)" "log(bodywt)"
\$ residuals    : Named num [1:20] -0.6656 0.6216 0.1733 -0.0253 0.5345 ...
..- attr(*, "names")= chr [1:20] "2" "4" "5" "6" ...
\$ effects      : Named num [1:20] 15.3808 12.0043 0.3263 -0.0469 0.4878 ...
..- attr(*, "names")= chr [1:20] "(Intercept)" "log(bodywt)" "" "" ...
\$ rank         : int 2
\$ fitted.values: Named num [1:20] -0.195 -2.942 -2.336 -5.178 -5.586 ...
..- attr(*, "names")= chr [1:20] "2" "4" "5" "6" ...
\$ assign       : int [1:2] 0 1
\$ qr           :List of 5
..\$ qr   : num [1:20, 1:2] -4.472 0.224 0.224 0.224 0.224 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..\$ : chr [1:20] "2" "4" "5" "6" ...
.. .. ..\$ : chr [1:2] "(Intercept)" "log(bodywt)"
.. ..- attr(*, "assign")= int [1:2] 0 1
..\$ qraux: num [1:2] 1.22 1.01
..\$ pivot: int [1:2] 1 2
..\$ tol  : num 1e-07
..\$ rank : int 2
..- attr(*, "class")= chr "qr"
\$ df.residual  : int 18
\$ na.action    : 'omit' Named int [1:12] 1 3 12 13 15 17 19 21 22 25 ...
..- attr(*, "names")= chr [1:12] "1" "3" "12" "13" ...
\$ xlevels      : Named list()
\$ call         : language lm(formula = log(brainwt) ~ log(bodywt), data = x)
\$ terms        :Classes 'terms', 'formula'  language log(brainwt) ~ log(bodywt)
.. ..- attr(*, "variables")= language list(log(brainwt), log(bodywt))
.. ..- attr(*, "factors")= int [1:2, 1] 0 1
.. .. ..- attr(*, "dimnames")=List of 2
.. .. .. ..\$ : chr [1:2] "log(brainwt)" "log(bodywt)"
.. .. .. ..\$ : chr "log(bodywt)"
.. ..- attr(*, "term.labels")= chr "log(bodywt)"
.. ..- attr(*, "order")= int 1
.. ..- attr(*, "intercept")= int 1
.. ..- attr(*, "response")= int 1
.. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. ..- attr(*, "predvars")= language list(log(brainwt), log(bodywt))
.. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
.. .. ..- attr(*, "names")= chr [1:2] "log(brainwt)" "log(bodywt)"
\$ model        :'data.frame':	20 obs. of  2 variables:
..\$ log(brainwt): num [1:20] -0.86 -2.32 -2.16 -5.2 -5.05 ...
..\$ log(bodywt) : num [1:20] 6.397 2.695 3.512 -0.317 -0.868 ...
..- attr(*, "terms")=Classes 'terms', 'formula'  language log(brainwt) ~ log(bodywt)
.. .. ..- attr(*, "variables")= language list(log(brainwt), log(bodywt))
.. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
.. .. .. ..- attr(*, "dimnames")=List of 2
.. .. .. .. ..\$ : chr [1:2] "log(brainwt)" "log(bodywt)"
.. .. .. .. ..\$ : chr "log(bodywt)"
.. .. ..- attr(*, "term.labels")= chr "log(bodywt)"
.. .. ..- attr(*, "order")= int 1
.. .. ..- attr(*, "intercept")= int 1
.. .. ..- attr(*, "response")= int 1
.. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. .. ..- attr(*, "predvars")= language list(log(brainwt), log(bodywt))
.. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
.. .. .. ..- attr(*, "names")= chr [1:2] "log(brainwt)" "log(bodywt)"
..- attr(*, "na.action")= 'omit' Named int [1:12] 1 3 12 13 15 17 19 21 22 25 ...
.. ..- attr(*, "names")= chr [1:12] "1" "3" "12" "13" ...
- attr(*, "class")= chr "lm"

m\$coefficients

(Intercept) log(bodywt)
-4.942134    0.742122
```

In the `tidyverse`, there is a package made especially for such situations, to extract numbers from models and put them in a data.frame.

```library(broom)
tidy(m)

# A tibble: 2 x 5
term        estimate std.error statistic  p.value
<chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   -4.94     0.125      -39.6 5.69e-19
2 log(bodywt)    0.742    0.0301      24.7 2.48e-15
```

You could also have been interest in model-level numbers instead :

```glance(m)

# A tibble: 1 x 11
r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
<dbl>         <dbl> <dbl>     <dbl>    <dbl> <int>  <dbl> <dbl> <dbl>
1     0.971         0.970 0.486      609. 2.48e-15     2  -12.9  31.8  34.8
# … with 2 more variables: deviance <dbl>, df.residual <int>
```

Now we have in hand everything we need to calculate a regression model per group

```resultats <- msleep %>%
group_by(vore) %>%
group_modify(function(x,...){
m <- lm(log(brainwt)~log(bodywt),data = x)
tidy(m)
})
resultats

# A tibble: 10 x 6
# Groups:   vore [5]
vore    term        estimate std.error statistic  p.value
<chr>   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 carni   (Intercept)   -4.47     0.278     -16.1  8.61e- 7
2 carni   log(bodywt)    0.667    0.112       5.95 5.71e- 4
3 herbi   (Intercept)   -4.94     0.125     -39.6  5.69e-19
4 herbi   log(bodywt)    0.742    0.0301     24.7  2.48e-15
5 insecti (Intercept)   -5.11     0.167     -30.6  7.66e- 5
6 insecti log(bodywt)    0.702    0.0477     14.7  6.84e- 4
7 omni    (Intercept)   -4.45     0.202     -22.0  7.76e-13
8 omni    log(bodywt)    0.869    0.0724     12.0  4.28e- 9
9 <NA>    (Intercept)   -4.97     0.345     -14.4  7.23e- 4
10 <NA>    log(bodywt)    0.837    0.194       4.33 2.28e- 2
```

See that, just as when building functions, we made sure to test our code before integrating in the iteration code

You can then vizualize all these coefficients in a single, synthetic plot.

```resultats %>%
ggplot(aes(x = vore, y = estimate)) +
geom_point() +
geom_linerange(aes(ymin = estimate-std.error, ymax = estimate+std.error)) +
facet_wrap(~term, scale = "free_y", ncol = 1)
```