# Automatic Code Cleaning in R with Rclean

**rOpenSci - open tools for open science**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Leave the code cleaner than you found it.

– R.C. Martin in

Clean Code

The R language has become very popular among scientists and analysts

because it enables the rapid development of software and empowers

scientific investigation. However, regardless of the language used,

data analysis is usually complicated. Because of various project

complexities and time constraints, analytical software often reflects

these challenges. “What did I measure? What analyses are relevant to

the study? Do I need to transform the data? What’s the function for the

analysis I want to run?” Although many researchers see the value in

learning to write software, the time investment for learning a

programming language alone is still exceedingly high for many, let

alone also learning software best practices. The downside to the rapid

spread of data science is that learning to create good software takes

a back-seat to just writing code that will get the job “done” leading

to issues with transparency and software that is highly unstable

(i.e. buggy and not reproducible).

The goal of the R package Rclean

is to provide a automated tool to help scientists more easily write

better code. Specifically, Rclean

has been designed to facilitate the isolation of the code needed to

produce one or more results, because more often then not, when someone

is writing an R script, the ultimate goal is analytical results for

inference, such as a set of statistical analyses and associated

figures and tables. As the investigative process is inherently

iterative, this set of results is nearly always a subset of a much

larger set of possible ways to explore a dataset. There are many

statistical tests and visualizations and other representations that

can be employed in a myriad of ways depending on how the data are

processed. This commonly leads to lengthy, complicated scripts from

which researchers manually subset results, but which are likely never

to be refactored because of the difficulty in disentangling the code.

The Rclean package uses a

technique based on data provenance and network algorithms to isolate

code for a desired result *automatically*. The intent is to ease

refactoring for scientists that use R but do not have formal training

in software design and specifically with the “art” of creating clean

code, which in part is the development of an intuitive sense of how

code is and/or should be organized. However, manually culling code is

tedious and potentially leads to errors regardless of the level of

expertise a coder might have; therefore, we see

Rclean as a useful tool for

programming in R at any level of ability. Here, we’ll cover

details on the implementation and design of the package, a general

example of how it can be used and thoughts on its future development.

### How to use rclean to write cleaner code

#### Installation

Through the helpful feedback from the rOpenSci community, the package

has recently passed software review and a supporting article was

recently published in the Journal of Open Source Software ^{1},

in which you can find more details about the package. The package is

hosted through the rOpenSci organization on GitHub, and the package

can be installed using the devtools

package ^{2} directly from the repository

(https://github.com/ROpenSci/Rclean).

```
library(devtools)
install_github("ROpenSci/Rclean")
```

If you do not already have

RGraphviz,

you will need to install it using the following code *before*

installing Rclean:

```
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("Rgraphviz")
```

#### Isolating code for a set of results

Analytical scripts that have not been refactored are often both long

and complicated. However, a script doesn’t need to be long to be

complicated. The following example script presents some challenges

such that even though it’s not a long script, picking through it to

get a result would likely prove to be frustrating.

```
library(stats)
x <- 1:100
x <- log(x)
x <- x * 2
x <- lapply(x, rep, times = 4)
### This is a note that I made for myself.
### Next time, make sure to use a different analysis.
### Also, check with someone about how to run some other analysis.
x <- do.call(cbind, x)
### Now I'm going to create a different variable.
### This is the best variable the world has ever seen.
x2 <- sample(10:1000, 100)
x2 <- lapply(x2, rnorm)
### Wait, now I had another thought about x that I want to work through.
x <- x * 2
colnames(x) <- paste0("X", seq_len(ncol(x)))
rownames(x) <- LETTERS[seq_len(nrow(x))]
x <- t(x)
x[, "A"] <- sqrt(x[, "A"])
for (i in seq_along(colnames(x))) {
set.seed(17)
x[, i] <- x[, i] + runif(length(x[, i]), -1, 1)
}
### Ok. Now I can get back to x2.
### Now I just need to check out a bunch of stuff with it.
lapply(x2, length)[1]
max(unlist(lapply(x2, length)))
range(unlist(lapply(x2, length)))
head(x2[[1]])
tail(x2[[1]])
## Now, based on that stuff, I need to subset x2.
x2 <- lapply(x2, function(x) x[1:10])
## And turn it into a matrix.
x2 <- do.call(rbind, x2)
## Now, based on x2, I need to create x3.
x3 <- x2[, 1:2]
x3 <- apply(x3, 2, round, digits = 3)
## Oh wait! Another thought about x.
x[, 1] <- x[, 1] * 2 + 10
x[, 2] <- x[, 1] + x[, 2]
x[, "A"] <- x[, "A"] * 2
## Now, I want to run an analysis on two variables in x2 and x3.
fit.23 <- lm(x2 ~ x3, data = data.frame(x2[, 1], x3[, 1]))
summary(fit.23)
## And while I'm at it, I should do an analysis on x.
x <- data.frame(x)
fit.xx <- lm(A~B, data = x)
summary(fit.xx)
shapiro.test(residuals(fit.xx))
## Ah, it looks like I should probably transform A.
## Let's try that.
fit_sqrt_A <- lm(I(sqrt(A))~B, data = x)
summary(fit_sqrt_A)
shapiro.test(residuals(fit_sqrt_A))
## Looks good!
## After that. I came back and ran another analysis with
## x2 and a new variable.
z <- c(rep("A", nrow(x2) / 2), rep("B", nrow(x2) / 2))
fit_anova <- aov(x2 ~ z, data = data.frame(x2 = x2[, 1], z))
summary(fit_anova)
```

So, let’s say we’ve come to our script wanting to extract the code to

produce one of the results `fit_sqrt_A`

, which is an analysis that is

the fitted model object for an analysis. We might want to double check

the results, and we also might need to use the code again for another

purpose, such as creating a plot of the patterns supported by the

test. Manually tracing through our code for all the variables used in

the test and finding all of the code used to prepare them

for the analysis would be difficult, especially given the

fact that we have used “x” as a prefix for multiple unrelated objects

in the script. However, Rclean can

do this easily via the `clean()`

function.

```
library(Rclean)
script <- system.file("example",
"long_script.R",
package = "Rclean")
clean(script, "fit_sqrt_A")
```

```
x <- 1:100
x <- log(x)
x <- x * 2
x <- lapply(x, rep, times = 4)
x <- do.call(cbind, x)
x <- x * 2
colnames(x) <- paste0("X", seq_len(ncol(x)))
rownames(x) <- LETTERS[seq_len(nrow(x))]
x <- t(x)
x[, "A"] <- sqrt(x[, "A"])
for (i in seq_along(colnames(x))) {
set.seed(17)
x[, i] <- x[, i] + runif(length(x[, i]), -1, 1)
}
x[, 1] <- x[, 1] * 2 + 10
x[, 2] <- x[, 1] + x[, 2]
x[, "A"] <- x[, "A"] * 2
x <- data.frame(x)
fit_sqrt_A <- lm(I(sqrt(A)) ~ B, data = x)
```

The output is the code that Rclean

has picked out from the tangled bits of code, which in this case is an

example script included with the package. Here’s a view of this

isolated code highlighted in the original script.

```
library(stats)
x <- 1:100
x <- log(x)
x <- x * 2
x <- lapply(x, rep, times = 4)
### This is a note that I made for myself.
### Next time, make sure to use a different analysis.
### Also, check with someone about how to run some other analysis.
x <- do.call(cbind, x)
### Now I'm going to create a different variable.
### This is the best variable the world has ever seen.
x2 <- sample(10:1000, 100)
x2 <- lapply(x2, rnorm)
### Wait, now I had another thought about x that I want to work through.
x <- x * 2
colnames(x) <- paste0("X", seq_len(ncol(x)))
rownames(x) <- LETTERS[seq_len(nrow(x))]
x <- t(x)
x[, "A"] <- sqrt(x[, "A"])
for (i in seq_along(colnames(x))) {
set.seed(17)
x[, i] <- x[, i] + runif(length(x[, i]), -1, 1)
}
### Ok. Now I can get back to x2.
### Now I just need to check out a bunch of stuff with it.
lapply(x2, length)[1]
max(unlist(lapply(x2, length)))
range(unlist(lapply(x2, length)))
head(x2[[1]])
tail(x2[[1]])
## Now, based on that stuff, I need to subset x2.
x2 <- lapply(x2, function(x) x[1:10])
## And turn it into a matrix.
x2 <- do.call(rbind, x2)
## Now, based on x2, I need to create x3.
x3 <- x2[, 1:2]
x3 <- apply(x3, 2, round, digits = 3)
## Oh wait! Another thought about x.
x[, 1] <- x[, 1] * 2 + 10
x[, 2] <- x[, 1] + x[, 2]
x[, "A"] <- x[, "A"] * 2
## Now, I want to run an analysis on two variables in x2 and x3.
fit.23 <- lm(x2 ~ x3, data = data.frame(x2[, 1], x3[, 1]))
summary(fit.23)
## And while I'm at it, I should do an analysis on x.
x <- data.frame(x)
fit.xx <- lm(A~B, data = x)
summary(fit.xx)
shapiro.test(residuals(fit.xx))
## Ah, it looks like I should probably transform A.
## Let's try that.
fit_sqrt_A <- lm(I(sqrt(A))~B, data = x)
summary(fit_sqrt_A)
shapiro.test(residuals(fit_sqrt_A))
## Looks good!
## After that. I came back and ran another analysis with
## x2 and a new variable.
z <- c(rep("A", nrow(x2) / 2), rep("B", nrow(x2) / 2))
fit_anova <- aov(x2 ~ z, data = data.frame(x2 = x2[, 1], z))
summary(fit_anova)
```

The isolated code can now be visually inspected to adapt the original

code or ported to a new, refactored script using `keep()`

.

```
fitSA <- clean(script, "fit_sqrt_A")
keep(fitSA)
```

This will pass the code to the clipboard for pasting into another

document. To write directly to a new file, a file path can be

specified.

```
fitSA <- clean(script, "fit_sqrt_A")
keep(fitSA, file = "fit_SA.R")
```

To explore more possible variables to extract, the `get_vars()`

function

can be used to produce a list of the variables (aka. objects) that are

created in the script.

```
get_vars(script)
```

```
[1] "x" "x2" "i" "x3" "fit.23"
[6] "fit.xx" "fit_sqrt_A" "z" "fit_anova"
```

Especially when the code for different variables are entangled, it can

be useful to visual the code in order to devise an approach to

cleaning. The `code_graph()`

function can also give us a visual of the

code and the objects that they produce.

```
code_graph(script)
```

After examining the output from `get_vars()`

and `code_graph()`

, it is

possible that more than one object needs to be isolated. This can be

done by adding additional objects to the list of *vars*.

```
clean(script, vars = c("fit_sqrt_A", "fit_anova"))
```

```
x <- 1:100
x <- log(x)
x <- x * 2
x <- lapply(x, rep, times = 4)
x <- do.call(cbind, x)
x2 <- sample(10:1000, 100)
x2 <- lapply(x2, rnorm)
x <- x * 2
colnames(x) <- paste0("X", seq_len(ncol(x)))
rownames(x) <- LETTERS[seq_len(nrow(x))]
x <- t(x)
x[, "A"] <- sqrt(x[, "A"])
for (i in seq_along(colnames(x))) {
set.seed(17)
x[, i] <- x[, i] + runif(length(x[, i]), -1, 1)
}
x2 <- lapply(x2, function(x) x[1:10])
x2 <- do.call(rbind, x2)
x[, 1] <- x[, 1] * 2 + 10
x[, 2] <- x[, 1] + x[, 2]
x[, "A"] <- x[, "A"] * 2
x <- data.frame(x)
fit_sqrt_A <- lm(I(sqrt(A)) ~ B, data = x)
z <- c(rep("A", nrow(x2) / 2), rep("B", nrow(x2) / 2))
fit_anova <- aov(x2 ~ z, data = data.frame(x2 = x2[, 1], z))
```

Currently, libraries can not be isolated directly during the cleaning

process. So, the `get_libs()`

function provides a way to detect the

libraries for a given script. We just need to supply a file path and

`get_libs()`

will return the libraries that are called by that script.

```
get_libs(script)
```

```
[1] "stats"
```

### The provenance engine under the hood

The `clean()`

function provides an effective way to remove code that is

unwanted; however, many researchers are wary about doing this exact

thing for at least a few reasons. Perhaps the top reason is that the

main goal of an analysis is the results and taking time to craft

transparent, dependable software is not the priority. As such, taking

time to go back through a script and remove code is time

wasted. Relatedly, for most researchers the best way to keep track of

the various analyses that they have explored is to keep them in the

script, as they do not use a rigorous version control system but

instead rely on file backups and informal versioning. Although we

can’t give researchers more hours in the day, providing an easier and

more reliable means to remove unused code will lower the barrier to

creating better, cleaner code. Combined with the increasing use of

version control systems and digital notebooks, the practice of

“saving” analytical ideas in a script will become less common and code

quality will increase.

The process that Rclean uses

relies on the generation of data provenance. The term provenance

means information about the origins of some object. Data provenance is

a formal representation of the execution of a computational process, to rigorously determine the the

unique computational pathway from inputs to results ^{3}. To

avoid confusion, note that “data” in this context is used in a broad

sense to include all of the information generated during computation,

not just the data that are collected in a research project that are

used as input to an analysis. Having the formalized, mathematically

rigorous representation that data provenance provides guarantees that

analyses conducted by Rclean

are theoretically sound. Most importantly, because the relationships

defined by the provenance can be represented as a graph, it is

possible to apply network search algorithms to determine the minimum

and sufficient code needed to generate the chosen result in the

`clean()`

function.

There are multiple approaches to collecting data provenance, but

Rclean uses “prospective”

provenance, which analyzes code and uses language-specific information

to predict the relationship among processes and data

objects. Rclean relies on an

R package called CodeDepends to gather the prospective provenance for

each script. For more information on the mechanics of the CodeDepends

package, see ^{4}. To get an idea of what data provenance

is, we can use the `code_graph()`

function to get a graphical representation

of the prospective provenance generated for

Rclean.

All of this work with the provenance is to get the network

representation of relationships among functions and objects. The

provenance network is very powerful because we can now apply

algorithms to analyze the R script with respect to our results. This is

what empowers the `clean()`

function, which takes the provenance and

applies a network search algorithm to determine the pathways leading

from inputs to outputs. In the process any objects or functions

that do not fall along that pathway are by definition not necessary to

produce the desired set of results and can therefore be removed. As

demonstrated in the example, this property of the provenance network

is what facilitates the robust isolation of the minimal code necessary

to generate the output we want.

One important topic to discuss is that

Rclean *does not* keep comments

present in code. This is the result of a limitation of the data

provenance collection, which currently does not assign them a

relationship in the provenance network. This is a general issue with

detecting the relationships between comments and code. For example,

comments at the end of lines are typically relevant to the line they

are on but this is not a linguistic requirement. Also, comments

occupying their own lines usually refer to the following lines but

this is also not necessarily the case either. In fact comments can

refer to any or none of the code relative to their position in the

script, the latter commonly being the case when code is removed from a

script but comments referring to it have not. The inferred and

explicit meanings of comments are a cultural and not linguistic.

That being said, although Rclean

cannot operate automatically on comments, comments in the original

code remain untouched and can be used to inform the reduced

code. Also, as the `clean()`

function is oriented toward isolating

code based on a specific result, the resulting code tends to naturally

support the generation of new comments that are higher level

(e.g. “The following produces a plot of the mean response of each

treatment group."), and lower level comments are not necessary because

the code is simpler and clearer. This process of commenting is an

important part of writing better code. Lastly, although comments can

serve an important role in coding, it is worth reflecting on the

statement in R.C. Martin’s book *Clean Code: A Handbook of Agile Software Craftsmanship* where he writes that, “Comments do not

compensate for bad code.”

### Concluding remarks and future work

Rclean provides a simple, easy to

use tool for scientists who would like help refactoring code. Using

Rclean, the code necessary to

produce a specified result (e.g., an object stored in memory or a

table or figure written to disk) can be easily and *reliably* isolated

even when tangled with code for other results. Tools, such as this,

that make it easier to produce transparent, accessible code will be an

important aid for improving scientific reproducibility

^{5}.

Although the current implementation of

Rclean for minimizing code is

useful on its own, we see promise in connecting it with other

reproducibility tools. One example is the

reprex package, which provides a simple

API for sharing reproducible examples

^{6}. Rclean could provide

a reliable way to extract parts of a larger script that would be piped

to a simplified reproducible example. Another possibility is to help

transition scripts to functions, packages and workflows refactoring

via toolboxes like drake

^{7}. Since Rclean can

isolate the code from inputs to one or more outputs, it could be used

to extract all of the components needed to write one or more functions

that would be a part of a package or workflow, as is the goal of

drake.

In the future, it would also be useful to extend the existing

framework to support other provenance methods. One possibility is

*retrospective provenance*, which tracks a computational process as it

is executing. Through this active, concurrent monitoring,

retrospective provenance can gather information that static

prospective provenance can’t. Greater details of the computational

process would enable other features that could address some

challenges, such as libraries that are actually used by the code,

processing comments (as discussed above), parsing control statements

and replicating random processes. Using retrospective provenance

comes at a cost, however. In order to gather it, the script needs to

be executed. When scripts are computationally intensive or contain

bugs that stop execution retrospective provenance can not be obtained

for part or all of the code. Although such costs may present

challenges, combining prospective and retrospective provenance methods

could provide a powerful and flexible solution. Some work has already

been done in the direction of implementing retrospective provenance

for code cleaning in R (see http://end-to-end-provenance.github.io);

however, there doesn’t appear to be a tool that synthesizes these two

approaches to provenance.

We hope that Rclean makes writing

scientific software easier for the R community. The package has

already been significantly improved via the rOpenSci review

process,

thanks to the efforts of editor Anna

Krystalli and reviewers, Will

Landau and Clemens

Schmid. We look forward to the future progress

of the package and other “code cleaning” tools. As an open-source

project, we would like to encourage feedback and help with extending

the package. We invite people to use the package and get involved by

reporting bugs and suggesting or (hopefully) contributing

features. For more information please visit the project page on

GitHub.

### Acknowledgements

Thanks to Steffi Lazerte, Stefanie

Butland and Maëlle

Salmon for editorial work and technical

help. Special word of thanks to Maëlle

Salmon for the addition of the code

highlighting figure in the `clean()`

example.

Lau M, Pasquier TFJ, Seltzer M (2020). “Rclean: A Tool for

Writing Cleaner, More Transparent Code.”*Journal of Open Source*,

Software*5*(46), 1312. https://doi.org/10.21105/joss.01312 ,

https://doi.org/10.21105/joss.01312. ↩︎Wickham H, Hester J, Chang W (2019).

*devtools: Tools to*.

Make Developing R Packages Easier

https://CRAN.R-project.org/package=devtools. ↩︎Carata L, Akoush S, Balakrishnan N, Bytheway T, Sohan R,

Seltzer M, Hopper A (2014). “A Primer on Provenance.”*Queue*,*12*(3),

10-23. ISSN 15427730,

http://dl.acm.org/citation.cfm?doid=2602649.2602651 ,

https://doi.org/10.1145/2602649.2602651. ↩︎Temple Lang D, Peng R, Nolan D, Becker G (2020).

*CodeDepends: Analysis of R Code for Reproducible Research and Code*. https://github.com/duncantl/CodeDepends. ↩︎

ComprehensionPasquier T, Lau MK, Trisovic A, Boose ER, Couturier B,

Crosas M, Ellison AM, Gibson V, Jones CR, Seltzer M (2017). “If these

data could talk.”*Scientific Data*,*4*, 170114. ISSN 2052-4463,

http://www.nature.com/articles/sdata2017114 ,

https://doi.org/10.1038/sdata.2017.114. ↩︎Bryan J, Hester J, Robinson D, Wickham H (2019).

*reprex:*.

Prepare Reproducible Example Code via the Clipboard

https://CRAN.R-project.org/package=reprex. ↩︎Landau WM (2020).

*drake: A Pipeline Toolkit for*.

Reproducible Computation at Scale

https://CRAN.R-project.org/package=drake. ↩︎

**leave a comment**for the author, please follow the link and comment on their blog:

**rOpenSci - open tools for open science**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.