Structuring R projects
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
There are some things that I call Smith goods:1) things I want, nay, require, but hate doing. A clean room is one of these – I have a visceral need to have some semblance of tidiness around me, I just absolutely hate tidying, especially in the summer. Starting and structuring packages and projects is another of these things, which is why I’m so happy things like cookiecutter exist that do it for you in Python.
While I don’t like structuring R projects, I keep doing it, because I know it matters. That’s a pearl of wisdom that came occasionally at a great price.
I am famously laid back about structuring R projects – my chill attitude is only occasionally compared to the Holy Inquisition, the other Holy Inquisition and Gunny R. Lee Ermey’s portrayal of Drill Sgt. Hartman, and it’s been months since I last gutted an intern for messing up namespaces.2) So while I don’t like structuring R projects, I keep doing it, because I know it matters. That’s a pearl of wisdom that came occasionally at a great price, some of which I am hoping to save you by this post.
Five principles of structuring R projects
Every R project is different. Therefore, when structuring R projects, there has to be a lot more adaptability than there is normally When structuring R projects, I try to follow five overarching principles.
- The project determines the structure. In a small exploratory data analysis (EDA) project, you might have some leeway as to structural features that you might not have when writing safety-critical or autonomously running code. This variability in R – reflective of the diversity of its use – means that it’s hard to devise a boilerplate that’s universally applicable to all kinds of projects.
- Structure is a means to an end, not an end in itself. The reason why gutting interns, scalping them or yelling at them Gunny style are inadvisable is not just the additional paperwork it creates for HR. Rather, the point of the whole exercise is to create people who understand whythe rules exists and organically adopt them, understanding how they help.
- Rules are good, tools are better. When tools are provided that take the burden of adherence – linters, structure generators like cookiecutter, IDE plugins, &c. – off the developer, adherence is both more likely and simpler. And while you’re at it, make sure to write error messages that do not suck.
- Structures should be interpretable to a wide range of collaborators. Even if you have no collaborators, thinking from the perspective of an analyst, a data scientist, a modeller, a data engineer and, most importantly, the client who will at the very end receive the overall product.
- Structures should be capable of evolution. Your project may change objectives, it may evolve, it may change. What was a pet project might become a client product. What was designed to be a massive, error-resilient superstructure might have to scale down. And most importantly, your single-player adventure may end up turning into an MMORPG. Your structure has to be able to roll with the punches.
A good starting structure
Pretty much every R project can be imagined as a sort of process: data gets ingested, magic happens, then the results – analyses, processed data, and so on – get spit out. The absolute minimum structure reflects this:
. └── my_awesome_project ├── src ├── output ├── data │ ├── raw │ └── processed ├── README.md ├── run_analyses.R └── .gitignore
In this structure, we see this reflected by having a data/ folder (a source), a folder for the code that performs the operations (src/) and a place to put the results (output/). The root analysis file (the sole R file on the top level) is responsible for launching and orchestrating the functions defined in the src/ folder’s contents.
The data
folder
The data
folder is, unsurprisingly, where your data goes. In many cases, you may not have any file-formatted raw data (e.g. where the raw data is accessed via a *DBC
connection to a database), and you might even keep all intermediate files there, although that’s pretty uncommon on the whole, and might not make you the local DBA’s favourite (not to mention data protection issues). So while theraw/
subfolder might be dispensed with, you’ll most definitely need a data/ folder.
When it comes to data, it is crucial to make a distinction between source data and generated data. Rich Fitzjohn puts it best when he says to treat
- source data as read-only, and
- generated data as disposable.
The preferred implementation I have adopted is to have
- a
data/raw/
folder, which is usually is symlinked to a folder that is write-only to clients but read-only to the R user,3) - a
data/temp/
folder, which contains temp data, and - a
data/output/
folder, if warranted.
The src folder
Some call this folder R
– I find this a misleading practice, as you might have C++, bash
and other non-R code in it, but is unfortunately enforced by R if you want to structure your project as a valid R package, which I advocate in some cases. I am a fan of structuring the src/
folder, usually by their logical function. There are two systems of nomenclature that have worked really well for me and people I work with:
- The library model: in this case, the root folder of src/ holds individual .R scripts that when executed will carry out an analysis. There may be one or more such scripts, e.g. for different analyses or different depths of insight. Subfolders of src/ are named after the kind of scripts they contain, e.g. ETL, transformation, plotting. The risk with this structure is that sometimes it’s tricky to remember what’s where, so descriptive file names are particularly important.
- The pipeline model: in this case, there is a main runner script or potentially a small number. These go through scripts in a sequence. It is a sensible idea in such a case to establish sequential subfolders or sequentially numbered scripts that are executed in sequence. Typically, this model performs better if there are at most a handful distinct pipelines.
Whichever approach you adopt, a crucial point is to keep function definition and application separate. This means that only the pipeline or the runner scripts are allowed to execute (apply) functions, and other files are merely supposed to define them. Typically, folder level segregation works best for this:
- keep all function definitions in subfolders of src/, e.g. src/data_engineering, and have the directly-executable scripts directly under src/ (this works better for larger projects), or
- keep function definitions in src/, and keep the directly executable scripts in the root folder (this is more convenient for smaller projects, where perhaps the entire data engineering part is not much more than a single script).
output and other output folders
Output may mean a range of things, depending on the nature of your project. It can be anything from a whole D.Phil thesis written in a LaTeX-compliant form to a brief report to a client. There are a couple of conventions with regard to output folders that are useful to keep in mind.
Separating plot output
My personal preference is that plot output folders should be subfolders of output/
, rather than top-tier folders, unless the plots themselves are the objective.
It is common to have a separate folder for plots (usually called figs/
or plots/
), usually so that they could be used for various purposes. My personal preference is that plot output folders should be subfolders of output folders, rather than top-tier folders, unless they are the very output of the project. That is the case, for instance, where the project is intended to create a particular plot on a regular basis. This was the case, for instance, with the CBRD project whose purpose was to regularly generate daily epicurves for the DRC Zaire ebolavirus outbreak.
With regard to maps, in general, the principle that has worked best for teams I ran was to treat static maps as plots. However, dynamic maps (e.g. LeafletJS apps), tilesets, layers or generated files (e.g. GeoJSON files) tend to deserve their own folder.
Reports and reporting
For business users, automatically getting a beautiful PDF report can be priceless.
Not every project needs a reporting folder, but for business users, having a nice, pre-written reporting script that can be run automatically and produces a beautiful PDF report every day can be priceless. A large organisation I worked for in the past used this very well to monitor their Amazon AWS expenditure.4) A team of over fifty data scientists worked on a range of EC2 instances, and runaway spending from provisioning instances that were too big, leaving instances on and data transfer charges resulting from misconfigured instances5) was rampant. So the client wanted daily, weekly, monthly and 10-day rolling usage nicely plotted in a report, by user, highlighting people who would go on the naughty list. This was very well accomplished by an RMarkdown
template that was ‘knit
‘ every day at 0600 and uploaded as an HTML file onto an internal server, so that every user could see who’s been naughty and who’s been nice. EC2 usage costs have gone down by almost 30% in a few weeks, and that was without having to dismember anyone!6)
Probably the only structural rule to keep in mind is to keep reports and reporting code separate. Reports are client products, reporting code is a work product and therefore should reside in src/
.
Requirements and general settings
I am, in general, not a huge fan of outright loading whole packages to begin with. Too many users of R don’t realise that
- you do not need to attach (
library(package)
) a package in order to use a function from it – as long as the package is available to R, you can simply call the function aspackage::function(arg1, arg2, ...)
, and - importing a package using
library(package)
puts every single function from that package into the namespace, overwriting by default all previous entries. This means that in order to deterministically know what any given symbol means, you would have to know, at all times, the order of package imports. Needless to say, there is enough stuff to keep in one’s mind when coding in R to worry about this stuff.
However, some packages might be useful to import, and sometimes it’s useful to have an initialisation script. This may be the case in three particular scenarios:
- You need a particular locale setting, or a particularly crucial environment setting.
- It’s your own package and you know you’re not going to shadow already existing symbols.
- You are not using packrat or some other package management solution, and definitely need to ensure some packages are installed, but prefer not to put the clunky install-if-not-present code in every single thing.
In these cases, it’s sensible to have a file you would source before every top-level script – in an act of shameless thievery from Python, I tend to call this requirements.R, and it includes some fundamental settings I like to rely on, such as setting the locale appropriately. It also includes a CRAN install check script, although I would very much advise the use of Packrat over it, since it’s not version-sensitive.
Themes, house style and other settings
It is common, in addition to all this, to keep some general settings. If your institution has a ‘house style’ for ggplot2
(as, for instance, a ggthemr
file), for instance, this could be part of your project’s config. But where does this best go?I’m a big fan of keeping house styles in separate repos, as this ensures consistency across the board.
It would normally be perfectly fine to keep your settings in a config.R
file at root level, but a config/ folder is much preferred as it prevents clutter if you derive any of your configurations from a git submodule
. I’m a big fan of keeping house styles and other things intended to give a shared appearance to code and outputs (e.g. linting rules, text editor settings, map themes) in separate – and very, very well managed! – repos, as this ensures consistency across the board over time. As a result, most of my projects do have a config folder instead of a single configuration file.
It is paramount to separate project configuration and runtime configuration:
- Project configuration pertains to the project itself, its outputs, schemes, the whole nine yards. For instance, the paper size to use for generated LaTeX documents would normally be a project configuration item. Your project configuration belongs in your
config/
folder. - Runtime configuration pertains to parameters that relate to individual runs. In general, you should aspire to have as few of these, if any, as possible – and if you do, you should keep them as environment variables. But if you do decide to keep them as a file, it’s generally a good idea to keep them at the top level, and store them not as R files but as e.g. JSON files. There are a range of tools that can programmatically edit and change these file formats, while changing R files programmatically is fraught with difficulties.
Keeping runtime configuration editable
A few years ago, I worked on a viral forecasting tool where a range of model parameters to build the forecast from were hardcoded as R variables in a runtime configuration file. It was eventually decided to create a Python-based web interface on top of it, which would allow users to see the results as a dashboard (reading from a database where forecast results would be written) and make adjustments to some of the model parameters. The problem was, that’s really not easy to do with variables in an R file.
On the other hand, Python can easily read a JSON file into memory, change values as requested and export them onto the file system. So instead of that, the web interface would store the parameters in a JSON file, from which R would then read them and execute accordingly. Worked like a charm. Bottom line – configurations are data, and using code to store data is bad form.
Dirty little secrets
Everybody has secrets. In all likelihood, your project is no different: passwords, API keys, database credentials, the works. The first rule of this, of course, is never hardcode credentials in code. But you will need to work out how to make your project work, including via version control, while also not divulging credentials to the world at large. My preferred solutions, in order of preference, are:
- the
keyring
package, which interacts with OS X’s keychain, Windows’s Credential Store and the Secret Service API on Linux (where supported), - using environment variables,
- using a secrets file that is
.gitignore
d, - using a config file that’s
.gitignore
d, - prompting the user.
Let’s take these – except the last one, which you should consider only as a measure of desperation, as it relies on RStudio and your code should aspire to run without it – in turn.
Using keyring
keyring
is an R package that interfaces with the operating system’s keychain management solution, and works without any additional software on OS X and Windows.7) Using keyring
is delightfully simple: it conceives of an individual key
as belonging to a keyring
and identified by a service
. By reference to the service, it can then be retrieved easily once the user has authenticated to access the keychain. It has two drawbacks to be aware of:
- It’s an interactive solution (it has to get access permission for the keychain), so if what you’re after is R code that runs quietly without any intervention, this is not your best bet.
- A key can only contain a username and a password, so it cannot store more complex credentials, such as 4-ple secrets (e.g. in OAuth, where you may have a consumer and a publisher key and secret each). In that case, you could split them into separate keyring keys.
However, for most interactive purposes, keyring
works fine. This includes single-item secrets, e.g. API keys, where you can use some junk as your username and hold only on to the password. For most interactive purposes, keyring
works fine. This includes single-item secrets, e.g. API keys.
By default, the operating system’s ‘main’ keyring is used, but you’re welcome to create a new one for your project. Note that users may be prompted for a keychain password at call time, and it’s helpful if they know what’s going on, so be sure you document your keyring calls well.
To set a key, simply call keyring::key_set(service = "my_awesome_service", username = "my_awesome_user)
. This will launch a dialogue using the host OS’s keychain handler to request authentication to access the relevant keychain (in this case, the system keychain, as no keychain is specified), and you can then retrieve
- the username: using
keyring::key_list("my_awesome_service")[1,2]
, and - the password: using
keyring::key_get("my_awesome_service")
.
Using environment variables
The thing to remember about environment variables is that they’re ‘relatively private’: everyone in the user session will be able to read them.
Using environment variables to hold certain secrets has become extremely popular especially for Dockerised implementations of R code, as envvars can be very easily set using Docker. The thing to remember about environment variables is that they’re ‘relatively private’: they’re not part of the codebase, so they will definitely not accidentally get committed to the VCS, but everyone who has access to the particular user session will be able to read them. This may be an issue when e.g. multiple people are sharing the ec2-user
account on an EC2 instance. The other drawback of envvars is that if there’s a large number of them, setting them can be a pain. R has a little workaround for that: if you create an envfile called .Renviron
in the working directory, it will store values in the environment. So for instance the following .Renviron
file will bind an API key and a username:
api_username = "my_awesome_user" api_key = "e19bb9e938e85e49037518a102860147"
So when you then call Sys.getenv("api_username")
, you get the correct result. It’s worth keeping in mind that the .Renviron
file is sourced once, and once only: at the start of the R session. Thus, obviously, changes made after that will not propagate into the session until it ends and a new session is started. It’s also rather clumsy to edit, although most APIs used to ini
files will, with the occasional grumble, digest .Renviron
s.
Needless to say, committing the .Renviron
file to the VCS is what is sometimes referred to as making a chocolate fireman in the business, and is generally a bad idea.
Using a .gitignore
d config or secrets file
config
is a package that allows you to keep a range of configuration settings outside your code, in a YAML file, then retrieve them. For instance, you can create a default configuration for an API:
default: my_awesome_api: url: 'https://awesome_api.internal' username: 'my_test_user' api_key: 'e19bb9e938e85e49037518a102860147'
From R, you could then access this using the config::get()
function:
my_awesome_api_configuration <- config::get("my_awesome_api")
This would then allow you to e.g. refer to the URL as my_awesome_api_configuration$url,
and the API key as my_awesome_api_configuration$api_key
. As long as the configuration YAML file is kept out of the VCS, all is well. The problem is that not everything in such a configuration file is supposed to be secret. For instance, it makes sense for a database access credentials to have the other credentials DBI::dbConnect()
needs for a connection available to other users, but keep the password private. So .gitignore
ing a config file is not a good idea.A dedicated secrets file is a better place for credentials than a config file, as this file can then be wholesale .gitignore
d.
A somewhat better idea is a secrets file. This file can be safely .gitignore
d, because it definitely only contains secrets. As previously noted, definitely create it using a format that can be widely written (JSON, YAML). For reasons noted in the next subsection, the thing you should definitely not do is creating a secrets file that consists of R variable assignments, however convenient an idea that may appear at first. Because…
Whatever you do…
One of the best ways to mess up is creating a fabulous way of keeping your secret credentials truly secret… then loading them into the global scope. Never, ever assign credentials. Ever.
You might have seen code like this:
dbuser <- Sys.getenv("dbuser") dbpass <- Sys.getenv("dbpass") conn <- DBI::dbConnect(odbc::odbc(), UID = dbuser, PWD = dbpass)
Never, ever put credentials into any environment if possible – especially not into the global scope.
This will work perfectly, except once its done, it will leave the password and the user name, in unencrypted plaintext (!), in the global scope, accessible to any code. That’s not just extremely embarrassing if, say, your wife of ten years discovers that your database password is your World of Warcraft character’s first name, but also a potential security risk. Never put credentials into any environment if possible, and if it has to happen, at least make it happen within a function so that they don’t end up in the global scope. The correct way to do the above would be more akin to this:
create_db_connection <- function() { DBI::dbConnect(odbc::odbc(), UID = Sys.getenv("dbuser"), PWD = Sys.getenv("dbpass")) %>% return() }
Concluding remarks
Structuring R projects is an art, not just a science. Many best practices are highly domain-specific, and learning these generally happens by trial and pratfall error. In many ways, it’s the bellwether of an R developer’s skill trajectory, because it shows whether they possess the tenacity and endurance it takes to do meticulous, fine and often rather boring work in pursuance of future success – or at the very least, an easier time debugging things in the future. Studies show that one of the greatest predictors of success in life is being able to tolerate deferred gratification, and structuring R projects is a pure exercise in that discipline.Structuring R projects is an art, not just a science. Many best practices are highly domain-specific, and learning these generally happens by trial and error.
At the same time, a well-executed structure can save valuable developer time, prevent errors and allow data scientists to focus on the data rather than debugging and trying to find where that damn snippet of code is or scratching their head trying to figure out what a particularly obscurely named function does. What might feel like an utter waste of time has enormous potential to create value, both for the individual, the team and the organisation.As long as you keep in mind why structure matters and what its ultimate aims are, you will arrive at a form of order out of chaos that will be productive, collaborative and useful.
I’m sure there are many aspects of structuring R projects that I have omitted or ignored – in many ways, it is my own experiences that inform and motivate these commentaries on R. Some of these observations are echoed by many authors, others diverge greatly from what’s commonly held wisdom. As with all concepts in development, I encourage you to read widely, get to know as many different ideas about structuring R projects as possible, and synthesise your own style. As long as you keep in mind why structure matters and what its ultimate aims are, you will arrive at a form of order out of chaos that will be productive, collaborative and mutually useful not just for your own development but others’ work as well.
My last commentary on defensive programming in R has spawned a vivid and exciting debate on Reddit, and many have made extremely insightful comments there. I’m deeply grateful for all who have contributed there. I hope you will also consider posting your observations in the comment section below. That way, comments will remain together with the original content.
References
1. | ↑ | As in, Adam Smith. |
2. | ↑ | That’s a joke. I don’t gut interns – they’re valuable resources, HR shuns dismembering your coworkers, it creates paperwork and I liked every intern I’ve ever worked with – but most importantly, once gutted like a fish, they are not going to learn anything new. I prefer gentle, structured discussions on the benefits of good package structure. Please respect your interns – they are the next generation, and you are probably one of their first example of what software development/data science leadership looks like. The waves you set into motion will ripple through generations, well after you’re gone. You better set a good example. |
3. | ↑ | Such a folder is often referred to as a ‘dropbox’, and the typical corresponding octal setting, 0422 , guarantees that the R user will not accidentally overwrite data. |
4. | ↑ | The organisation consented to me telling this story but requested anonymity, a request I honour whenever legally possible. |
5. | ↑ | In case you’re unfamiliar with AWS: it’s a cloud service where elastic computing instances (EC2 instances) reside in ‘regions’, e.g. us-west-1a . There are (small but nonzero) charges for data transfer between regions. If you’re in one region but you configure the yum repo server of another region as your default, there will be costs, and, eventually, tears – provision ten instances with a few GBs worth of downloads, and there’ll be yelling. This is now more or less impossible to do except on purpose, but one must never underestimate what users are capable of from time to time! |
6. | ↑ | Or so I’m told. |
7. | ↑ | Linux users will need libsecret 0.16 or above, and sodium . |
The post Structuring R projects appeared first on Bits and Bugs.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.