Personal Coding Conventions

February 9, 2018
By

[This article was first published on r on Tony ElHabr, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

As a person who’s worked with various programming languages over time, I
have become interested in the nuances and overlaps among languages. In
particular, concepts related to code syntax and organization–everything
from technical concepts such as lexical scoping, to more broad concepts
such as importing and naming data–really fascinate me. Organization
“enthusiasts” like me truly appreciate software/applications that follow
consistent norms.

In the R community, the tidyverse ecosystem has become extremely
popular because it implements a consistent, intuitive framework that is,
consequently, easy to learn. Even aside from the tidyverse, R’s
system for package development, documentation, etc. has strong
community-wide standards. 1 In all,
there’s something magical about R’s framework that has made it my
favorite language to use.

While I tend to follow most of the R community’s “best practices”, there
are certain “traditional” or well-agreed upon aspects of R programming
where I knowingly deviate from the norm. I think
spelling out exactly why I follow the norm on some fronts and not on
others is a good exercise in self-understanding, if not just a fun
exercise in seeing how many people disagree with me.

Setup

  • Use quotes (i.e standard evaluation) when importing a package.
library("dplyr")
# instead of
library(dplyr)

I believe most people use the non-standard evaluation (NSE) approach,
but I like not having to use NSE when it is not really intended to be
used as the primary means of invoking a function. (Note that with some
packages, such as dplyr, NSE is intended to be the primary means
of working with variables and functions. In that case, I would not go
out of my way to invoke standard evaluation methods.)

  • Import tidyverse packages explicitly instead of importing the
    tidyverse package itself.
library("dplyr")
library("stringr")
# instead of 
library("tidyverse")

Again, I think I’m probably in the minority on this one (although most
people may not even have an opinion on which method is “best”). One can
argue that the whole point of bundling the six or so packages that get
loaded with library("tidyverse") is to “abstract away” having to think
about what packages will be necessary for a specific analysis.

  • Do not import a package if it is only being used once or twice.
    Instead, use the package::function notation.
iris_cleaned <- janitor::clean_names(iris)
# instead of
library("janitor")
iris_cleaned <- clean_names(iris)
# ... Then go on to not use janitor again.

Additionally, I may sometimes use the package::function notation even
after importing a package if the function name might be confused with a
variable name. For example, I might specify lubridate::year() even
after calling library("lubridate") because year might seem like a
verb. (There’s more on that in another section.)

I’m not really sure if there is a general consensus or not on this
matter.

  • Favor sprintf() over paste() (or another similar function) if
    inserting numerics or strings in some kinds of “boiler-plate”
    string.

I think sprintf() offers slightly better “find-and-replace” syntax,
which can be useful when re-using code or even when debugging. Also, I
think that it makes the underlying string more readable in the case that
there are a lot of variables to substitute in to the string.
Nonetheless, my justification in this matter is not unwavering–one can
argue that paste() has the advantage of more clarity depending on the
length of the boiler-plate string and the number of variables to be
inserted. (In fact, I often find myself switching among sprintf(),
paste(), paste0(), and stringr::str_c() with no real discretion,
so I can by hypocritical on this matter.)

person <- "Tony"
sprintf("%s is a wonderful person.", person)
# is not really any easier or more explicit than
paste(person, "is a wonderful person.")

# but for a case like this
team_1 <- "Yankees"
team_2 <- "Red Sox"
score_1 <- 3
score_2 <- 4

sprintf("The %s defeated the %s by a score of %0.f to %0.f.", team_1, team_2, score_1, score_2)
# is probably better than 
paste("The", team_1, "defeated the", team_2, "a score of ", score_1, "to", score_2, ".")
  • Favor stringr functions over their base R equivalents (i.e.
    gsub, grep, etc.).

My behavior in this matter is probably not all that controversial, but I
just wanted to make a note of it because I sometimes catch myself using
functions like gsub() and grep() in “tidy” pipelines, only to
realize that a str_replace() or str_detect() would save me some
effort because I don’t have to use a . to anonymously specify the
x/string argument. Nonetheless, I have found that using the
stringr functions is not necessarily second nature for me because the
base R methods shown in many Stack Overflow
have ingrained in me non-tidyverse style. (I should be clear–these
other techniques are not “bad” or “incorrect” in any way.)

Naming

Files and Folders

For naming files and folders, Jenny Bryan’s informative presentation
“Naming
Things”

is a must-read. I do my best to follow the principles she outlines,
while also adding in some of my own “flair”.

  • Use “-” as a separator in file names to distinguish different
    components (e.g. ...scrape-[website]-.... Otherwise, use an
    underscore “_” as a separator for words that make up a single
    topic/entity/component (e.g. ...-new_zealand-...)

I don’t think my choices in this matter are too controversial.

  • Add a numeric prefix for files that are to be executed in a
    pre-defined order.

01-scrape.R
02-process.R
03-analyze.R
04-report.R

I don’t think my convention is unusual here–there is no conflict with
Jenny’s principles.

Notably, I like to add a “main” file, i.e. 00-main.R, that sources
each of the files in a workflow. Something like GNU Make could also be
used, but I like to stick with R tools exclusively. In that case, an R
package like remake or drake could be used, but typically my project
workflow is not so complex that it necessitates the added functionality
that these packages present. Jenny discusses these things in her All
the Automation Things
Topics
page
Topics page for her
class.

Sometimes my project has a “configuration”, i.e. config.R or
“functions” functions-analyze.R file that might need to be “sourced”
at the beginning of a workflow. I tend to not added numeric prefixes to
these files, and I source them in at the beginning of my “main” script.

  • Add numbers for file suffixes, i.e. -v01, -v02, etc., for
    versioning.

This practice of mine is probably atypical. I realize that this is not
really necessary if using git or some other version control service, but
I have a difficult time completely deleting a file that I am not 100%
sure I will never need to look back at. (I generally oppose “hoarding”
of things in this manner, but some habits are hard to break).) In my
opinion, keeping the file is simply easier than restoring previous
snapshots in a git life cycle. Once I’m completely convinced that I no
longer have use for an “old” file, then I delete it.

Functions

For functions, I do my best to implement the principles outlined in the
Functions chapter of the R for Data Science
e-book
. The notion of using verbs
for function names (unless the function is so simple that a noun might
be more appropriate) makes a lot of sense to me.

Perhaps the reader might find it interesting that I like to use the verb
prefix compute_ for functions that use dplyr functions in a piped
action simply because it is not extremely common and, consequently, not
likely to interfere with an existing function name from an imported
package. I think my logic on this matter is in tune with general
practices–many packages nowadays use package-specific prefixes for their
functions in order to distinguish them (e.g. the str_ prefix for
stringr functions).

Additionally, sometimes I’ll use the simple verb prefix get_ (a la the
get and set methods of an object class in an object-oriented
programming framework) if the function is fairly simple and there is no
other intuitive name to use.

Variables

I think, in general, my conventions with variables are “acceptable”.
Personally, I implement “snake case” for my variable names, although I
have no qualms with those who use some other convention (e.g. “Camel
Case”) as long as they are consistent. Interestingly, I went through a
phase with R programming where I implemented a form of the Visual Basic
style of prefixing variable names with the class type of the variable
(i.e. strVariable for a string, intVariable for an integer, etc.).
2 I was prefixing my character variables
with ch_, data-like variables (i.e. data.frame/matrix with d_,
etc.). This practice is probably not all too abhorrent, but I think its
more verbose than is necessary, and goes against the original
inspiration for R, a language created by statisticians in order to
make scientific work more natural for those with little to no experience
with computer science.

Variables for Files and Folders

I tend to break up the names of files into the following components:

  • dir: Directory. i.e. what would be returned by
    gsub(basename(.), "", .)
  • filename: Name of the file, without the directory or extension.
    i.e. what would be returned by basename(.).
  • ext: File extension, without the “.”. i.e. What would be returned
    by tools::file_ext(.).

When combining these three components, I name the variable filepath
(using the OS-agnostic file.path() function). Perhaps the more
traditional name for such a thing is simply path, but I tend to think
that that term is a bit ambiguous.

# Should be a relative path!
dir <- "/path/to/file/"
filename <- "mysupercoolfile"
ext <- "txt"
filepath <- file.path(dir, paste0(filename, ".", ext))
# Don't care if an extra forward-slash is added here. It will still work.
filepath
## [1] "/path/to/file//mysupercoolfile.txt"

Variables for Dates

Because I often work with dates, I have an opinion on naming
date-related variables.

  • I like to use yyyymmdd (as opposed to something maybe more
    intuitive like date or ymd) to refer to a date in the ISO 8601
    international standard format (i.e. “2017-03-14” for March 3, 2017).

I feel that this format is more explicit and stands out. Similarly, if
assigning a number for a year, month, or day, I like to use yyyy,
mm, and dd respectively (even if the month or day is a single
digit). Others may reason to use year, month, and day, but these
names conflict directly with functions in the lubridate package (which
I often use for converting numbers representing some kind of time/date).
The very short y, m, and d might be another option, but I think
each can be can be extremely ambiguous. Is d a day or date? Or data?
Is m a month or minute? Is y a year or the response variable in a
regression model?

Likewise, my logic advocating the use of yyyymmdd for dates explains
my preference for hhmmss for time variables (and hh, mm, ss for
individual components of time-related variables).

Other Objects

  • I prefer to use colname instead of col_name when referencing the
    name of a column in a data.frame/matrix. Since I don’t think
    col (for column) and name really represent separate entities, I
    don’t think it’s best to separate them with an underscore. (This
    convention also explains my preference for filename instead of
    file_name, as well as filepath instead of file_path.)

My stance on this matter might be “out of sync” with the tidyverse
mantra. Note that the readr package uses col_names as a parameter.

  • I prefer col instead of var when referencing the values in a
    data.frame/matrix. var can be ambiguous. (Is it a variable in
    a formula?)

I think I’ve seen someone high-up in the R community (Hadley?) mention
that they prefer col over var as well, so maybe my practice is
“advisable”, if not the norm.

  • I prefer data as the name of the data parameter in a function
    (assuming that the operation is being performed on a data.frame),
    as opposed to df,dat, or d. In the case of a function that
    operates on columns/rows in a data.frame/matrix, I prefer the
    name x.

I think there’s probably a wide range of styles that people have
regarding this specific matter, especially the data parameter.

  • I try to truncate longer names or actions to 3 or 4 letters. For
    example, For example, I often use proc instead of processed to
    indicate that a data object has been transformed in some manner. (In
    this example, proc would be appended on the name of the input
    variable (i.e. data_proc if the original variable is data.) Some
    of my other “go-to”’s include tidy and summ (instead of
    summarized). Notably, I like to pair trn and tst when working
    with training and testing sets for machine learning.

  • I like to use out for the output variable of a function,
    irregardless of what the function does (assuming that there are more
    than a single action in the function, which would require assignment
    to a variable).

I think this is fairly common. In fact, I think I picked up this
convention when studying other people’s package code.

  • I like to end variables with the “s” suffix when I know that it is a
    vector. For example, I might use yyyymmdds for a vector of dates.
yyyymmdds <- c(as.Date("2018-01-01"), as.Date("2018-02-01"))

This technique can sometimes result in awkward names, which makes me
think that most people would be against it. (Perhaps the more likely
case is that most people would see the difference as trivial and ask me
why I’m thinking so hard about this anyways.)

Projects and Workflow

I draw heavy inspiration on my format of a project skeleton from the
ideas laid out in Nice R’s blog post on
projects
and in
Carl Boettiger’s description of his
workflow
.

  • For project skeleton, emulate the structure of an R package as
    closely as possible.

To be explicit, my projects tend to look like this.

[project_name]/
|--data/
|--data-raw/
|--docs/
|--figs/
|--output/
|--R/

A short explanation of each: + data/: Data created from scripts
(e.g. a web scraper) that are not intended to be shared as part of
the project’s output. Preferably stored in an R data format (e.g. .rds
file). + data-raw/: Manually-collected data. Probably saved as a .csv
or .xlsx file. Appropriate location for SQL files and output if a
database connection (i.e. with a company/enterprise database) cannot be
made directly through R. + docs/: Documentation of the project.
Preferably RMarkdown or Markdown file(s). + figs/: Ad-hoc
visualizations created interactively that are not designated for the
project’s output. + output/: Deliverables. Preferably RMarkdown files
or files knitted from RMarkdown. Possibly includes nested data or
figs folders. + R/: R code. .R files that function primarily as
scripts or host a set of functions that are “sourced” into a script file
are appropriate.

(See Hadley Wickham’s book on R Packages
for more explanation.)

Probably the most significant difference between my structure and that
of an R package is the output/ folder (which might be considered the
analogue of a vignettes/ folder in an R package).

Regarding project documentation, I prefer the docs/ name for the
documentation folder, as opposed to the R package standard of man/. In
my opinion, docs/ is more appropriate because it seems to be the
standard name for a folder hosting extensive documentation of a package
(as opposed to man/, which is for package functions). (Additionally, I
should note that the pkgdown R package for creating a website for an R
package and its documentation generates a docs/ folder.)

Finally, in compliance with standard R package development, one should
add folders for src/ and tests/ for C code and testing scripts. (I
personally have never written C code or tests for my projects.)

  • Create an old/ folder in the same location that a no-longer-used
    file was previously used.

This practice of mine goes hand-in-hand with my “manual” versioning of
files. When I begin using a newer version of a script, I’ll typically
put the previous version(s) in an old/ folder.

Conclusion

In general, Hadley Wickham and Jenny Bryan are the “go-to” knowledge
resources in the R community on all things related to R programming
style, practices, etc.

Of course, these resources represent only a very small subset of amazing
resources out there. I know I could go on for days reading about style,
syntax, etc., but maybe that’s just me. 3



  1. Of course, other programming languages/communities have their own sets of standards that deserve merit.
    ^
  2. Note that Camel Case is the preferred style amongst the Visual Basic community.
    ^
  3. If you enjoy this topic, then you probably enjoy classic software-related debates such as “spaces vs. tabs” and “R vs. python”.
    ^

To leave a comment for the author, please follow the link and comment on their blog: r on Tony ElHabr.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)