Good Parameterisation in R

[This article was first published on some real numbers, and kindly contributed to R-bloggers].

Imagine you work in a large factory that produces complicated widgets. It is your job to control the production line settings, which must be reset each day to ensure the smooth operation of the factory. However, to change the settings you have to walk around turning dials and pressing buttons at various locations on the factory floor.

One morning you forget to turn the dial on an important machine, causing the production line to shut down completely. Your manager storms over, and you explain to him that it would be so much easier if you could just change the settings from one location!

One can think of a data science solution, automated or otherwise, as a factory that takes data as input and produces insight. Best practice is to parameterise all code and to place the parameters sensibly so that you can easily find and change them. Parameterised code is especially important in dashboard development. When a user interacts with a visual display, they should not be presented with a series of hard-coded outputs; rather, they should be changing parameters that drive uniquely generated results.

You pretty much have three options with respect to your code:

Unparameterised

Hard-coded settings are dispersed throughout your script, and to change them you must trawl right through it. This style of script is very prone to find-and-replace errors and is a nightmare to hand over if you leave your job.

Partially Parameterised

Settings can all be found in one logical place, such as the beginning of your script or a separate file that is sourced in, making them easy to change. This set-up also helps future users (including your future self!) fully understand what is going on.
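As a minimal sketch of the separate-file approach, the snippet below writes a small settings file and then pulls it in with source(). The file name settings.R and the variable chartTitle are assumptions made for illustration, and the file is written to a temporary directory only so that the sketch is self-contained.

```r
# Hypothetical settings file, written to a temporary directory so the sketch
# runs on its own; in a real project it would sit next to the analysis script.
settings_file <- file.path(tempdir(), "settings.R")
writeLines(c(
  "nClust <- 3",
  "chartTitle <- paste(nClust, 'cluster solution')"
), settings_file)

# The analysis script loads every setting with a single call.
source(settings_file)

print(nClust)
print(chartTitle)
```

Changing nClust in the settings file is then the only edit needed; the chart title follows automatically.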

Fully Parameterised

Functions are defined and all settings are passed in as arguments to the function, so a setting can be changed where the function is called without editing its body. This is the most elegant solution.

The example below illustrates these three options. The script takes a data frame, selects only numeric fields, calls the k-means algorithm and plots a coloured chart displaying cluster allocations. There are three settings that can be changed: the data frame, the number of clusters and the chart title.

The first code block is an example of unparameterised code, as the user must change all three settings manually by finding them in the script. The second code block is partially parameterised: setting nClust at the beginning selects k and simultaneously alters the chart title. The final code block wraps everything into a function, handling all three settings through its arguments.
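The listing below is a rough sketch of how the three variants might look, gathered into one script; it is an illustration and not the author's original code. It assumes the built-in iris and mtcars data sets and base R's kmeans() and plot(). The function name cluster_solution, the argument names dat and chartTitle, and the title wording are all assumptions; only nClust is named in the text.

```r
# --- 1. Unparameterised: data frame, k and title are hard-coded in place ---
numericCols <- sapply(iris, is.numeric)
set.seed(1)  # k-means starts from random centres; fix the seed to repeat runs
fit <- kmeans(iris[, numericCols], centers = 3)
plot(iris[, numericCols], col = fit$cluster, main = "3 cluster solution")

# --- 2. Partially parameterised: settings are collected at the top ---
dat <- iris
nClust <- 4  # selects k and also feeds the chart title below

numericCols <- sapply(dat, is.numeric)
set.seed(1)
fit <- kmeans(dat[, numericCols], centers = nClust)
plot(dat[, numericCols], col = fit$cluster,
     main = paste(nClust, "cluster solution"))

# --- 3. Fully parameterised: all three settings are function arguments ---
cluster_solution <- function(dat, nClust,
                             chartTitle = paste(nClust, "cluster solution")) {
  # Keep only the numeric fields; drop = FALSE preserves the data frame
  numericData <- dat[, sapply(dat, is.numeric), drop = FALSE]
  fit <- kmeans(numericData, centers = nClust)
  plot(numericData, col = fit$cluster, main = chartTitle)
  invisible(fit)  # return the fit without printing it
}

set.seed(1)
irisFit <- cluster_solution(iris, nClust = 3)
carsFit <- cluster_solution(mtcars, nClust = 2,
                            chartTitle = "Two clusters of cars")
```

In the third variant, switching data set, k or title means changing one call rather than editing the body of the script.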


