Customising ProjectTemplate in R

[This article was first published on Jeromy Anglim's Blog: Psychology and Statistics, and kindly contributed to R-bloggers].

This post talks about my workflow for getting started with a new data analysis project using the ProjectTemplate package.

Overview of ProjectTemplate

ProjectTemplate is an R package which facilitates data analysis, encourages good data analysis habits, and standardises many data analytic steps. After many years of refining a data analysis workflow in R, I realised that I’d basically converged on something similar to ProjectTemplate anyway. However, my approach was not quite as systematic, and it took more effort than necessary to get started on a new project. Thus, since late 2013, I’ve been using ProjectTemplate to organise my R data analysis projects.

While I have found ProjectTemplate to be an excellent tool, I realised that when I created a new data analysis project based on ProjectTemplate, I was repeatedly making a large number of customisations to the initial set of files and folders. Thus, I’ve now set up a repository to store these customisations so that I can get started on a new data analysis project more efficiently. The purpose of this post is to document these modifications.

This post assumes a reasonable knowledge of R and ProjectTemplate. If you’re not familiar with ProjectTemplate, you could check out the ProjectTemplate website, focusing particularly on the Getting Started section. If you’re really keen, you could also watch an hour-long video on ProjectTemplate, RStudio, and GitHub.

General setup

I have a copy of my customised version of the ProjectTemplate directory and file structure on GitHub in the AnglimModifiedProjectTemplate repository. Specifically, it has:

  1. modifications to global.dcf as described below
  2. a blank readme.md
  3. the removal of a couple of directories that I don’t use (e.g., diagnostics, logs, profiling)
  4. an initial rmd file in the reports directory with the customisations mentioned below
  5. an .Rproj RStudio project file to enable easy launching of RStudio
  6. an additional output directory for storing tabular, text, and other output

Thus, whenever I want to start a new data analysis project, I can download and extract the zip file of the repository from GitHub.

After creating a project folder, the following steps can then be skipped when using my customised template.

  • Open RStudio and create RStudio Project in existing directory
  • Create ProjectTemplate folder structure with library(ProjectTemplate); create.project()
  • Move ProjectTemplate files into folder
  • Modify global.dcf
  • Setup rmd reports
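
For comparison, the manual route that the template replaces can be sketched roughly as follows. The project name `my-analysis` is a placeholder, and the call relies on the defaults of `create.project()`, which may differ between ProjectTemplate versions.

```r
library(ProjectTemplate)

# Create the standard ProjectTemplate folder structure in a new
# sub-directory of the current working directory.
# "my-analysis" is a hypothetical project name.
create.project("my-analysis")

# Inspect what was created before editing config/global.dcf by hand.
list.files("my-analysis")
```

Starting from the customised repository replaces this step and the hand edits that follow it.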

I also document below a few additional points about subsequent steps including:

  • Setting up the data directory
  • Updating the readme file
  • Setting up a git repository

Modifying global.dcf

My preferred starting global.dcf settings are

data_loading: on
cache_loading: off
munging: on
logging: off
load_libraries: on
libraries: psych, lattice, Hmisc
as_factors: off
data_tables: off

A little explanation:

  • as_factors: I do quite a bit of string processing, particularly on meta data and on output tables. I find the automatic conversion of strings into factors really annoying, so off is my preferred setting.
  • load_libraries: I always have additional libraries, so it makes sense to have this on.
  • libraries: There are many common packages that I use, but I almost always make use of the above comma-separated list of packages.
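
As a quick sanity check, the settings can be read back with base R's `read.dcf()`. This is only a sketch: it writes the settings to a temporary file so that it runs anywhere, and it does not show how ProjectTemplate itself parses the file. In a real project the file to read would be config/global.dcf.

```r
# Write the settings from the text to a temporary file so the sketch is
# self-contained; in a real project the file is config/global.dcf.
dcf_path <- tempfile(fileext = ".dcf")
writeLines(c(
  "data_loading: on",
  "cache_loading: off",
  "munging: on",
  "logging: off",
  "load_libraries: on",
  "libraries: psych, lattice, Hmisc",
  "as_factors: off",
  "data_tables: off"
), dcf_path)

# read.dcf() returns a character matrix with one column per field.
settings <- read.dcf(dcf_path)

# Split the comma-separated package list into a character vector.
packages <- trimws(strsplit(settings[1, "libraries"], ",")[[1]])

# Report which of the listed packages are not yet installed.
missing_packages <- setdiff(packages, rownames(installed.packages()))
print(settings[1, c("as_factors", "load_libraries")])
print(missing_packages)
```

Installing anything listed in `missing_packages` before calling `load.project()` avoids a failure at the library-loading step.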

Setup rmd files

Basics of such files

I generally create a couple of rmd files in the reports directory (if you’re unfamiliar with RMarkdown, see this earlier post on RMarkdown). The first line in the first chunk is always:

```{r}
library(ProjectTemplate); load.project()
```

This loads everything required to get started with the project.

RMarkdown in reports

In ProjectTemplate, you would typically store RMarkdown documents in the reports directory. However, if you then try to compile that file in RStudio, you will realise that RStudio will treat the directory that contains the RMarkdown file as the working directory. In order to ensure that the working directory is the same as the project directory, add the following text to the top of your RMarkdown file.

`r opts_knit$set(root.dir='..')`

Explanation

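  • A backtick followed by r, and then a closing backtick, delimits inline R code. I set this option inline at the top of the file so that it takes effect before any code chunk is evaluated.
  • opts_knit$set() sets knitr's package-level options (as opposed to opts_chunk$set(), which sets defaults for chunks).
  • '..' sets the working directory to the parent of the default, which is the project directory when the file is stored in reports.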
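
To see why `'..'` is the right value, the following base R sketch builds a throwaway project folder under `tempdir()` (the folder names are placeholders) and checks that the parent of `reports` is the project directory.

```r
# Hypothetical project layout created under tempdir().
project_dir <- file.path(tempdir(), "rootdir-demo")
reports_dir <- file.path(project_dir, "reports")
dir.create(reports_dir, recursive = TRUE, showWarnings = FALSE)

# Mimic compiling a file stored in reports/: start from that folder.
old_wd <- setwd(reports_dir)

# '..' resolves to the folder one level up, i.e. the project directory.
parent_is_project <- normalizePath("..") == normalizePath(project_dir)
print(parent_is_project)

# Restore the original working directory.
setwd(old_wd)
```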

Setup data folder

ProjectTemplate automatically names resulting data.frames with a name based on the file name. This is convenient. However, it is often the case that the file names need to be changed from some raw data supplied or it may be that the original data format is not perfectly suited for importing. In that case, I store the raw data in a separate folder called raw-data and then export or create a copy in the desired format with the desired name in the data folder.
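
A minimal sketch of this raw-data to data step is below. All folder and file names are made up for illustration, and everything is created under `tempdir()` so the example does not touch a real project. The assumption is that a file called participants.csv in the data folder would be loaded as a data.frame named `participants`.

```r
# Hypothetical folder and file names, created under tempdir() so the
# sketch can be run anywhere without touching a real project.
project_dir <- file.path(tempdir(), "raw-data-demo")
raw_dir <- file.path(project_dir, "raw-data")
data_dir <- file.path(project_dir, "data")
dir.create(raw_dir, recursive = TRUE, showWarnings = FALSE)
dir.create(data_dir, showWarnings = FALSE)

# Simulate a supplied file with an awkward name and a semicolon separator.
raw_file <- file.path(raw_dir, "Export 2014 (final).txt")
writeLines(c("id;score", "1;10", "2;12"), raw_file)

# Read the raw file with explicit options, then save a clean copy whose
# file name matches the desired data frame name.
participants <- read.table(raw_file, header = TRUE, sep = ";",
                           stringsAsFactors = FALSE)
write.csv(participants, file.path(data_dir, "participants.csv"),
          row.names = FALSE)

list.files(data_dir)
```

The raw file stays untouched in raw-data, so the conversion can be repeated if the supplied data changes.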

Overriding default data import options

Some data files cannot be imported using the default data import rules. Of course, you can change the file to comply with the rules. Alternatively, I think the standard solution is to add a file in the lib directory (e.g., data-override.r) that imports the data files. Give the imported data.frame the same name that ProjectTemplate would have given it.
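
What such an override script might contain is sketched below. The file name, the helper `read_survey()`, and the import rules are all hypothetical. The example file is written to a raw-data folder under `tempdir()` so the block runs on its own; in a project the script would sit in lib and read from the project's own folders.

```r
# Sketch of a lib/data-override.r style script. Paths are hypothetical and
# point into tempdir() so the example is self-contained.
raw_dir <- file.path(tempdir(), "override-demo", "raw-data")
dir.create(raw_dir, recursive = TRUE, showWarnings = FALSE)

# A tab-separated file with two note lines above the header and "-99"
# used as the missing-value code.
survey_file <- file.path(raw_dir, "survey.txt")
writeLines(c("# exported by survey tool", "# wave 1",
             "id\tage", "1\t34", "2\t-99"), survey_file)

# Hypothetical helper: import with rules the defaults would not apply.
read_survey <- function(path) {
  read.delim(path, skip = 2, na.strings = "-99", stringsAsFactors = FALSE)
}

# Use the name that would be derived from the file name ("survey").
survey <- read_survey(survey_file)
str(survey)
```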

Update readme

I rename the file to README.md to make it clear that it is a markdown-formatted file. I can then add a little information about the project.

Setup git repository

If using GitHub, I create a new repository on GitHub.

Output folder

A common workflow for me is to generate table, text, and figure output from the script, which is then incorporated into a manuscript document. While I really like Sweave and RMarkdown, I often find it more practical to write a manuscript in Microsoft Word. I use the output folder to store tabular output, standard text output, and figures.

In the case of tabular output, there is the task of ensuring the table is formatted appropriately (e.g., desired number of decimal places, cell alignment, cell borders, font, cell merging, etc.). I typically find this easiest to do in Excel. Thus, I have a file called output-processing.xlsx. I import the tabular data into this file and apply relevant formatting. This can then be incorporated into the manuscript. Here are a few more notes about Table conversion in MS Word.
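
The three kinds of output can be written with base R along the following lines. This is a sketch only: the model, the file names, and the use of the built-in `mtcars` data are illustrative, and the output folder is created under `tempdir()` rather than at the root of a real project.

```r
# Hypothetical output location under tempdir(); in a project this would be
# the output directory at the project root.
output_dir <- file.path(tempdir(), "output-demo", "output")
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)

# Example model fitted to a data set that ships with R.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Tabular output: coefficient table, rounded, for later formatting in Excel.
coef_table <- round(summary(fit)$coefficients, 3)
write.csv(coef_table, file.path(output_dir, "coefficients.csv"))

# Standard text output: capture what would be printed to the console.
writeLines(capture.output(summary(fit)),
           file.path(output_dir, "model-summary.txt"))

# Figure output: write a plot to a PDF file.
pdf(file.path(output_dir, "mpg-by-weight.pdf"), width = 6, height = 4)
plot(mpg ~ wt, data = mtcars)
dev.off()
```

The csv file can then be imported into output-processing.xlsx for formatting.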
