This post describes my workflow for getting started with a new data analysis project using the ProjectTemplate package.
Overview of ProjectTemplate
ProjectTemplate is an R Package which facilitates data analysis, encourages good data analysis habits, and standardises many data analytic steps. After many years of refining a data analysis workflow in R, I realised that I'd basically converged on something similar to ProjectTemplate anyway. However, my approach was not quite as systematic, and it took more effort than necessary to get started on a new project. Thus, since late 2013, I've been using ProjectTemplate to organise my R data analysis projects.
While I have found ProjectTemplate to be an excellent tool, I realised that when I created a new data analysis project based on ProjectTemplate, I was repeatedly making a large number of customisations to the initial set of files and folders. Thus, I've now set up a repository to store these customisations so that I can get started on a new data analysis project more efficiently. The purpose of this post is to document these modifications.
This post assumes a reasonable knowledge of R and ProjectTemplate. If you're not familiar with ProjectTemplate, you could check out the ProjectTemplate website, focusing particularly on the Getting Started section. If you're really keen, you could also watch an hour-long video on ProjectTemplate, RStudio, and GitHub.
I have a copy of my customised version of the ProjectTemplate directory and file structure on GitHub in the AnglimModifiedProjectTemplate repository. Specifically, it has:
- Modifications to global.dcf as described below,
- a blank
- a couple of directories removed that I don't use (e.g.,
- an initial rmd file with the customisations mentioned below in the
- .Rproj RStudio project file to enable easy launching of RStudio.
- An additional output directory for storing tabular, text, and other output
Thus, whenever I want to start a new data analysis project, I can download and extract the zip file of the repository on GitHub.
Thus, after creating a project folder, the following steps can be skipped when using my customised template.
- Open RStudio and create RStudio Project in existing directory
- Create ProjectTemplate folder structure with
- Move ProjectTemplate files into folder
- Setup rmd reports
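For reference, the manual steps listed above could be sketched roughly as follows. This is only an illustration, assuming the ProjectTemplate package is installed. The folder name `new-project` is a placeholder, and the copy step is one hypothetical way of moving the generated files into an existing project folder; it is run in a scratch directory so nothing real is touched.

```r
# A sketch of the manual setup that the customised template replaces.
# Assumes ProjectTemplate is installed; "new-project" is a placeholder name.
setup_ran <- FALSE

if (requireNamespace("ProjectTemplate", quietly = TRUE)) {
  library(ProjectTemplate)

  # Work in a scratch directory that stands in for the real project folder.
  scratch <- tempfile("pt-demo-")
  dir.create(scratch)
  old_wd <- setwd(scratch)

  # Create the default ProjectTemplate folder structure in a subfolder.
  create.project("new-project")

  # Copy the generated files and folders up into the project folder itself.
  generated <- list.files("new-project", all.files = TRUE, no.. = TRUE,
                          full.names = TRUE)
  file.copy(generated, ".", recursive = TRUE)

  print(list.files("."))
  setwd(old_wd)
  setup_ran <- TRUE
}
```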
I also document below a few additional points about subsequent steps including:
- Setting up the data directory
- Updating the readme file
- Setting up git repository
My preferred starting global.dcf settings are
libraries: psych, lattice, Hmisc
A little explanation:
- as_factors: I do quite a bit of string processing, particularly on metadata and on output tables. I find the automatic conversion of strings into factors to be a really annoying feature. Thus, setting this to off is my preferred setting.
- load_libraries: I always have additional libraries so it makes sense to have this
- libraries: There are many common packages that I use, but I almost always make use of the above comma-separated list of packages.
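As a rough illustration of the three settings discussed above, the sketch below writes them to a temporary file in DCF format and reads them back with base R. It is not the full global.dcf: only these three fields are shown, and the value `on` for load_libraries is my assumption. The temporary file stands in for the real config file.

```r
# Hypothetical excerpt of global.dcf containing only the settings discussed.
# "load_libraries: on" is an assumed value; the other two come from the text.
dcf_text <- c(
  "load_libraries: on",
  "libraries: psych, lattice, Hmisc",
  "as_factors: off"
)

# Write to a temporary file that stands in for the real config file.
config_file <- tempfile(fileext = ".dcf")
writeLines(dcf_text, config_file)

# read.dcf() returns a character matrix with one column per field.
config <- read.dcf(config_file)

# Split the comma-separated library list into individual package names.
libs <- strsplit(config[1, "libraries"], ",\\s*")[[1]]
print(libs)
print(config[1, "as_factors"])
```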
Setup rmd files
Basics of such files
I generally create a couple of rmd files in the reports directory (if you're unfamiliar with RMarkdown, see this earlier post on RMarkdown). The first line in the first chunk is always:
This loads everything required to get started with the project.
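The opening line meant here is presumably the usual ProjectTemplate loading call. A minimal sketch, assuming ProjectTemplate is installed and the working directory is the project root; the check for config/global.dcf is just a hypothetical guard so the snippet does nothing harmful when run elsewhere.

```r
# Typical opening lines of a ProjectTemplate script or first Rmd chunk.
# The guard below is illustrative; a real chunk usually has only the two calls.
in_project_root <- file.exists(file.path("config", "global.dcf"))

if (in_project_root && requireNamespace("ProjectTemplate", quietly = TRUE)) {
  library(ProjectTemplate)
  # Reads the configuration, loads the listed libraries and the data,
  # and runs the preprocessing scripts.
  load.project()
} else {
  message("Not in a ProjectTemplate project root; nothing loaded.")
}
```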
RMarkdown in reports
In ProjectTemplate, you would typically store RMarkdown documents in the reports directory. However, if you then try to compile that file in RStudio, you will realise that RStudio treats the directory that contains the RMarkdown file as the working directory. In order to ensure that the working directory is the same as the project directory, add the following text to the top of your RMarkdown file.
- A backtick followed by r, closed by another backtick, delimits inline R code; these general options need to be in this format and not in a standard RMarkdown code chunk.
- opts_knit$set() is the knitr function for setting general (package-level) options.
- '..' sets the working directory to one level higher than the default, i.e., the project directory rather than the reports directory.
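A minimal sketch of the kind of setting these points describe, assuming the knitr package is installed. The option name `root.dir` does not appear in the text above; it is my assumption, based on knitr's documented package options.

```r
# In the Rmd file, this call would be wrapped in inline delimiters
# (backtick r ... backtick) near the top of the document.
if (requireNamespace("knitr", quietly = TRUE)) {
  # root.dir is the knitr option for the directory in which code is evaluated;
  # ".." points one level above the folder holding the Rmd file.
  knitr::opts_knit$set(root.dir = "..")
  print(knitr::opts_knit$get("root.dir"))
}
```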
Setup data folder
ProjectTemplate automatically names resulting data.frames with a name based on the file name. This is convenient. However, it is often the case that the file names need to be changed from some raw data supplied, or it may be that the original data format is not perfectly suited for importing. In that case, I store the raw data in a separate folder called raw-data and then export or create a copy in the desired format with the desired name in the
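The raw-data step could look something like the sketch below. Everything here is illustrative: the file names are made up, a temporary folder stands in for the project, and I assume the cleaned copy goes into ProjectTemplate's data directory.

```r
# Self-contained demo in a temporary folder; in a real project the two
# directories would sit in the project root.
project_dir <- tempfile("project-")
raw_dir  <- file.path(project_dir, "raw-data")
data_dir <- file.path(project_dir, "data")  # assumed target directory
dir.create(raw_dir, recursive = TRUE)
dir.create(data_dir)

# Stand-in for a raw file supplied with an awkward name (hypothetical).
raw_file <- file.path(raw_dir, "Export 2014-03 FINAL.csv")
write.csv(data.frame(id = 1:3, score = c(4.5, 3.0, 5.0)),
          raw_file, row.names = FALSE)

# Read the raw file and save a copy under the name the data frame should have.
survey <- read.csv(raw_file, stringsAsFactors = FALSE)
clean_file <- file.path(data_dir, "survey.csv")
write.csv(survey, clean_file, row.names = FALSE)

print(list.files(data_dir))
```

Keeping the untouched original in raw-data means the renamed copy can always be regenerated.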
Overriding default data import options
Some data files cannot be imported using the default data import rules. Of course, you can change the file to comply with the rules. Alternatively, I think the standard solution is to add a file in the lib directory (e.g., data-override.r) that imports the data files. Give the imported data the same name that ProjectTemplate would have given it.
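A sketch of what such an override file might contain. The file format here (semicolon-separated, decimal commas, a two-line preamble) and the object name `survey` are hypothetical, and a temporary file stands in for the real data file.

```r
# Stand-in for a data file that the default import rules would not handle:
# two preamble lines, semicolon separators, and decimal commas.
odd_file <- tempfile(fileext = ".csv")
writeLines(c("Exported by SurveyTool", "2014-03-01",
             "id;group;score", "1;a;4,5", "2;b;3,0"), odd_file)

# Import with explicit options, assigning the result to the name that
# would otherwise have been derived from the file name.
survey <- read.table(odd_file, header = TRUE, sep = ";", dec = ",",
                     skip = 2, stringsAsFactors = FALSE)
str(survey)
```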
I rename the readme file to README.md to make it clear that it is a markdown-formatted file. I can then add a little information about the project.
Setup git repository
If using GitHub, I create a new repository on GitHub.
A common workflow for me is to generate table, text, and figure output from the script, which is then incorporated into a manuscript document. While I really like Sweave and RMarkdown, I often find it more practical to write a manuscript in Microsoft Word. I use the output folder to store tabular output, standard text output, and figures.
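As an illustration of the three kinds of output, here is a short sketch using a built-in dataset. The file names are made up, and a temporary folder stands in for the project's output directory.

```r
# In a real project this would simply be "output" in the project root.
output_dir <- file.path(tempdir(), "output")
dir.create(output_dir, showWarnings = FALSE)

# Tabular output: means and standard deviations for a few variables.
vars <- mtcars[, c("mpg", "hp", "wt")]
descriptives <- data.frame(variable = names(vars),
                           mean = sapply(vars, mean),
                           sd = sapply(vars, sd),
                           row.names = NULL)
write.csv(descriptives, file.path(output_dir, "descriptives.csv"),
          row.names = FALSE)

# Text output: save the printed model summary as plain text.
fit <- lm(mpg ~ wt, data = mtcars)
writeLines(capture.output(summary(fit)),
           file.path(output_dir, "regression.txt"))

# Figure output: write a scatterplot to a PDF file.
pdf(file.path(output_dir, "mpg-by-wt.pdf"))
plot(mpg ~ wt, data = mtcars)
dev.off()
```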
In the case of tabular output, there is the task of ensuring the table is formatted appropriately (e.g., desired number of decimal places, cell alignment, cell borders, font, cell merging, etc.). I typically find this easiest to do in Excel. Thus, I have a file called output-processing.xlsx. I import the tabular data into this file and apply relevant formatting. This can then be incorporated into the manuscript. Here are a few more notes about table conversion in MS Word.