Project management for scalable data analysis

August 30, 2017

(This article was first published on R on francojc ⟲, and kindly contributed to R-bloggers)

Project management

This post can be seen as an extension of the last post, Getting started with R and RStudio, in that we will get to know some more advanced, but indispensable, features of RStudio. These features, in combination with some organizational and programming strategies, will enable us to conduct efficient data analysis and set the stage for research that is both scalable and ready for sharing with either collaborators or the research community.

To understand the value of this approach to project management we need to get a bird’s eye view of the key steps in a data science project. There are three main areas that any research project includes: data organization, data analysis, and reporting results.

These three main areas have important subareas as well:

  1. Data organization:
  • acquiring data, whether from a database, a download, or a web scrape
  • curating data so that it will be reliable for subsequent steps in the process
  • transforming the data into a format that will facilitate the analysis, including the generation of new variables and measures
  2. Data analysis:
  • visualizing data in summary tables and graphics to gain a better sense of how the data bearing on our question are distributed
  • performing statistical tests to provide confirmation of the distribution(s) that we are investigating and/or a more robust understanding of the relationships between variables in our data
  3. Reporting results:
  • communicating findings in an appropriate format for the venue, whether a standard article, slides, or a webpage
  • preparing reproducible results by ensuring that our work is well documented and capable of being replicated

By taking these steps into account in the organization of our project, we will be able to work more efficiently and effectively with R and RStudio. In the next section we will get set up with a model template for organizing a data science project. This structure will serve as our base for working in subsequent posts in this series.

Project structure

As a starting point, we will download an existing project that I have created. This project will work as a template for implementing good project management with RStudio. The project template is housed remotely on the code sharing platform GitHub which leverages the git versioning software to make local projects remotely accessible. After copying the project to your local machine, we will link RStudio to the project and continue our exploration of the various features available in this software to manage projects.

Downloading the template

To download this project template, we will need to set up git on your machine first. I will have much more to say about git and GitHub in a later post in the series that will directly concern the process of versioning and sharing reproducible research, but for now we’ll only cover what is needed to get our project template from GitHub. If you are on a Mac, git is most likely already on your machine. To verify that this is the case you can open up the ‘Terminal.app’ on your machine and type this command at the prompt:

which git

The ‘Terminal.app’ on Mac, ‘Terminal’ on Linux, and ‘git bash’ on Windows all provide access to what is called the ‘Shell’, or command-line interface. This is an interface to your computer, much as the Console is an interface to R in the R GUI Application or RStudio. There are various shell environments, each with particular syntax conventions for working with your computer through this interface, the most common being the ‘Bash’ shell. I encourage you to learn some of the basics of using the command line.

If git is already installed, then a path to this software, similar to the one below, will be returned.

/usr/local/bin/git

If a path is not returned, or you are on a PC, then you will need to download and install the software from the git homepage. Follow this link to the git downloads page.


Figure 1: Git downloads page.

Download the installer for your operating system and run it, accepting the default installation recommendations.

Now that you have git on your machine, let’s set some global options that will personalize your software while we are already working at the command line. If you are on a Mac or Linux, open the ‘Terminal’. If you are on a Windows machine, navigate to the programs menu and open ‘git bash’.

At the terminal, enter the following commands, replacing ‘Your Name’ and ‘you@example.com’ with your information.

git config --global user.name 'Your Name'
git config --global user.email 'you@example.com'

Next, let’s download the project, ‘recipes-project_template’, from my personal GitHub repository. To do this, you will first want to decide where on your machine you would like to store this project. If you are following the Recipe series, I recommend that you create a new directory called Recipes/ somewhere convenient on your hard disk. You can then use this directory to house this and other upcoming projects associated with posts in this series.

To create this directory, I’ll use the mkdir command, or ‘make directory’. The ~ is a shortcut for the current user’s home directory. On my Mac the full path would be /Users/francojc/Documents/Recipes/.

mkdir ~/Documents/Recipes/

Quick tip: When typing a path at the command line you can start typing a directory name and hit the tab key on your keyboard to autofill the full name. If you hit tab twice in a row, the bash shell will list the available subdirectory paths. This can speed up navigation at the command line and help avoid typographical errors.

Once we have created the main directory Recipes/ to house the repository, we need to navigate to that directory using cd, or ‘change directory’.

cd ~/Documents/Recipes/

Verify that your current working directory is correct by entering pwd, or ‘print working directory’, to get the current directory’s path.

pwd

The result should print the path to your Recipes/ directory.

Now we are ready to use git to copy the remote project from my GitHub repository recipes-project_template to our current directory. Enter the following command at the terminal prompt.

git clone https://github.com/francojc/recipes-project_template.git

You should get some information about the download, or clone, that looks something similar to the output below.

Cloning into 'recipes-project_template'...
remote: Counting objects: 48, done.
remote: Compressing objects: 100% (27/27), done.
remote: Total 48 (delta 21), reused 38 (delta 15), pack-reused 0
Unpacking objects: 100% (48/48), done.

Now cd into the recipes-project_template/ directory that you just cloned into your Recipes/ directory.

You can now inspect the new directory, subdirectories, and files that reside in the recipes-project_template/ directory either at the command line or with your operating system’s file explorer. At the command line you can use the ls, or ‘list’, command like so:

ls

You should see the following output.

README.md   code        figures     log
_pipeline.R data        functions   report

We can now leave the Terminal, or git bash, and return to RStudio. We are now ready to link an R project to our cloned project.
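For readers who prefer to stay in the R Console, the shell commands used above have close counterparts in base R. The following is only a rough sketch: it works in a temporary directory so that nothing on your disk is touched, and the name recipes_dir is a stand-in introduced here for illustration.

```r
# Stand-in for ~/Documents/Recipes/, created in a temporary location
recipes_dir <- file.path(tempdir(), "Recipes")

dir.create(recipes_dir, showWarnings = FALSE)  # counterpart of mkdir
path.expand("~")                               # shows what ~ expands to
getwd()                                        # counterpart of pwd
list.files(recipes_dir)                        # counterpart of ls
dir.exists(recipes_dir)                        # check that the directory is there
```

Cloning the repository itself is still done with git at the command line, as shown earlier.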

Creating an R Project within RStudio

As we have seen, RStudio provides a host of tools for facilitating work with R. A feature that makes working with the sometimes numerous data files, scripts, and other resources more manageable is the ‘R Project’ tool. In a nutshell, this RStudio tool allows us to select a directory where our project files live and effectively group these files and the work we do as a unit. At a basic level it simply helps manage individual projects more easily. As we move on to other posts in this series, and particularly when we discuss creating reproducible research, we will see that this feature will really prove its worth. For now, let’s make the project template an R project and turn to focus on the file and directory structure as it relates to doing efficient and reproducible data analysis in R.

To link our template to an R Project, start up RStudio and select the ‘New Project…’ dialogue from the RStudio toolbar menu. You will be presented with various options as seen in Figure 2.


Figure 2: Options for creating an R Project in RStudio.

The first option is for starting a project from scratch. The last option is for cloning a project from a versioning repository like GitHub. With versioning software, like git, on our machine, we can clone and create an R Project in one step. We will make use of this option in future posts now that we have a basic understanding of git and GitHub. For now, however, we want to link the project template we cloned manually using the command line, so select ‘Existing Directory’ from this menu.

Next, navigate to the directory that we cloned, either by typing the path to the directory or, more conveniently, by using the ‘Browse’ button. Once we have selected the directory and created the R Project, RStudio will open a new session with our directories and files listed in the Files pane.


Figure 3: View of the project template as an R Project.

You will notice that RStudio has created a file named recipes-project_template.Rproj. From now on you can navigate to this file using your operating system’s file explorer and open it to return to working on this project. Your workspace settings, history, environment variables, etc. will be restored to the state they were in the last time you were working on the project (more on this later).

Scaffolding for a scalable project

Now let’s turn to the files and directories of our project template and discuss how this structure is associated with the steps listed earlier for conducting a data science project. Below you will see the complete structure of the template.

├── README.md
├── _pipeline.R
├── code/
│   ├── acquire_data.R
│   ├── analyze_data.R
│   ├── curate_data.R
│   ├── generate_reports.R
│   └── transform_data.R
├── data/
│   ├── derived/
│   └── original/
├── figures/
├── functions/
├── log/
├── recipes-project_template.Rproj
└── report/
    ├── article.Rmd
    ├── bibliography.bib
    ├── slides.Rmd
    └── web.Rmd

Directories

This template includes directories for data (data), code (code), and communicating findings (report). These directories are core to your project and where the heavy lifting takes place. The data directory has important subdirectories that separate key stages in your analysis. data/original is where the data in its raw form will be stored and data/derived is where any changes you make to the data for the particular analysis are stored. This distinction is an important one; to safeguard our analysis and to ensure that our analysis is reproducible we do not want to make changes to the original data that are not reflected in the project itself. The files in report separate the potential formats that we may use to communicate insight generated from this analysis.

Before moving on to discuss the files included in the template, let’s discuss the other three supporting directories: figures, log, and functions. You will most likely generate figures in the course of the analysis. Grouping them together in the figures directory enables us to quickly reference them visually and also include them in any one or all of the reports that may be generated. The log directory is a convenient and easily identifiable place to document meta aspects of your analysis that may not figure in your reports. Finally, the functions directory is provided to house custom functions you may write to facilitate particular stages of the analysis. We will soon see how powerful and indispensable custom functions are, but for now just let me say that keeping them organized in a separate directory will enhance the legibility of your code and help you take full advantage of their power.
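If you ever want to start a project with this layout without cloning the template, the directory skeleton can be recreated from R. What follows is a minimal sketch rather than part of the template itself; project_root is a hypothetical path inside a temporary directory, used here so the example can be run safely.

```r
# Hypothetical project location, placed in a temporary directory for illustration
project_root <- file.path(tempdir(), "my_project")

# Directory names taken from the template structure shown above
project_dirs <- c(
  "code",
  "data/original",
  "data/derived",
  "figures",
  "functions",
  "log",
  "report"
)

for (d in file.path(project_root, project_dirs)) {
  # recursive = TRUE also creates parent directories such as data/
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}

# List what was created, relative to the project root
created <- list.dirs(project_root, full.names = FALSE, recursive = TRUE)
print(sort(created[created != ""]))
```

Replacing the temporary path with a location of your choice would give you an empty scaffold to which the script and report files can then be added.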

Files

This template also includes various file templates that are associated with the tasks typically performed in a data analysis project. The R scripts in the code/ directory are script templates to carry out the sub-tasks of our three main project steps. For data organization: get the original data (acquire_data.R), clean and prepare key features of the data (curate_data.R), and manipulate the data, creating the needed variables and structure for the data analysis (transform_data.R). For data analysis: visualize and perform statistical analyses on the data (analyze_data.R). For reporting results: communicate findings in appropriate formats (generate_reports.R).

Each of these R scripts has a common structure which is outlined using code commenting. Take a look at the structure of the acquire_data.R script, copied below:

# ABOUT -----------------------------------------------------------

# Description: 
# Usage: 
# Author: 
# Date: 

# SETUP -----------------------------------------------------------

# Script-specific options or packages

# RUN -------------------------------------------------------------

# Steps involved in acquiring and organizing the original data

# LOG -------------------------------------------------------------

# Any descriptives that will be helpful to understand the results of this
# script and how it contributes to the aims of the project

# CLEAN UP --------------------------------------------------------

# Remove all current environment variables
rm(list = ls())

The template shown here leverages code commenting to separate the script into meaningful tasks. The ABOUT section is where you will provide an overview of the purpose of the script in your project, how to use it, who created it, and when it was created. The SETUP section provides a space to load any required packages, source any required custom functions, and configure various other options. RUN is where the bulk of your code will be entered. As has been stated various times, commenting is a vital part of sound coding practices. It is often helpful not only to comment a particular line of code, which we do by adding the # symbol to the immediate right of the code and then describing the task, but also to group coding sub-tasks in this section.

RStudio provides a tool to create comment sections. You can use this tool by selecting ‘Code > Insert Section…’ or the keyboard shortcut shift + command + R (Mac) or shift + ctrl + R (PC). Either approach will trigger a dialogue box to enter the name of the section. Once you have entered the name it will then appear in the section listing, as seen in Figure 4.


Figure 4: View of the section listing in RStudio.

You can use this listing to skip from section to section which can be very helpful as your script becomes more complicated with subsequent code.

The last two sections, LOG and CLEAN UP, are good-housekeeping sections. LOG is where you can divert any meta-information about the analysis to the log/ directory. This is an important step to take at this point, as the last section, CLEAN UP, is where you will remove any of the objects your script has created with the rm(list = ls()) command.
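As a rough illustration of what a LOG section might contain, the sketch below writes a few descriptive lines to a text file before any clean up takes place. The object acquired, the log file name, and the use of a temporary directory in place of the project’s log/ directory are all assumptions made for the sake of a runnable example.

```r
# Stand-in for the project's log/ directory
log_dir <- file.path(tempdir(), "log")
dir.create(log_dir, showWarnings = FALSE)

# Stand-in for a data set produced in the RUN section
acquired <- data.frame(id = 1:3, text = c("a", "b", "c"))

# LOG: record a few descriptives while the objects are still in memory
log_lines <- c(
  "Script: acquire_data.R",
  paste("Run on:", format(Sys.time(), "%Y-%m-%d %H:%M:%S")),
  paste("Rows acquired:", nrow(acquired)),
  paste("Columns:", paste(names(acquired), collapse = ", "))
)
log_file <- file.path(log_dir, "acquire_data_log.txt")
writeLines(log_lines, log_file)
```

In a real script the CLEAN UP section would follow, which is why the descriptives need to be written out first.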

Although removing the objects a script creates is not strictly required, it has two key benefits: it releases the memory that these objects claim in our R session, and it helps keep each script as modular as possible. Freeing up memory after an object is no longer needed is good practice, as memory handling in R is not one of the language’s strong points. Striving for modularity speaks more to reproducibility and analysis workflow. If a subsequent step relies on an object generated by a previous script that is held in memory, we must run the previous script in the workflow each time before the next. As the number of tasks increases and as these tasks become more processing intensive, this will lead to an unnecessary loss of computing resources and time. To avoid this scenario, each script in our analysis should only require input that is read from a session-external source; that is, from a resource online or from the hard disk. This means that if an object created in a script will be needed at some point later in our analysis, it should be written to disk, preferably in a plain-text version (see footnote 1).
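The hand-off between scripts described above can be sketched in a few lines. This is an illustrative example only: the data frame word_counts, the file name word_counts.csv, and the temporary directory standing in for data/derived/ are hypothetical.

```r
# Stand-in for the project's data/derived/ directory
data_dir <- file.path(tempdir(), "data", "derived")
dir.create(data_dir, recursive = TRUE, showWarnings = FALSE)
out_file <- file.path(data_dir, "word_counts.csv")

# --- End of an upstream script (e.g. transform_data.R) ---
word_counts <- data.frame(word = c("the", "of"), n = c(120L, 75L))
write.csv(word_counts, out_file, row.names = FALSE)  # plain-text copy on disk
rm(word_counts)                                      # object no longer in memory

# --- Start of a downstream script (e.g. analyze_data.R) ---
# The only input is the file on disk, so this step can run on its own
word_counts <- read.csv(out_file, stringsAsFactors = FALSE)
str(word_counts)
```

Because the downstream step depends only on the file, it can be rerun without repeating the upstream work, as long as the file is up to date.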

The last of the three main steps in a data analysis project, ‘Reporting results’, is associated with the file generate_reports.R in the code/ directory, which is tied to the various files in the report/ directory. These latter files have the extension .Rmd, not .R. This distinction reflects the fact that these files are a special type of R script: an RMarkdown script. RMarkdown is a variant of the popular markup language Markdown. RMarkdown documents go beyond standard Markdown documents in that they allow for the intermingling of code, prose, and graphics to dynamically create reports in article format, as a PDF or web document (report/article.Rmd), presentation slides (report/slides.Rmd), and interactive web pages (report/web.Rmd). Data and figures generated by the R scripts in your analysis can be included in these documents along with citations and a corresponding bibliography sourced from a BibTeX file, report/bibliography.bib (see footnote 2). Together these features provide a powerful tool belt for creating publication-quality reports. The generate_reports.R file simply runs the commands to render these files in their specific formats.
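To give a sense of what rendering involves, here is a hedged sketch of a helper that renders every .Rmd file in a directory. It is not the code in the template’s generate_reports.R; render_reports() is a hypothetical function written for this illustration, and it assumes the rmarkdown package (and its dependency, pandoc) is installed if any files are to be rendered.

```r
# Hypothetical helper: render all RMarkdown files found in `report_dir`
render_reports <- function(report_dir = "report") {
  rmd_files <- list.files(report_dir, pattern = "\\.Rmd$", full.names = TRUE)

  if (length(rmd_files) == 0) {
    message("No .Rmd files found in ", report_dir)
    return(invisible(character(0)))
  }
  if (!requireNamespace("rmarkdown", quietly = TRUE)) {
    message("The rmarkdown package is not installed; nothing was rendered.")
    return(invisible(character(0)))
  }

  # render() uses the output format declared in each file's YAML header
  # and returns the path of the file it produced
  outputs <- vapply(rmd_files, rmarkdown::render, character(1), quiet = TRUE)
  invisible(unname(outputs))
}
```

Calling render_reports("report") from the project root would then attempt to produce one output file per RMarkdown document.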

I have provided rough outlines for each of these RMarkdown output formats. We will explore the details of creating reports later in the series. But for now, I encourage you to browse the RMarkdown gallery and explore the documentation to get a sense of what RMarkdown can do.

The last two files in this template are the _pipeline.R script and the README.md document. README.md is the file used to describe the project in general terms, including the purpose of the project, data sources, analysis methods, and other relevant information to help clarify how to reproduce the research with these project files. The README file may or may not include the .md extension. If it does, as in the example in this template, you will have access to the Markdown syntax options to provide word-processing-style formatting, if needed. If you end up storing your project on a code repository site, such as GitHub, this file will be rendered as a web document and be used as the introduction to your project.

The _pipeline.R script is the master script for your analysis. It is a standard R script and includes the same internal commenting sections as the other .R scripts in the code/ directory (i.e. ABOUT, SETUP, RUN, LOG, and CLEAN UP). This script, however, allows you to run the entire project from data to report in sequential order. In the RUN section you will find sub-steps which call the source() function on each of our processing scripts. Since each script representing a step in our analysis is modular, only the required input is read and output generated for each script. As logging step-specific information is taken care of in each particular script, the LOG section in the _pipeline.R script will most typically only include a call to the sessionInfo() function, which reports details on the operating system, the packages, and the versions of the packages used in the analysis. This information is vital for reproducing research as it documents the specific conditions that successfully generated the analysis.
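The general shape of such a master script can be sketched as follows. This is an assumption-laden illustration and not the contents of the template’s _pipeline.R: it builds two placeholder step scripts in a temporary directory so that the example runs on its own, then sources them in order and logs the session information.

```r
# Hypothetical project location used only for this illustration
project_root <- file.path(tempdir(), "pipeline_demo")
dir.create(file.path(project_root, "code"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(project_root, "log"), showWarnings = FALSE)

# Placeholder step scripts; each one just announces that it is running
steps <- c("acquire_data.R", "curate_data.R")
for (s in steps) {
  writeLines(sprintf('message("Running %s")', s),
             file.path(project_root, "code", s))
}

# RUN: source each step in sequential order
for (s in steps) {
  source(file.path(project_root, "code", s))
}

# LOG: record the R version, operating system, and package versions
session_file <- file.path(project_root, "log", "session_info.txt")
writeLines(capture.output(sessionInfo()), session_file)
```

In an actual project the placeholder scripts would be the real processing scripts in code/, and the paths would be relative to the project root.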

Quick note: there is nothing special about the names of the files in the template. You can edit and modify these file names as you see fit. You should, however, take note of good file naming practices. Names should be descriptive and short. Whitespace is traditionally avoided, but is not explicitly required. I have employed ‘snake case’ here by using an underscore (_) to mark spaces in file names. There are various style conventions used to avoid whitespace and for other coding practices. I recommend following the suggestions provided by Hadley Wickham in his book ‘Advanced R’ (Wickham 2014). Whatever style you choose, the most important thing is to be consistent.

R Project sessions

Once the files and directories are linked to an R Project your workspace settings, command history, and objects can be saved at any point and restored to continue working on the analysis. To see this in action, let’s do a little work with this project template. Let’s run the _pipeline.R file. There isn’t much to our analysis at this point, as it is just boilerplate material for the most part, but it will serve to highlight how to work with an R Project session. To run this file, open it from the Files pane. It will appear in the Editor pane where we can now use the keyboard shortcut option + command + R (Mac) or alt + ctrl + R (PC) to run the entire master script.

Once you have run the _pipeline.R script, some new files will appear in your directory structure, seen below.

├── README.md
├── _pipeline.R
├── code/
│   ├── acquire_data.R
│   ├── analyze_data.R
│   ├── curate_data.R
│   ├── generate_reports.R
│   └── transform_data.R
├── data/
│   ├── derived/
│   └── original/
├── figures/
├── functions/
├── log/
│   └── session_info.txt
├── recipes-project_template.Rproj
└── report/
    ├── article.Rmd
    ├── article.html
    ├── bibliography.bib
    ├── slides.Rmd
    ├── slides.html
    ├── web.Rmd
    └── web.html

The new files appear in the log/ and report/ directories. The session_info.txt file is our log of the session information. The article.html, slides.html, and web.html files are the rendered versions of the RMarkdown templates. If you browse to the History tab in the Environment pane you will see that we have one line in our history: the code that ran the _pipeline.R file.

The article.Rmd file is set to render a web document by default in this template. If you want to render a PDF document, you will need to have a working LaTeX installation. For those who would like to set up PDF rendering of RMarkdown documents, here are instructions on how to set up LaTeX.

Let’s quit our R session now by closing RStudio. When prompted, choose ‘Save’ from the ‘Quit R session’ dialogue box. Now reopen our R project by either starting RStudio then choosing ‘File > Recent Projects’ in the RStudio toolbar or by navigating to the recipes-project_template.Rproj file with your operating system’s file explorer and double-clicking it.

Choosing to save our project before closing RStudio has the effect of taking a snapshot of the current workspace. The files that were open on closing the session are returned to the workspace. Any variables we had in memory and the command history are also returned. The details of this snapshot are stored in the files you will now find at the root of our project directory: .RData and .Rhistory.


Figure 5: View of the R project snapshot files .RData and .Rhistory.

You might be wondering why these files are prefixed with a period (.). Using a period before file names is not specific to RStudio. It is a convention, originating on Unix-like systems, that hides a file from file explorers and directory listings by default. These types of files are often used for configuration and application resources and are not meant to be edited by the average user.

If you choose not to save the workspace when quitting RStudio, these files will not be created if they do not already exist, and they will not be overwritten by the current session if they do.
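The mechanism behind the workspace snapshot can be tried out by hand. The sketch below is an illustration under stated assumptions: it uses save() and load() on a temporary file named demo.RData rather than the project’s own .RData, and the objects x and results are made up for the example. Choosing ‘Save’ on quitting does something comparable for every object in the workspace.

```r
# Temporary stand-in for the project's .RData file
snapshot_file <- file.path(tempdir(), "demo.RData")

# Two objects that would normally be the product of our analysis
x <- 1:10
results <- summary(x)

# Write the named objects to disk, then remove them from memory
save(x, results, file = snapshot_file)
rm(x, results)

# Restore the objects into a separate environment to inspect them safely
restored <- new.env()
loaded_names <- load(snapshot_file, envir = restored)
print(loaded_names)
```

Loading into a separate environment is a cautious choice made here so that restored objects cannot silently overwrite anything in the current session.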

Round up

In this post we have discussed setting up and managing an R Project in RStudio. Along the way I provided a sample template for structuring your data analysis project based on the common steps in data science research. You have seen how this template is associated with each step and learned about some important conventions and guidelines for maintaining an efficient workflow. These principles are fundamental to creating an internally consistent and reproducible project.

Later on in the series we will discuss project versioning and packaging to make a project fully reproducible for you and future collaborators. We will leave that discussion for now and turn our attention, in the next post in the series, to one of the main conceptual underpinnings of quantitative research: statistical thinking.

References

Wickham, H. 2014. Advanced R. CRC Press.


  1. Plain text files, in essence, are the lingua franca of the computing world. They are the type of files that can be readily accessed through a plain-text editor such as ‘TextEdit’ (Mac) or ‘Notepad’ (PC). Importantly, these files are neither compiled by nor bound to any particular software, unlike a document generated by Word or Excel. We will see how to write objects to disk as plain text files in subsequent posts.

  2. You can write your own BibTeX file by hand or generate one using bibliographic management software such as Mendeley or Zotero.
